crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0

UTF-16 string literals #14670

Closed straight-shoota closed 3 months ago

straight-shoota commented 3 months ago

When working with Windows APIs, it's common that we need UTF-16 strings (instead of Crystal's String which is UTF-8). String#to_utf16 is available for conversion.

But most uses of this method in stdlib are actually on string literals (e.g. "Content Type".to_utf16). This is wasteful because the conversion happens at runtime, while it could happen entirely at compile time, avoiding extra computation and allocation.
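
For reference, a minimal sketch of the runtime conversion described above, using only String#to_utf16 and String.from_utf16 from the standard library (the variable name is illustrative):

```crystal
# Runtime conversion of a string literal to UTF-16.
utf16 = "Content Type".to_utf16 # allocates a new Slice(UInt16) on every call

puts utf16.size # => 12 (one code unit per character here)

# The code unit just past the end of the slice is zero, so the pointer can be
# handed to C functions expecting a null-terminated UInt16*.
puts utf16.to_unsafe[utf16.size] # => 0

# Round trip back to a regular (UTF-8) String.
puts String.from_utf16(utf16) # => Content Type
```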

A particularly intricate use case is in #14659, where we must not allocate at all. So it ends up with this mechanism to achieve compile-time conversion: UInt16.static_array({% for chr in "CRYSTAL_TRACE".chars %}{{chr.ord}}, {% end %} 0).

This certainly works, at least for this limited use case. But it fails for code points outside the Basic Multilingual Plane. So it's not a generic solution.
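
To illustrate the limitation: a code point outside the Basic Multilingual Plane does not fit into a single 16-bit value, so emitting one chr.ord per character cannot produce valid UTF-16 for it. A small sketch (the emoji is just an example input):

```crystal
char = '😐' # U+1F610, outside the Basic Multilingual Plane

# The code point is larger than any value a single UInt16 can hold ...
puts char.ord               # => 128528
puts char.ord > UInt16::MAX # => true

# ... so UTF-16 represents it with two code units (a surrogate pair).
puts char.to_s.to_utf16.size # => 2
```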

It would be nice if we had an easy tool for creating UTF-16 encoded strings.

Maybe the conversion algorithm from String#to_utf16 could be implemented as a macro method? It's a bit complex, but not too much. I don't think we can explicitly do math operations on 16-bit integers in the macro language, though.

An alternative would be to expose a compiler primitive for UTF-16 conversion.

Related: #2886

BlobCodes commented 3 months ago

> Maybe the conversion algorithm from String#to_utf16 could be implemented as a macro method? It's a bit complex, but not too much. I don't think we can explicitly do math operations on 16-bit integers in the macro language, though.

Simply porting #14671 works fine; explicit math operations on 16-bit integers are not needed.

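class String
  # Encodes the string literal *data* as UTF-16 at compile time and expands
  # to a null-terminated `Slice(UInt16)` literal (the trailing 0 is not
  # counted in the slice's size).
  macro utf16_literal(data)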
    {%
      arr = [] of NumberLiteral
      data.chars.each do |c|
        c = c.ord
        if c < 0x1_0000
          arr << c
        else
          c -= 0x1_0000
          arr << 0xd800 + ((c >> 10) & 0x3ff)
          arr << 0xdc00 + (c & 0x3ff)
        end
      end
      arr << 0
    %}
    Slice(UInt16).literal({{arr.splat}})[0, {{arr.size - 1}}]
  end
end

s = String.utf16_literal("TEST 😐🐙 ±∀ の")
# => Slice[84, 69, 83, 84, 32, 55357, 56848, 55357, 56345, 32, 177, 8704, 32, 12398]

String.from_utf16(s)
# => "TEST 😐🐙 ±∀ の"
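
As a worked example of the surrogate-pair arithmetic in the macro above, here is the same calculation for a single character in ordinary runtime Crystal. This is only an illustrative sketch; the variable names are not from the original comment.

```crystal
code_point = '😐'.ord # 0x1F610, above the Basic Multilingual Plane

offset = code_point - 0x1_0000
high = 0xd800 + ((offset >> 10) & 0x3ff) # upper 10 bits
low = 0xdc00 + (offset & 0x3ff)          # lower 10 bits

puts high # => 55357
puts low  # => 56848

# Cross-check against the runtime conversion in the standard library.
puts "😐".to_utf16.to_a.map(&.to_i) == [high, low] # => true
```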

Encoding 10000 characters takes around 300ms. That's certainly not fast, but probably good enough.

EDIT: Added a trailing 0 code unit (null terminator)

straight-shoota commented 3 months ago

Looks like a winner, then 🚀

> That's certainly not fast, but probably good enough

Yeah, this is mainly for relatively short strings, so performance should not be an issue. We can always push it up into the compiler if the need arises.

Btw. CharLiteral#ord was only added in 1.11 (#13910), so this wouldn't have been possible before.

straight-shoota commented 3 months ago

In order to make it actually static data, we'd also need a slice literal (#2886).

BlobCodes commented 3 months ago

The version from my comment uses the literals from #13716, so it is static data in this case. Although it is still experimental API.

stakach commented 3 months ago

Worth noting that Windows supports UTF-8 now and encourages use of those APIs:

https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page#-a-vs--w-apis

So the conversions could be avoided entirely.

straight-shoota commented 3 months ago

> So the conversions could be avoided entirely

Would be nice. But I believe we're quite a bit away from that. The Windows ecosystem is huge and it has 30 years of wide chars in it.

ysbaddaden commented 3 months ago

@straight-shoota this is reusing the "old" ANSI API to use the UTF-8 codepage, so it might just work 🤷

It took me a while to find this: the page linked above explains how to set the Active Code Page (ACP) to UTF-8, which requires a manifest and running an EXE to "add the manifest" to an executable. The ANSI variants of the Windows API will then use UTF-8 in that executable.

That being said, it requires Windows 10 v1903 (2019) and GDI applications won't support it unless the user activates a beta setting.

ysbaddaden commented 3 months ago

The macro is nice, but if we want to eventually have the compiler optimize it, maybe we could just expose String#to_utf16 to macros directly? For example, {{ "CRYSTAL_TRACE".to_utf16 }} would be lovely & fast.

straight-shoota commented 3 months ago

Hm, that's an interesting idea. Exposing StringLiteral#to_utf16 would certainly have the benefit that you have the resulting literal easily available in macro land. I like that it's exactly identical to the runtime version, but in a macro expansion which makes it clear that this happens at compile time.

FTR: Eventual compiler optimization would be possible with String.utf16_literal as well. We could turn this macro into a primitive later.

Let's focus on UTF-16 string literals here and continue the discussion about UTF-8 support on Win32 in a different issue. I'm pretty sure we won't lose all use cases for UTF-16 string literals overnight, so this will still be useful.

ysbaddaden commented 3 months ago

The difficulty with implementing StringLiteral#to_utf16 is that there is no SliceLiteral, and we would need to generate a Slice(UInt16).literal(..., 0); I have no idea how to achieve that.

BlobCodes commented 3 months ago

> The difficulty with implementing StringLiteral#to_utf16 is that there is no SliceLiteral, and we would need to generate a Slice(UInt16).literal(..., 0); I have no idea how to achieve that.

It could return ArrayLiteral(NumberLiteral) (or Call(@receiver=Generic(@name=Slice, @type_vars=[UInt16]), @name="literal", @args=[0, 1, 2, 3, 4, 5, ...]))


Btw I just tested the performance of my macro code a bit more. Simply replacing the line {{ arr.splat }} with {% arr.splat %} 0 (so the resulting splat is not parsed) improves the runtime of encoding 10000 characters from ~300ms to ~20ms.

The macro language actually isn't that slow - the parser is.

Implementing StringLiteral#to_utf16 wouldn't improve performance in a perceivable manner since it would only remove <10% of the runtime.

Maybe there should be a way to create AST nodes directly inside the macro language, so we don't have to parse everything again.

stakach commented 3 months ago

> GDI applications won't support it unless the user activates a beta setting.

You can activate the code pages in code; this is how applications like the MS Edge browser run. MS Edge is a React Native app, so it runs on JS and UTF-8 (although Microsoft is removing React).

straight-shoota commented 3 months ago

Do we want to proceed with StringLiteral#to_utf16 then? I think it's more elegant than String.utf16_literal (https://github.com/crystal-lang/crystal/issues/14670#issuecomment-2154721723). And apparently more performant (https://github.com/crystal-lang/crystal/pull/14676#issuecomment-2156706765).

bcardiff commented 3 months ago

I like StringLiteral#to_utf16, and if doing that means we end up with a SliceLiteral, even one without first-class syntax yet, it would be a double win, because embedding resources could then leverage a similar StringLiteral#to_slice at compile time.