effekt-lang / effekt

A language with lexical effect handlers and lightweight effect polymorphism
https://effekt-lang.org
MIT License

Unicode escapes in the MLton backend #544

Closed jiribenes closed 2 months ago

jiribenes commented 3 months ago

Moved from #542

Motivation

When using characters like "✓" and "✕", MLton reports:

Error: /home/runner/work/effekt/effekt/out/tests/effekt.mltests/lib_test.sml 391.69-391.76.
  String constant with character too large for type: #"\u2715".
    type: string
Error: /home/runner/work/effekt/effekt/out/tests/effekt.mltests/lib_test.sml 561.67-561.74.
  String constant with character too large for type: #"\u2713".
    type: string

Investigation

Here's the source for MLton's lexer, which indicates support for \[0-9]{3}, \u[0-9A-F]{4} and \U[0-9A-F]{8} escapes: https://github.com/MLton/mlton/blob/680bfcc6d6d8df3e51220fd88d297830316b89b4/mlton/front-end/ml.lex#L446-L457. There are no real docs for it, though. The only thing I found suggests that multi-byte escapes should be written out as single bytes (and that this is locked behind a flag), see http://www.mlton.org/SuccessorML#ExtendedTextConsts

The error itself is defined here: https://github.com/MLton/mlton/blob/680bfcc6d6d8df3e51220fd88d297830316b89b4/mlton/elaborate/elaborate-core.fun#L451-L464, and I think this indicates that we should:

  1. either use a different string type on MLton which supports these large escapes (WideString is UTF-32 [?] as far as I understand http://mlton.org/Unicode)
  2. or escape them byte-by-byte with the \[0-9]{3}-style syntax.

Solution

I think keeping strings UTF-8(-ish) is worth it, so I'd prefer solution 2, even though it's slightly more work on our part.

Here's the code that needs to change: https://github.com/effekt-lang/effekt/blob/08fc8fdee26d420c05823862ee008092582198dd/effekt/shared/src/main/scala/effekt/generator/ml/Transformer.scala#L637-L638

I think that something like c.toString.getBytes("UTF-8") could be useful here: it gives us a sequence of bytes, each of which can then be emitted as its own \[0-9]{3} escape (three decimal digits per byte).
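
To make the idea concrete, here's a rough sketch of what the byte-by-byte escaping could look like. Note that MLStringEscape and escape are hypothetical names I made up for illustration, this is not the code currently in Transformer.scala, and it assumes that printable ASCII can be passed through unchanged while everything else (including control characters) becomes a decimal \ddd escape:

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper for illustration only; not the actual Transformer.scala code.
object MLStringEscape {

  /** Escapes `s` for use inside an SML string literal, one UTF-8 byte at a time. */
  def escape(s: String): String = {
    val sb = new StringBuilder
    for (b <- s.getBytes(StandardCharsets.UTF_8)) {
      val u = b & 0xff // JVM bytes are signed; recover the value in 0..255
      if (u == 0x22) sb.append("\\\"")      // double quote
      else if (u == 0x5c) sb.append("\\\\") // backslash
      else if (u >= 0x20 && u < 0x7f) sb.append(u.toChar) // printable ASCII as-is
      else sb.append('\\').append("%03d".format(u)) // \ddd, three decimal digits
    }
    sb.toString
  }
}
```

With this, "✓" (U+2713, UTF-8 bytes E2 9C 93) would come out as \226\156\147 and "✕" (U+2715, bytes E2 9C 95) as \226\156\149, so every character in the generated literal fits into MLton's 8-bit string type.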

Testing

It would be very valuable to have a few more tests for this behaviour and to check that such characters work on all of the different backends.

jiribenes commented 2 months ago

MLton has been deprecated as of #616