When using characters like "✓" and "✕", MLton reports:
Error: /home/runner/work/effekt/effekt/out/tests/effekt.mltests/lib_test.sml 391.69-391.76.
String constant with character too large for type: #"\u2715".
type: string
Error: /home/runner/work/effekt/effekt/out/tests/effekt.mltests/lib_test.sml 561.67-561.74.
String constant with character too large for type: #"\u2713".
type: string
This leaves two options:
1. either use a different string type on MLton that supports these large escapes (WideString is UTF-32 [?], as far as I understand from http://mlton.org/Unicode),
2. or escape them byte-by-byte with the \[0-9]{3}-style syntax.
Moved from #542
Motivation
When using characters like "✓" and "✕", MLton reports the "String constant with character too large for type" errors quoted at the top of this issue.
Investigation
Here's the source for MLton's lexer, which indicates support for \[0-9]{3}, \u[0-9A-F]{4} and \U[0-9A-F]{8} escapes: https://github.com/MLton/mlton/blob/680bfcc6d6d8df3e51220fd88d297830316b89b4/mlton/front-end/ml.lex#L446-L457
There are no real docs for it, though. The only thing I found suggests that multi-byte escapes should be escaped to single bytes (locked under a flag), see http://www.mlton.org/SuccessorML#ExtendedTextConsts
The error itself is defined here: https://github.com/MLton/mlton/blob/680bfcc6d6d8df3e51220fd88d297830316b89b4/mlton/elaborate/elaborate-core.fun#L451-L464, and I think this indicates that we should pick one of the two options listed above: (1) a string type that supports these large escapes, or (2) byte-by-byte escaping with the \[0-9]{3}-style syntax.
Solution
I think keeping strings UTF-8(-ish) is worth it, so I'd prefer solution 2, even though it's slightly more work on our part.
Here's the code that needs to change: https://github.com/effekt-lang/effekt/blob/08fc8fdee26d420c05823862ee008092582198dd/effekt/shared/src/main/scala/effekt/generator/ml/Transformer.scala#L637-L638
I think that something like c.toString.getBytes("UTF-8") could be useful here to get a sequence of bytes, each of which then gets mapped to the \[0-9]{3} format.
Testing
It would be very valuable to have a few more tests for this behaviour and to check that such characters work on all of the different backends.
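For reference, the byte-wise escaping proposed under Solution could look roughly like the sketch below. It is not the actual code in Transformer.scala: the object name MLStringEscape and the function escapeForML are made up for illustration, and it assumes that escaping the quote, the backslash and every byte outside printable ASCII is enough for MLton. Expected values of this kind could also feed the extra tests.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper (not part of the Effekt code base): turns a string into
// the body of an SML string literal, emitting every UTF-8 byte that is not
// printable ASCII as a three-digit decimal \ddd escape.
object MLStringEscape {
  def escapeForML(s: String): String =
    s.getBytes(StandardCharsets.UTF_8).iterator.map { b =>
      val code = b & 0xff // JVM bytes are signed; mask to the range 0..255
      if (code == 0x22) "\\\""                                   // double quote
      else if (code == 0x5c) "\\\\"                              // backslash
      else if (code >= 0x20 && code < 0x7f) code.toChar.toString // printable ASCII
      else "\\" + "%03d".format(code)                            // e.g. \226
    }.mkString
}
```

With this sketch, "✓" (U+2713, UTF-8 bytes E2 9C 93) becomes \226\156\147 and "✕" (U+2715) becomes \226\156\149, both of which fit into MLton's 8-bit string type.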