UTF-8 text literals are partially escaped in error messages

dram commented 1 year ago

Currently, MLton supports UTF-8 text literals when allowExtendedTextConsts is enabled, but such literals are partially escaped when they appear in error messages. For example:

With this single-line program:

val _ = "a甲" ^ (Substring.full "b乙")

Compiling:

% mlton -default-ann 'allowExtendedTextConsts true' test.sml
Error: test.sml 1.9-1.40.
  Function applied to incorrect argument.
    expects: _ * [string]
    but got: _ * [char VectorSlice.slice]
    in: ^ ("a\231\148\178", Substring.full "b\228\185\153")

The problem exists in both version 20210117 and the master branch.
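For reference, the rendering in the message matches what byte-wise escaping produces: ASCII characters pass through and every byte at or above 128 becomes a decimal escape. The snippet below is only a sketch using the Basis Library's `String.toString`, assuming 8-bit chars as in MLton; it is not a claim about which routine MLton uses internally to print the expression.

```sml
(* "甲" is the three bytes E7 94 B2 (231 148 178) in UTF-8. It is written
   with decimal escapes here so that the snippet compiles without the
   allowExtendedTextConsts annotation. *)
val jia = "\231\148\178"

(* String.toString converts a string to SML source syntax: printable ASCII
   is kept, and each remaining byte is rendered as a \ddd escape. *)
val escaped = String.toString ("a" ^ jia)

(* Prints: a\231\148\178 *)
val () = print (escaped ^ "\n")
```

The ASCII `a` survives while the multi-byte character is split into three escapes, which is why the literal looks partially escaped.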

MatthewFluet commented 1 year ago

I don't know that I would really consider that to be a bug.

There are other places where MLton will report an error message without the exact same text as in the source file:

$ cat z.sml 
fun f (w : word) = w
val _ = f 0xabcd
$ mlton z.sml 
Error: z.sml 2.9-2.16.
  Function applied to incorrect argument.
    expects: [word]
    but got: [int]
    in: f 43981

It would also be somewhat tedious to determine the correct escaping, since a string literal might not be a valid sequence of UTF-8 bytes, and the choice of how to display a string literal would depend on the allowExtendedTextConsts annotation.

dram commented 1 year ago

Interesting. I had a look at how some other languages handle this, e.g.:

GHC:

% cat test.hs
main = putStrLn ("a甲\128\032z" + "b乙")
% ghc test.hs
[1 of 1] Compiling Main             ( test.hs, test.o )

test.hs:1:32: error:
    • No instance for (Num String) arising from a use of ‘+’
    • In the first argument of ‘putStrLn’, namely
        ‘("a甲\128\032z" + "b乙")’
      In the expression: putStrLn ("a甲\128\032z" + "b乙")
      In an equation for ‘main’: main = putStrLn ("a甲\128\032z" + "b乙")
  |
1 | main = putStrLn ("a甲\128\032z" + "b乙")
  |

OCaml:

% cat test.ml
let () = print_endline ("a甲\128\032z" + "b乙")
% ocaml test.ml
File "./test.ml", line 1, characters 24-39:
1 | let () = print_endline ("a甲\128\032z" + "b乙")
                            ^^^^^^^^^^^^^^^
Error: This expression has type string but an expression was expected of type
         int

Both of them show error messages with the original source text. I think that is a relatively safe approach: if we assume that the source code is properly encoded in UTF-8, then no escaping is needed.
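If the UTF-8 assumption should be checked rather than taken for granted, one possible shape is sketched below. This is only an illustration, not MLton code: `isUtf8` and `showLiteral` are hypothetical names, chars are assumed to be 8-bit, and the check is deliberately simplified (it validates lead and continuation bytes but does not reject overlong three- and four-byte forms or surrogates). Literals that pass are shown with non-ASCII bytes left as they are; anything else falls back to full escaping, which covers the case of a literal that is not a sequence of UTF-8 bytes.

```sml
(* Hypothetical helper: true if s is structurally well-formed UTF-8. *)
fun isUtf8 (s : string) : bool =
  let
    val n = String.size s
    fun byte i = Char.ord (String.sub (s, i))
    (* Are the k bytes starting at index i all continuation bytes (10xxxxxx)? *)
    fun conts (_, 0) = true
      | conts (i, k) =
          i < n andalso byte i >= 0x80 andalso byte i < 0xC0
          andalso conts (i + 1, k - 1)
    fun go i =
      if i >= n then true
      else
        let
          val b = byte i
          (* Number of continuation bytes implied by the lead byte,
             or ~1 if b cannot start a character. *)
          val k =
            if b < 0x80 then 0
            else if b >= 0xC2 andalso b < 0xE0 then 1
            else if b >= 0xE0 andalso b < 0xF0 then 2
            else if b >= 0xF0 andalso b < 0xF5 then 3
            else ~1
        in
          k >= 0 andalso conts (i + 1, k) andalso go (i + 1 + k)
        end
  in
    go 0
  end

(* Hypothetical display function: keep non-ASCII bytes when the whole
   literal is well-formed UTF-8, otherwise escape every non-printable byte. *)
fun showLiteral (s : string) : string =
  let
    (* ASCII still goes through Char.toString so quotes, backslashes and
       control characters stay escaped. *)
    fun esc c = if Char.ord c >= 0x80 then String.str c else Char.toString c
    val body = if isUtf8 s then String.translate esc s else String.toString s
  in
    "\"" ^ body ^ "\""
  end

(* Well-formed: the bytes of "a甲" are printed unchanged between quotes. *)
val () = print (showLiteral "a\231\148\178" ^ "\n")
(* A lone continuation byte is not UTF-8, so this prints "\128". *)
val () = print (showLiteral "\128" ^ "\n")
```

Whether such a check should apply only under allowExtendedTextConsts is left open here, since the display choice would depend on that annotation.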

UTF-8 text literals are partially escaped in error messages #517