dafny-lang / dafny

Dafny is a verification-aware programming language
https://dafny.org
Other
2.94k stars 263 forks source link

Dafny fails with most languages when using UTF-16 escape sequences #5737

Open ajewellamz opened 3 months ago

ajewellamz commented 3 months ago

Dafny version

4.7.0 (master as of 29 Aug, except for Rust which is latest feat_rust)

Code to produce this issue

module foo {
  method bar() {
    var decoded := "\uD835\uDFC1";
    var expected_redecoded := "𝟁";
    print decoded, "\n", expected_redecoded, "\n";
    print |decoded|, "\n", |expected_redecoded|, "\n";
    assert decoded == expected_redecoded;
    expect "\uD835\uDFC1" ==  "𝟁";
  }

  method Main() {
    print "Hello World!\n";
    bar();
  }
}

Command to run and resulting output

This command succeeds as expected :
dafny run -t cs --unicode-char:false foo.dfy

js also works correctly.

With a target of python or rust, we get weird runtime errors.

With java, it fails to run, for reasons that don't seem string related.

go mostly works, but prints `��` instead of `𝟁`

With cpp a compiler error "invalid universal character"

What happened?

I would expect all languages to behave as cs.

What type of operating system are you experiencing the problem on?

Mac

MikaelMayer commented 3 months ago

It should be possible to fix this in Rust by passing the previous char when decoding in --unicode-char:false

struct DafnyPrintContext {
    in_seq: bool,
    previousUtf16Char: Option<u16>
}
ajewellamz commented 3 months ago

dafny run -t py --unicode-char:false will fail, regardless of the dafny code involved, if it includes a string like "𝟁". That is, it will claim success when generating code, and the python will fail when it tries to compile.

lucasmcdonald3 commented 3 months ago

More information on the issue in Python:

If my Dafny source code has strings outside the BMP (ex. var a := "𝟁"), when I translate to Python with --unicode-char:false, then I expect the generated Python to contain a series of UTF-16 surrogate pairs representing "𝟁" (i.e. "\uD835\uDFC1"). However, Dafny currently writes "𝟁" literally.

My understanding is that --unicode-char:false means "Dafny strings are sequences of 16-bit UTF-16 code units". Right now, Dafny code containing a string

var a :=  "𝟁"

translated to Python with --unicode-char:false results in

d_0_a_ = _dafny.Seq("𝟁")

However, Python will not interpret "𝟁" as a UTF-16 character by default:

Python 3.12.3 (main, May  9 2024, 13:24:59) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import _dafny
>>> s = _dafny.Seq("𝟁")
>>> s.Elements
['𝟁']
>>> len(s.Elements)
1
>>> str.encode(s.Elements[0])
b'\xf0\x9d\x9f\x81'
>>> len(str.encode(s.Elements[0]))
4

Python is interpreting this as a single 4-byte UTF-32 character, and not as 2 2-byte UTF-16 characters. This causes issues when using the DafnyRuntimePython's .string_of method:

>>> s = _dafny.Seq("𝟁")
>>> _dafny.string_of(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lucmcdon/.pyenv/versions/3.12.3/lib/python3.12/site-packages/_dafny/__init__.py", line 28, in string_of
    return value.__dafnystr__()
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/lucmcdon/.pyenv/versions/3.12.3/lib/python3.12/site-packages/_dafny/__init__.py", line 172, in __dafnystr__
    return string_from_utf_16(self.Elements)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lucmcdon/.pyenv/versions/3.12.3/lib/python3.12/site-packages/_dafny/__init__.py", line 24, in string_from_utf_16
    return b''.join(ord(c).to_bytes(2, 'little') for c in utf_16_code_units).decode("utf-16-le", errors = 'replace')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lucmcdon/.pyenv/versions/3.12.3/lib/python3.12/site-packages/_dafny/__init__.py", line 24, in <genexpr>
    return b''.join(ord(c).to_bytes(2, 'little') for c in utf_16_code_units).decode("utf-16-le", errors = 'replace')
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OverflowError: int too big to convert

So Dafny generates invalid code with --unicode-char:false because the string it generates is not a "sequence of 16-bit UTF-16 code units".

I would expect that translating "𝟁" with --unicode-char:false would instead write "\uD835\uDFC1", which is interpreted as a series of UTF-16 code units:

>>> s = _dafny.Seq("\uD835\uDFC1")
>>> s.Elements
['\ud835', '\udfc1']
>>> len(s.Elements[0])
1
>>> _dafny.string_of(s)
'𝟁'

So, this seems to only be an issue with translating characters outside the BMP in strings in Dafny source code to Python (and possibly other runtimes). This can be resolved by writing any character inside a Dafny string that is outside the BMP as a UTF-16 surrogate pair during translation.

lucasmcdonald3 commented 2 months ago

Dafny developers can work around this by replacing Unicode characters whose UTF-16 encodings lie outside the BMP in strings inside Dafny source code with UTF-16 surrogate pairs. The MPL will work around this in https://github.com/aws/aws-cryptographic-material-providers-library/pull/672