Open ajewellamz opened 3 months ago
It should be possible to fix this in Rust by passing the previous char along when decoding with --unicode-char:false:

struct DafnyPrintContext {
    in_seq: bool,
    previous_utf16_char: Option<u16>,
}
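As a hypothetical illustration of that pairing approach (sketched in Python, since the rest of this issue is Python; this is not the Rust runtime's actual code): hold a high surrogate until the next code unit arrives, then combine the two into one code point.

```python
# Hypothetical sketch of "remember the previous code unit" decoding.
# prev plays the role of previous_utf16_char; unpaired surrogates are
# not handled, to keep the sketch short.
def combine(prev, u):
    if prev is not None and 0xDC00 <= u <= 0xDFFF:
        # High surrogate followed by low surrogate: one code point.
        cp = 0x10000 + ((prev - 0xD800) << 10) + (u - 0xDC00)
        return chr(cp), None
    if 0xD800 <= u <= 0xDBFF:
        return "", u  # hold the high surrogate until the next unit
    return chr(u), None

out, prev = [], None
for u in [0xD835, 0xDFC1]:  # the UTF-16 code units for U+1D7C1
    s, prev = combine(prev, u)
    out.append(s)
print("".join(out))  # 𝟁
```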
dafny run -t py --unicode-char:false
will fail whenever the Dafny code includes a string like "𝟁", regardless of what else the code does.
That is, Dafny will claim success when generating code, and the generated Python will fail when it tries to compile.
More information on the issue in Python:
If my Dafny source code has strings outside the BMP (e.g. var a := "𝟁"), then when I translate to Python with --unicode-char:false, I expect the generated Python to contain a series of UTF-16 surrogate pairs representing "𝟁" (i.e. "\uD835\uDFC1").
However, Dafny currently writes "𝟁" literally.
My understanding is that --unicode-char:false means "Dafny strings are sequences of 16-bit UTF-16 code units".
Right now, Dafny code containing the string
var a := "𝟁"
translated to Python with --unicode-char:false results in
d_0_a_ = _dafny.Seq("𝟁")
However, Python will not interpret "𝟁" as a pair of UTF-16 code units by default:
Python 3.12.3 (main, May 9 2024, 13:24:59) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import _dafny
>>> s = _dafny.Seq("𝟁")
>>> s.Elements
['𝟁']
>>> len(s.Elements)
1
>>> str.encode(s.Elements[0])
b'\xf0\x9d\x9f\x81'
>>> len(str.encode(s.Elements[0]))
4
Python is interpreting this as a single code point, U+1D7C1 (whose UTF-8 encoding is the 4 bytes shown above), and not as two 2-byte UTF-16 code units.
This causes issues when using the DafnyRuntimePython's string_of method:
>>> s = _dafny.Seq("𝟁")
>>> _dafny.string_of(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/lucmcdon/.pyenv/versions/3.12.3/lib/python3.12/site-packages/_dafny/__init__.py", line 28, in string_of
return value.__dafnystr__()
^^^^^^^^^^^^^^^^^^^^
File "/Users/lucmcdon/.pyenv/versions/3.12.3/lib/python3.12/site-packages/_dafny/__init__.py", line 172, in __dafnystr__
return string_from_utf_16(self.Elements)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/lucmcdon/.pyenv/versions/3.12.3/lib/python3.12/site-packages/_dafny/__init__.py", line 24, in string_from_utf_16
return b''.join(ord(c).to_bytes(2, 'little') for c in utf_16_code_units).decode("utf-16-le", errors = 'replace')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/lucmcdon/.pyenv/versions/3.12.3/lib/python3.12/site-packages/_dafny/__init__.py", line 24, in <genexpr>
return b''.join(ord(c).to_bytes(2, 'little') for c in utf_16_code_units).decode("utf-16-le", errors = 'replace')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OverflowError: int too big to convert
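The immediate cause of the overflow: ord on a non-BMP character yields a code point above 0xFFFF, which cannot fit in the 2 bytes that string_from_utf_16 allots per code unit. For example:

```python
>>> ord("𝟁")  # U+1D7C1 is above 0xFFFF
120769
>>> ord("𝟁").to_bytes(2, 'little')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: int too big to convert
```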
So Dafny generates invalid code with --unicode-char:false because the string it generates is not a "sequence of 16-bit UTF-16 code units".
I would expect that translating "𝟁" with --unicode-char:false would instead write "\uD835\uDFC1", which is interpreted as a series of UTF-16 code units:
>>> s = _dafny.Seq("\uD835\uDFC1")
>>> s.Elements
['\ud835', '\udfc1']
>>> len(s.Elements[0])
1
>>> _dafny.string_of(s)
'𝟁'
So, this seems to be an issue only when translating characters outside the BMP in strings in Dafny source code to Python (and possibly other runtimes). It can be resolved by writing any character in a Dafny string that lies outside the BMP as a UTF-16 surrogate pair during translation, as sketched below.
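A minimal sketch of such an escaping step (a hypothetical helper, not the Dafny compiler's actual code), assuming the translator can process string literals one code point at a time:

```python
# Hypothetical helper: rewrite non-BMP characters as \uXXXX\uXXXX
# surrogate-pair escapes when emitting a Python string literal.
def escape_non_bmp(s):
    out = []
    for c in s:
        cp = ord(c)
        if cp > 0xFFFF:
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)   # high (leading) surrogate
            low = 0xDC00 + (cp & 0x3FF)  # low (trailing) surrogate
            out.append("\\u%04X\\u%04X" % (high, low))
        else:
            out.append(c)
    return "".join(out)

print(escape_non_bmp("𝟁"))  # \uD835\uDFC1
```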
Dafny developers can work around this by replacing characters outside the BMP in strings in Dafny source code with their UTF-16 surrogate pairs. The MPL will work around this in https://github.com/aws/aws-cryptographic-material-providers-library/pull/672
Dafny version
4.7.0 (master as of 29 Aug, except for Rust which is latest feat_rust)
Code to produce this issue
Command to run and resulting output
What happened?
I would expect all languages to behave as the C# backend (cs) does.
What type of operating system are you experiencing the problem on?
Mac