Open skmuduli92 opened 1 month ago
Hi @skmuduli92, thanks for the report! I tried to isolate this problem a bit here:
let s1 = "\163" (* Note: this is decimal character code 163, hex 0xA3, octal 243 *)
let s2 = "a"
let _ = print_int (String.length s1); print_string "\n"
let _ = print_int (String.length s2); print_string "\n"
let _ = print_int (String.length (s1 ^ s2)); print_string "\n"
let _ = print_string "\n"
let _ = print_int (BatUTF8.length s1); print_string "\n"
let _ = print_int (BatUTF8.length s2); print_string "\n"
let _ = print_int (BatUTF8.length (s1 ^ s2)); print_string "\n"
let _ = print_string "\n"
Output:
1
1
2
1
1
1
The first string is not a well-formed UTF-8 string, since 0xA3 = 0b10100011, which is an invalid UTF-8 encoding regardless of what follows (to my understanding). BatUTF8 claims the length is 1, instead of say raising an exception for a bad string (which is arguably a good decision for a production library, but we can call BatUTF8.validate
to confirm it's ill-formed). The usual OCaml strcat (which we use for ^
) does not care about UTF-8 at all and just concats, creating a new string with a single UTF-8 character (it happens to still be invalid, but again BatUTF8 claims the length is 1), showing the problem in your post.
However, we're passing in bad strings to this function. The precondition is that string
is a UTF-8 string, and the call from OCaml code is breaking that assumption. So I don't see an unsoundness just now.
However, I'm not really sure if the libraries are consistent with that assumption... Unicode/UTF-8 support in F* has been lagging a bit.
Did you find any way to trigger this internally from F*? Or by using valid UTF-8 strings?
"...The usual OCaml strcat(which we use for ^) does not care about UTF-8 at all and just concats..."
I have also tried using the FStar_String.concat, the behavior is still same. Is it expected? And the precondition of string being wellformed UTF8, is it implicit ?
Did you find any way to trigger this internally from F*? Or by using valid UTF-8 strings?
"...The usual OCaml strcat(which we use for ^) does not care about UTF-8 at all and just concats..."
I have also tried using the FStar_String.concat, the behavior is still same. Is it expected? And the precondition of string being wellformed UTF8, is it implicit ?
Right, FStar_String.concat is just OCaml's (^). The assumption is implicit in the sense that it's not checked, but also a bit poorly documented. In FStar.String.fsti there's a mention strlen
giving the amount of UTF-8 characters, instead of the byte string. https://github.com/FStarLang/FStar/blob/9820798dcc31cd1ea5c164611a67f58ade0b7655/ulib/FStar.String.fsti#L49-L51
Perhaps we should rename FStar.String to FStar.UTF8String or similar to make it clearer, and maybe alias String to it. We could also have an FStar.ByteString.
Did you find any way to trigger this internally from F*? Or by using valid UTF-8 strings?
- I have not yet tried to reproduce it internally. I will give it a shot.
- The char_code definition in FStar.Char.fsti tells that code 163 is a valid character. Have I used it the right way in the example that I shared?
163 is a valid character (or code point) but in UTF-8 it is not encoded as just one byte. The hex for 163 is 0xa3
, but in UTF-8 this codepoint is encoded as two bytes: 0xc2 0xa3
(with 0xa3
appearing again being sort of a coincidence :) ). In ocaml when you write \163
that means the literal byte 0xa3
, not the utf-8 encoding for it.
More in general this does show a problem in that the UTF-8 requirement is not clear, and not checked for OCaml clients. I think we should do the renaming above, and maybe provide a safe API that does check for well-formedness when the string comes from an OCaml client.
We were testing out some of the ulib ocaml library function for strings generated from its Fstar implementation. For length and concat(^) functions, we found a test input which violates the lemma "length (s1 ^ s2) = length s1 + length s2"
Below is an ocaml example to reproduce this. We tried running it on the latest FStar version.
Output:
The length of the concated string should be 3 but instead it computes 2 which violates the lemma.