FStarLang / FStar

A Proof-oriented Programming Language
https://www.fstar-lang.org
Apache License 2.0
2.65k stars 232 forks source link

String length lemma violation in the generate ocaml code ! #3296

Open skmuduli92 opened 1 month ago

skmuduli92 commented 1 month ago

We were testing out some of the ulib ocaml library function for strings generated from its Fstar implementation. For length and concat(^) functions, we found a test input which violates the lemma "length (s1 ^ s2) = length s1 + length s2"

Below is an ocaml example to reproduce this. We tried running it on the latest FStar version.

open FStar_String

let target s1 s2 =
        let s1_len = (FStar_String.length s1) in
        let s2_len = (FStar_String.length s2) in
        let s1s2_len = (FStar_String.strlen (FStar_String.strcat s1 s2)) in
        Printf.printf "s1_len: %d\n" (Z.to_int s1_len);                                                                                                                             
        Printf.printf "s2_len: %d\n" (Z.to_int s2_len);                                                                                                                             
        Printf.printf "s1s2_len: %d\n" (Z.to_int s1s2_len);                                                                                                                         
        if s1s2_len <> (Z.add s1_len s2_len) then
                failwith "check failed: (strlen (concat s1 s2)) = (strlen s1) + (strlen s2)"

let () = target "{\163" "\004"

Output:

s1_len: 2
s2_len: 1
s1s2_len: 2
Fatal error: exception Failure("check failed: (strlen (concat s1 s2)) = (strlen s1) + (strlen s2)")

The length of the concated string should be 3 but instead it computes 2 which violates the lemma.

mtzguido commented 1 month ago

Hi @skmuduli92, thanks for the report! I tried to isolate this problem a bit here:

let s1 = "\163" (* Note: this is decimal character code 163, hex 0xA3, octal 243 *)
let s2 = "a"

let _ = print_int (String.length s1); print_string "\n"
let _ = print_int (String.length s2); print_string "\n"
let _ = print_int (String.length (s1 ^ s2)); print_string "\n"
let _ = print_string "\n"

let _ = print_int (BatUTF8.length s1); print_string "\n"
let _ = print_int (BatUTF8.length s2); print_string "\n"
let _ = print_int (BatUTF8.length (s1 ^ s2)); print_string "\n"
let _ = print_string "\n"

Output:

1
1
2

1
1
1

The first string is not a well-formed UTF-8 string, since 0xA3 = 0b10100011, which is an invalid UTF-8 encoding regardless of what follows (to my understanding). BatUTF8 claims the length is 1, instead of say raising an exception for a bad string (which is arguably a good decision for a production library, but we can call BatUTF8.validate to confirm it's ill-formed). The usual OCaml strcat (which we use for ^) does not care about UTF-8 at all and just concats, creating a new string with a single UTF-8 character (it happens to still be invalid, but again BatUTF8 claims the length is 1), showing the problem in your post.

However, we're passing in bad strings to this function. The precondition is that string is a UTF-8 string, and the call from OCaml code is breaking that assumption. So I don't see an unsoundness just now.

However, I'm not really sure if the libraries are consistent with that assumption... Unicode/UTF-8 support in F* has been lagging a bit.

Did you find any way to trigger this internally from F*? Or by using valid UTF-8 strings?

skmuduli92 commented 1 month ago

"...The usual OCaml strcat(which we use for ^) does not care about UTF-8 at all and just concats..."

I have also tried using the FStar_String.concat, the behavior is still same. Is it expected? And the precondition of string being wellformed UTF8, is it implicit ?

Did you find any way to trigger this internally from F*? Or by using valid UTF-8 strings?

mtzguido commented 1 month ago

"...The usual OCaml strcat(which we use for ^) does not care about UTF-8 at all and just concats..."

I have also tried using the FStar_String.concat, the behavior is still same. Is it expected? And the precondition of string being wellformed UTF8, is it implicit ?

Right, FStar_String.concat is just OCaml's (^). The assumption is implicit in the sense that it's not checked, but also a bit poorly documented. In FStar.String.fsti there's a mention strlen giving the amount of UTF-8 characters, instead of the byte string. https://github.com/FStarLang/FStar/blob/9820798dcc31cd1ea5c164611a67f58ade0b7655/ulib/FStar.String.fsti#L49-L51

Perhaps we should rename FStar.String to FStar.UTF8String or similar to make it clearer, and maybe alias String to it. We could also have an FStar.ByteString.

Did you find any way to trigger this internally from F*? Or by using valid UTF-8 strings?

  • I have not yet tried to reproduce it internally. I will give it a shot.
  • The char_code definition in FStar.Char.fsti tells that code 163 is a valid character. Have I used it the right way in the example that I shared?

163 is a valid character (or code point) but in UTF-8 it is not encoded as just one byte. The hex for 163 is 0xa3, but in UTF-8 this codepoint is encoded as two bytes: 0xc2 0xa3 (with 0xa3 appearing again being sort of a coincidence :) ). In ocaml when you write \163 that means the literal byte 0xa3, not the utf-8 encoding for it.

More in general this does show a problem in that the UTF-8 requirement is not clear, and not checked for OCaml clients. I think we should do the renaming above, and maybe provide a safe API that does check for well-formedness when the string comes from an OCaml client.