Dushistov / flapigen-rs

Tool for connecting programs or libraries written in Rust with other languages
BSD 3-Clause "New" or "Revised" License
775 stars 59 forks source link

Java bindings do not handle Java's "Modified UTF-8" #441

Closed ijc closed 1 year ago

ijc commented 1 year ago

The generated Java bindings use JavaString as part of the transform from a Java/JNI jstring to a Rust String.

This uses the JNI GetStringUTFChars function, converts that to a CStr and finally to a &str with CStr::to_str using unwrap() on the result.

The to_str method requires the CStr to contain valid UTF-8. However GetStringUTFChars docs) say:

Returns a pointer to an array of bytes representing the string in modified UTF-8 encoding.

JNI docs and Wikipedia describe it but in brief:

CStr::to_str does reject at least the 0xc0 0x80 encoding of NULL as can be seen here. I'm unsure what it does with surrogates, but AIUI they are not allowed in proper UTF-8 so I guess it would reject that too.

We discovered this when a user of our Java app accidentally clicked on a binary file instead of the expected CSV file.

What options do we have for handling this more gracefully than an unwrap? Ideally we wouldn't crash the app in this case.

Perhaps https://docs.rs/cesu8/latest/cesu8/fn.from_java_cesu8.html would allow us to handle this without allocating in the case where the modified utf-8 string didn't deviate from proper utf-8? Would you consider using that in the generate code?

For our use case causing the binding function to return a Result::Err would be acceptable, but I don't think the overall flapigen paradigm allows a foreign_typemap conversion to fail?

ijc commented 1 year ago

FYI looks like it dislikes surrogates too: playground.

ijc commented 1 year ago

@Dushistov Would you consider a PR which make JavaString use cesu8::from_java_cesu8::from_java_cesu8, perhaps behind a feature or otherwise configurable?

Dushistov commented 1 year ago

I am obviously glad to accept PR to fix for this issue. But flapigen is build time dependency, not runtime dependency so usage of third party crate is not simple. May be instead use GetStringChars and then String::from_utf16 ?

And may be this method would be faster, then usage of 'cesu8', because of (I suppose) for the most common case are small enough strings, and for them the slowest part of conversation is memory allocation. And path where we make one memory allocation for utf-16 -> utf-8, would be faster then allocation for utf-16 to utf-8 (cesu8) and then memory allocation for conversation to good utf-8.

ijc commented 1 year ago

Thanks for the reply. Your approach of consuming the UTF-16 direct does indeed look more sensible than what I had proposed. I presume doesn't need to be conditional in anyway since it's not adding a dep and is a correctness thing.

I may not get to it right away but I'll raise a PR hopefully sooner rather than later.