Kotlin / kotlinx-io

Kotlin multiplatform I/O library
Apache License 2.0

Map each code point of an ill-formed UTF-8 subsequence to a replacement character individually #301

Open fzhinkin opened 6 months ago

fzhinkin commented 6 months ago

As was pointed out in https://github.com/Kotlin/kotlinx-io/pull/290#discussion_r1567268068, kotlinx-io converts different ill-formed UTF-8 subsequences differently: either the whole multi-code-point subsequence is replaced with a single replacement character, or each code point is converted separately.

The UTF-8 spec allows handling these ill-formed sequences in whatever way we want, as long as errors are somehow reported. However, the current behavior looks inconsistent, and it's hard to reason about how an arbitrary byte sequence will be converted.

We should improve the way ill-formed sequences are handled and stick to an approach adopted by other languages/libraries: convert only ill-formed subsequences consisting of a single byte.

That's how it's done in:

����

ilya-g commented 6 months ago

See also the recommendation "U+FFFD Substitution of Maximal Subparts" in https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf
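
To make the recommended policy concrete, here is a minimal sketch of a decoder that emits one U+FFFD per maximal subpart of an ill-formed subsequence. It is not the kotlinx-io implementation; the function name `decodeUtf8MaximalSubparts` is hypothetical, and the byte ranges follow the well-formed UTF-8 byte sequence table in the Unicode Standard.

```kotlin
// Illustrative sketch only: hypothetical helper, not part of kotlinx-io or the Kotlin stdlib.
// Decodes UTF-8 and replaces each maximal subpart of an ill-formed subsequence with one U+FFFD.
fun decodeUtf8MaximalSubparts(bytes: ByteArray): String {
    val replacement = '\uFFFD'
    val out = StringBuilder()
    var i = 0
    while (i < bytes.size) {
        val lead = bytes[i].toInt() and 0xFF
        i++
        if (lead < 0x80) { // ASCII
            out.append(lead.toChar())
            continue
        }
        // Number of continuation bytes the lead byte asks for; 0 means the byte
        // can never start a well-formed sequence (80..C1, F5..FF).
        val needed = when (lead) {
            in 0xC2..0xDF -> 1
            in 0xE0..0xEF -> 2
            in 0xF0..0xF4 -> 3
            else -> 0
        }
        if (needed == 0) {
            out.append(replacement)
            continue
        }
        // The second byte has a narrower range after some lead bytes; this is what
        // rejects overlong forms, surrogates, and code points above U+10FFFF.
        val secondRange = when (lead) {
            0xE0 -> 0xA0..0xBF
            0xED -> 0x80..0x9F
            0xF0 -> 0x90..0xBF
            0xF4 -> 0x80..0x8F
            else -> 0x80..0xBF
        }
        var codePoint = lead and (0x3F shr needed)
        var consumed = 0
        while (consumed < needed && i < bytes.size) {
            val next = bytes[i].toInt() and 0xFF
            val range = if (consumed == 0) secondRange else 0x80..0xBF
            if (next !in range) break // offending byte is left for the next iteration
            codePoint = (codePoint shl 6) or (next and 0x3F)
            consumed++
            i++
        }
        if (consumed < needed) {
            // The lead byte plus the accepted continuation bytes form one maximal subpart.
            out.append(replacement)
        } else if (codePoint < 0x10000) {
            out.append(codePoint.toChar())
        } else {
            // Encode a supplementary code point as a UTF-16 surrogate pair.
            val v = codePoint - 0x10000
            out.append((0xD800 + (v shr 10)).toChar())
            out.append((0xDC00 + (v and 0x3FF)).toChar())
        }
    }
    return out.toString()
}
```

Under this policy the surrogate encoding `ED BF BF` yields three replacement characters, because `BF` is not a valid second byte after `ED`, so each byte is its own maximal subpart. A truncated but otherwise valid prefix such as `F0 90 80` yields a single replacement character.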

fzhinkin commented 1 month ago

It seems like kotlinx-io behavior could be aligned with the Kotlin Stdlib (ByteArray.decodeToString in particular) in all scenarios except the handling of surrogate code points. On JVM, byte sequences encoding surrogate code points are replaced with a single �, on all other platforms with ���:

ubyteArrayOf(0xedu, 0xbfu, 0xbfu).asByteArray().decodeToString()

https://pl.kotl.in/LMjgmMVGX
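
To compare platforms without relying on how a terminal renders U+FFFD, the decoded string can be dumped as code points. This is a small sketch that uses only stdlib calls; the helper name `describeDecoding` is hypothetical.

```kotlin
// Illustrative helper (hypothetical name): lists the UTF-16 code units that
// ByteArray.decodeToString() produces, e.g. "U+FFFD U+FFFD U+FFFD".
fun describeDecoding(bytes: ByteArray): String =
    bytes.decodeToString()
        .map { "U+" + it.code.toString(16).uppercase().padStart(4, '0') }
        .joinToString(" ")

// ED BF BF would encode U+DFFF, a surrogate code point, which is ill-formed in UTF-8.
val surrogateBytes = byteArrayOf(0xED.toByte(), 0xBF.toByte(), 0xBF.toByte())
```

Calling `describeDecoding(surrogateBytes)` on each target shows how many replacement characters that platform's stdlib emits for the sequence.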