dankogai / p5-encode

Encode - character encodings (for Perl 5.8 or better)
https://metacpan.org/release/Encode

Perl 5.30.2 replaces some invalid UTF-8 byte sequences inconsistent with current best practices. #166

Open flenniken opened 2 years ago

flenniken commented 2 years ago

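When decoding, Perl 5.30.2 replaces some invalid UTF-8 byte sequences in a way that is inconsistent with current best practices: it emits fewer U+FFFD replacement characters than the recommended handling does.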

The Unicode specification says:

An increasing number of implementations are adopting the handling of ill-formed subsequences as specified in the W3C standard for encoding to achieve consistent U+FFFD replacements.

See:

For example, the hex byte sequence:

<e0 80 7f>

gets decoded to (shown as UTF-8 bytes, where ef bf bd is U+FFFD):

<ef bf bd 7f>

instead of the expected:

<ef bf bd ef bf bd 7f>

Here are a few more examples:

Perl decode: e0 80 80
expected: ef bf bd ef bf bd ef bf bd
got: ef bf bd

Perl decode: f0 80 80 80
expected: ef bf bd ef bf bd ef bf bd ef bf bd
got: ef bf bd

Perl decode: ed ae 80 ed b0 80
expected: ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd
got: ef bf bd ef bf bd
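
As a cross-check of the "expected" values, here is a minimal Python sketch. It does not call Perl's Encode at all. It assumes CPython 3's built-in UTF-8 decoder with errors="replace", which to my knowledge substitutes one U+FFFD per maximal ill-formed subpart, as a stand-in for the recommended handling. The helper name replace_ill_formed is hypothetical and exists only for this illustration.

```python
def replace_ill_formed(hex_bytes: str) -> str:
    """Decode hex-encoded bytes as UTF-8, substituting U+FFFD for each
    ill-formed subsequence, and return the result re-encoded as UTF-8 hex.

    Assumption: CPython's errors="replace" handler is used here as a
    reference for the recommended behavior; it is not Perl's decoder.
    """
    raw = bytes.fromhex(hex_bytes)              # whitespace between bytes is allowed
    text = raw.decode("utf-8", errors="replace")
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


# Inputs taken from the examples in this issue.
CASES = [
    "e0 80 7f",
    "e0 80 80",
    "f0 80 80 80",
    "ed ae 80 ed b0 80",
]

if __name__ == "__main__":
    for case in CASES:
        print(f"input: {case}  decoded: {replace_ill_formed(case)}")
```

In each of these inputs the lead byte is immediately followed by a byte outside the range allowed for it (for example, e0 requires a0..bf next), so no byte pair forms a valid prefix and every byte should be replaced on its own.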

See https://github.com/flenniken/utf8tests for more information.