dankogai / p5-encode

Encode - character encodings (for Perl 5.8 or better)
https://metacpan.org/release/Encode

Utf sanitize #120

Closed ericovva closed 7 years ago

ericovva commented 7 years ago

This function filters malformed characters (invalid code points) from a UTF-8 string (https://en.wikipedia.org/wiki/UTF-8, https://tools.ietf.org/html/rfc3629). Use case: a user creates a file in their profile with a filename that contains malformed characters. When we retrieve this file from storage and try to encode_json a structure that contains the filename, we get the error: "malformed or illegal unicode character in string [￿■■?], cannot convert to JSON"

Others have asked about this as well: https://stackoverflow.com/questions/6234386/how-do-i-sanitize-invalid-utf-8-in-perl

pali commented 7 years ago

According to the Unicode Standard, Version 10.0, Section 3.9 Unicode Encoding Forms, Table 3.7 Well-Formed UTF-8 Byte Sequences, bytes 00..7F are well-formed and correspond to Unicode code points U+0000..U+007F.

Therefore the change in this pull request goes against the Unicode Standard, and I would suggest not merging it in its current form.
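
The statement about bytes 00..7F can be checked with a strict decode. This is only a minimal sketch, not part of the pull request: it assumes that `Encode::FB_CROAK` makes `Encode::decode` die on any malformed input, and the counter `$checked` is a hypothetical name used for illustration.

```perl
use strict;
use warnings;
use Encode ();

my $checked = 0;

for my $byte (0x00 .. 0x7F) {
    my $octets = chr($byte);

    # FB_CROAK dies on malformed input; LEAVE_SRC keeps $octets untouched.
    my $chars = Encode::decode('UTF-8', $octets,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());

    # Each single byte should decode to exactly one code point of the same value.
    die sprintf("byte %02X did not map to U+%04X", $byte, $byte)
        unless length($chars) == 1 && ord($chars) == $byte;

    $checked++;
}

print "checked $checked bytes\n";
```

If the loop finishes without dying, every byte in 00..7F was accepted as well-formed and mapped to the code point with the same value.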

pali commented 7 years ago

When we retrieve this file from storage and want to encode_json structure, that contains filename, we get the error: "malformed or illegal unicode character in string [�■■?], cannot convert to JSON"

This looks like an entirely different problem: trying to decode arbitrary bytes from a filename as a UTF-8 sequence, which does not have to work. To decode such bytes, which may contain valid UTF-8 subsequences, you can pass FB_PERLQQ as the CHECK parameter to Encode::decode. See the documentation: https://metacpan.org/pod/Encode#FB_PERLQQ-FB_HTMLCREF-FB_XMLCREF

Example:

use Encode;
my $unicode_filename = Encode::decode('UTF-8', $byte_filename, Encode::FB_PERLQQ);
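
A slightly fuller sketch of the same idea, carried through to JSON encoding. The filename bytes are made up for illustration, and JSON::PP is an assumption here, since the thread does not say which JSON module produced the error. `Encode::LEAVE_SRC` is added so the input scalar is not modified in place.

```perl
use strict;
use warnings;
use Encode ();
use JSON::PP ();

# Hypothetical filename bytes: "caf", a valid UTF-8 "é" (C3 A9),
# then a stray FF byte that can never occur in well-formed UTF-8.
my $byte_filename = "caf\xC3\xA9\xFF.txt";

# FB_PERLQQ replaces each malformed byte with a literal \xHH escape
# instead of dying; LEAVE_SRC keeps $byte_filename unchanged.
my $unicode_filename = Encode::decode('UTF-8', $byte_filename,
    Encode::FB_PERLQQ() | Encode::LEAVE_SRC());

# The result is a character string, so JSON encoding no longer fails.
my $json = JSON::PP->new->utf8->encode({ filename => $unicode_filename });
print "$json\n";
```

The malformed byte survives as visible escape text rather than being silently dropped, so the original bytes can still be identified later.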

pali commented 7 years ago

As there is no progress, @dankogai would you close and drop this pull request?

dankogai commented 7 years ago

Dropping & Closing.

ericovva commented 7 years ago

Thank you for the fast answer. This function filters malformed characters from a string, and I agree with you that it should not be part of the Encode module, which is one of the core Perl modules!