Closed (ericovva closed this pull request 7 years ago)
According to the Unicode Standard, Version 10.0, Section 3.9 Unicode Encoding Forms, Table 3-7 (Well-Formed UTF-8 Byte Sequences), bytes 00..7F are well-formed and correspond to Unicode code points U+0000..U+007F.
Therefore the change in this pull request goes against the Unicode Standard, and I would suggest not merging it in its current form.
When we retrieve this file from storage and want to encode_json structure, that contains filename, we get the error: "malformed or illegal unicode character in string [�■■?], cannot convert to JSON"
This looks like a completely different problem: trying to decode arbitrary bytes from a filename as a UTF-8 sequence, which is not guaranteed to work. For decoding such bytes, which may contain valid UTF-8 subsequences, the FB_PERLQQ CHECK parameter of Encode::decode can be used. See the documentation: https://metacpan.org/pod/Encode#FB_PERLQQ-FB_HTMLCREF-FB_XMLCREF
Example:
use Encode;
my $unicode_filename = Encode::decode('UTF-8', $byte_filename, Encode::FB_PERLQQ);
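To make the one-liner above concrete, here is a minimal, self-contained sketch. The sample bytes and the use of JSON::PP are assumptions chosen for illustration (the original report does not say which JSON module was used); only Encode::decode and Encode::FB_PERLQQ come from the suggestion itself.

```perl
use strict;
use warnings;
use Encode ();
use JSON::PP ();   # assumption: JSON::PP stands in for whatever JSON encoder is in use

# Hypothetical file name bytes: "caf", valid UTF-8 for U+00E9 (C3 A9), then a stray 0xFF.
my $byte_filename = "caf\xC3\xA9\xFF.txt";

# Valid sequences decode normally; each undecodable byte becomes a literal \xHH escape.
my $unicode_filename = Encode::decode('UTF-8', $byte_filename, Encode::FB_PERLQQ);

# The result is an ordinary character string that a JSON encoder can serialize.
my $json = JSON::PP->new->utf8->encode({ filename => $unicode_filename });
print "$json\n";
```

The escape keeps the information about the bad byte visible in the decoded name, which may or may not be what the application wants; it is a lossy but readable representation, not a round-trippable one.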
As there is no progress, @dankogai would you close and drop this pull request?
Dropping & Closing.
Thank you for the fast answer. This function filters malformed characters from a string, and I agree with you that it must not be part of the Encode module, which is one of the core Perl modules!
This function filters malformed characters (invalid code points) from a UTF-8 string (https://en.wikipedia.org/wiki/UTF-8, https://tools.ietf.org/html/rfc3629). Use case: a user creates a file in their profile with a filename that contains malformed characters. When we retrieve this file from storage and want to encode_json a structure that contains the filename, we get the error: "malformed or illegal unicode character in string [■■?], cannot convert to JSON"
Others have asked similar questions: https://stackoverflow.com/questions/6234386/how-do-i-sanitize-invalid-utf-8-in-perl
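For comparison, the filtering described in this use case can be approximated with the existing Encode API, without any new function. This is only a sketch under assumptions: the sample bytes are hypothetical, and it relies on Encode::decode substituting U+FFFD for malformed input when the CHECK argument is omitted.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical file name bytes: "caf", valid UTF-8 for U+00E9 (C3 A9), then a stray 0xFF.
my $byte_filename = "caf\xC3\xA9\xFF.txt";

# With CHECK omitted, malformed input is replaced by the substitution character U+FFFD.
my $decoded = Encode::decode('UTF-8', $byte_filename);

# Drop the substitution characters to "filter" the malformed parts.
# Caveat: this also removes any U+FFFD that was legitimately present in the name.
(my $clean_filename = $decoded) =~ s/\x{FFFD}//g;
```

Whether to drop the malformed bytes, keep U+FFFD, or escape them with FB_PERLQQ is an application decision; silently dropping bytes can make two different file names collide.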