dankogai / p5-encode

Encode - character encodings (for Perl 5.8 or better)
https://metacpan.org/release/Encode
37 stars 51 forks source link

FB_CROAK fails to parse constant string #154

Closed FGasper closed 3 years ago

FGasper commented 3 years ago
> perl -MEncode -e'my $s = Encode::decode_utf8("…", Encode::FB_CROAK);'
Modification of a read-only value attempted at -e line 1.

I found nothing in the documentation about this, and the exception is thrown only when FB_CROAK is given, so … bug?

pali commented 3 years ago

I found nothing in the documentation about this

There is a bold and BIG CAVEAT in Encode::decode_utf8 documentation:

CAVEAT: the input $octets might be modified in-place depending on what is set in CHECK. See "LEAVE_SRC" if you want your inputs to be left unchanged.

FGasper commented 3 years ago

Ah ok. FWIW, I think that’s a bit obtuse; it’d be helpful if somewhere in the POD it mentioned the “modification of a read-only” error specifically. But maybe not many folks encounter this? (If you’re amenable I’d be happy to submit a PR.)

I’m drafting a blog post about Perl encoding and trying to show the best way to decode UTF-8. I was hoping I could use decode_utf8("\xff\xff"), but that quietly accepts the invalid input. So I tried decode_utf8("\xff\xff", Encode::FB_CROAK), and got this. So I guess decode_utf8("\xff\xff", Encode::FB_CROAK | Encode::LEAVE_SRC) is it. That’s rather a lot of typing for such a commonplace operation. Alas, utf8::decode() doesn’t even indicate where the failure is, and Unicode::UTF8 seems to lack any exception-throwing option.

I guess it can’t be changed without breaking things (?), but if the input string is read-only, would it make sense just to copy the string and alter that copy?

pali commented 3 years ago

I’m drafting a blog post about Perl encoding and trying to show the best way to decode UTF-8. I was hoping I could use decode_utf8("\xff\xff"), but that quietly accepts the invalid input.

Please really read documentation for decode_utf8! There is another big warning:

WARNING: do not use this function for data exchange ... use $string = decode("UTF-8", $octets [, CHECK]).

FGasper commented 3 years ago

@pali Good tip … but that means it’s even more prolix:

decode("UTF-8", "\xff\xff", Encode::FB_CROAK | Encode::LEAVE_SRC)

What are some ways that decode_utf8() can do the wrong thing? To be honest I don’t really grok what it’s trying to say: it can produce $string with not strict utf8 representation … isn’t the point of a decoder to produce decoded characters, not UTF-8?

Without meaning to belabour the point: if the input string is read-only, would it make sense just to copy the string and alter that copy, rather than throwing an exception?

pali commented 3 years ago

isn’t the point of a decoder to produce decoded characters, not UTF-8?

Ok, I agree that message should be changed s/it can produce/it can accept and decode/.

if the input string is read-only, would it make sense just to copy the string and alter that copy, rather than throwing an exception?

this is api change. function always worked in current way... and moreover it would cause other new issues (what if you pass and object with overloaded stringification function where you can or cannot modify value? and things which cannot be detected...).

FGasper commented 3 years ago

I agree that message should be changed

I’m happy to submit a PR. If you can point me to a specific example of where decode_utf8() does the wrong thing I’ll include that as well.

it would cause other new issues

I figured that had been considered; thank you for your response.

pali commented 3 years ago

If you can point me to a specific example of where decode_utf8() does the wrong thing I’ll include that as well.

Have you read Encode documentation? If not, please really read it! As the issue about which you are takling is already documented.

FGasper commented 3 years ago

I’ve read it, yes. Nothing in it—including the section you linked—tells me specifically what decode_utf8() does that’s improper, unsafe, etc.

It probably seems clear to you, but you’re also the module’s longtime maintainer.

FWIW I would expect to see something like this under decode_utf8()’s documentation:

WARNING: This function uses L<outdated decoding logic|/UTF-8 vs. utf8 vs. UTF8>. Do not use this function for data exchange …

pali commented 3 years ago

Well, I'm not maintainer... But documentation already describe that decode_utf8 can produce $string with not strict utf8 representation. strict utf-8 is perl term which is already described in Encode documentation. Also Perl's UNICODE behavior is described in perlunicode documenatation. I do not agree that it is outdated decoding logic. It is standard Perl behavior for Perl UNICODE.

FGasper commented 3 years ago

I agree that the documentation discusses the differences among Perl’s m/UTF-?8/i variants. I just think it would be helpful if the documentation more clearly linked that section as the rationale for “do not use this function for data exchange”.

Anyhow, thank you for your responses. I’ll close this thread.

pali commented 3 years ago

If you have a time, you can try to add links. There is perlunicode, Encode and maybe also other documentation files...