Perl / perl5

đŸȘ The Perl programming language
https://dev.perl.org/perl5/
Other
1.94k stars 553 forks source link

pack propagates UTF8 strings #13989

Open p5pRT opened 10 years ago

p5pRT commented 10 years ago

Migrated from rt.perl.org#122308 (status was 'open')

Searchable as RT122308$

p5pRT commented 10 years ago

From @mikelward

I know this is a consequence of an intended change in 5.10\, but it breaks existing code in non-obvious ways - there should at least be documentation\, if not an option to pack() to change this behavior.

If you pass a UTF8 encoded string to pack() the output of pack() is now a UTF8 encoded string. When the packed data is passed to another program the entire string gets encoded\, including any binary data along with it.

Consider the attached program. It packs an integer and a string and sends the result over a message queue. Because the packed string has the UTF8 encoded attribute it gets encoded and the received string does not unpack to the correct values.

Having pack() handle UTF8 strings the way it does is only useful if the packed data is not sent outside the process\, which is sort of the whole reason for pack in the first place...

p5pRT commented 10 years ago

From @mikelward

utf8pack.pl

p5pRT commented 10 years ago

From @demerphq

On 16 July 2014 22​:18\, Mike L \perlbug\-followup@​perl\.org wrote​:

# New Ticket Created by Mike L # Please include the string​: [perl #122308] # in the subject line of all future correspondence about this issue. # \<URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=122308 >

I know this is a consequence of an intended change in 5.10\, but it breaks existing code in non-obvious ways - there should at least be documentation\, if not an option to pack() to change this behavior.

If you pass a UTF8 encoded string to pack() the output of pack() is now a UTF8 encoded string. When the packed data is passed to another program the entire string gets encoded\, including any binary data along with it.

Consider the attached program. It packs an integer and a string and sends the result over a message queue. Because the packed string has the UTF8 encoded attribute it gets encoded and the received string does not unpack to the correct values.

Having pack() handle UTF8 strings the way it does is only useful if the packed data is not sent outside the process\, which is sort of the whole reason for pack in the first place...

FWIW\, I concur. $work has been bitten by this more than it should.

This change has caused far far more trouble than it should have\, was backwards incompatible\, and really makes pack less than useful for its intended purpose\, data exchange between internal and external systems.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 10 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 10 years ago

From @ikegami

Sounds like you're saying there's a bug in msgsnd? If so\, the solution is to fix msgsnd.

On Thu\, Jul 17\, 2014 at 3​:27 AM\, demerphq \demerphq@&#8203;gmail\.com wrote​:

On 16 July 2014 22​:18\, Mike L \perlbug\-followup@&#8203;perl\.org wrote​:

# New Ticket Created by Mike L # Please include the string​: [perl #122308] # in the subject line of all future correspondence about this issue. # \<URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=122308 >

I know this is a consequence of an intended change in 5.10\, but it breaks existing code in non-obvious ways - there should at least be documentation\, if not an option to pack() to change this behavior.

If you pass a UTF8 encoded string to pack() the output of pack() is now a UTF8 encoded string. When the packed data is passed to another program the entire string gets encoded\, including any binary data along with it.

Consider the attached program. It packs an integer and a string and sends the result over a message queue. Because the packed string has the UTF8 encoded attribute it gets encoded and the received string does not unpack to the correct values.

Having pack() handle UTF8 strings the way it does is only useful if the packed data is not sent outside the process\, which is sort of the whole reason for pack in the first place...

FWIW\, I concur. $work has been bitten by this more than it should.

This change has caused far far more trouble than it should have\, was backwards incompatible\, and really makes pack less than useful for its intended purpose\, data exchange between internal and external systems.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 10 years ago

From @ikegami

On Thu\, Jul 17\, 2014 at 10​:21 AM\, Eric Brine \ikegami@&#8203;adaelis\.com wrote​:

Sounds like you're saying there's a bug in msgsnd? If so\, the solution is to fix msgsnd.

Confirmed.

const char * const mbuf = SvPV_const(mstr\, len);

should be

const char * const mbuf = SvPVbyte_const(mstr\, len);

(SvPVbyte_const doesn't currently exist.)

p5pRT commented 10 years ago

From @demerphq

On 17 July 2014 16​:36\, Eric Brine \ikegami@&#8203;adaelis\.com wrote​:

On Thu\, Jul 17\, 2014 at 10​:21 AM\, Eric Brine \ikegami@&#8203;adaelis\.com wrote​:

Sounds like you're saying there's a bug in msgsnd? If so\, the solution is to fix msgsnd.

Confirmed.

const char * const mbuf = SvPV_const(mstr\, len);

should be

const char * const mbuf = SvPVbyte_const(mstr\, len);

(SvPVbyte_const doesn't currently exist.)

That may be. But the behaviour of pack doesn't make any sense to anyone who uses pack for what it is intended for.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 10 years ago

From @mikelward

On Thu Jul 17 13​:08​:45 2014\, demerphq wrote​:

That may be. But the behaviour of pack doesn't make any sense to anyone who uses pack for what it is intended for.

My point exactly. If pack('a8') puts 8 characters (but 9 bytes because they're UTF8 characters) what is msgsend supposed to do about it? Pack is supposed to let you control the layout of data.

p5pRT commented 10 years ago

From @ap

* Mike L via RT \perlbug\-followup@&#8203;perl\.org [2014-07-18 01​:05]​:

My point exactly. If pack('a8') puts 8 characters (but 9 bytes because they're UTF8 characters) what is msgsend supposed to do about it? Pack is supposed to let you control the layout of data.

Same thing everything is supposed to do about it. Decode the buffer and send the character code points as byte values; *if* they fit into bytes\, that is\, otherwise warn and send the UTF8-encoded raw data – just like e.g. print does.

I mean\, what do you expect pack to do when you give it a UTF8=on string? If you start with data like that and you intend to end in I/O\, *someone* is going to have to ask for bytes from it eventually\, and that someone is going to do exactly what I outlined.

That’s how strings work in Perl (since 5.8 anyhow).

And if msgsend doesn’t do that then it’s clearly broken\, since the data it’s fed need not come from pack. It can always receive UTF8=on strings from somewhere\, and in that situation it needs to DTRT regardless.

There’s an argument to be had about whether pack ought to do the same with UTF8=on strings\, but the bug in msgsend is incontrovertible.

Regards\, -- Aristotle Pagaltzis // \<http​://plasmasturm.org/>

p5pRT commented 10 years ago

From @mikelward

On Thu Jul 17 18​:30​:49 2014\, aristotle wrote​:

Same thing everything is supposed to do about it. Decode the buffer and send the character code points as byte values; *if* they fit into bytes\, that is\, otherwise warn and send the UTF8-encoded raw data – just like e.g. print does.

Yes\, and that's what it does now. That's why I don't think msgsnd is the problem - pack is. pack is supposed to pack things so the user can lay out the bytes the way they want them. If pack is going to produce a UTF8 string as it's output\, which will then have its bytes manipulated\, then pack() is not doing what it was designed to do - layout the bytes the way the caller specified.

p5pRT commented 10 years ago

From @ikegami

On Thu\, Jul 17\, 2014 at 4​:08 PM\, demerphq \demerphq@&#8203;gmail\.com wrote​:

That may be. But the behaviour of pack doesn't make any sense to anyone who uses pack for what it is intended for.

"Doesn't exibit optimal performance" is a far cry from "doesn't make any sense".

A non-fatal sv_downgrade on exit should be ok\, but I seem to remember you wanted to do something quite different.

p5pRT commented 10 years ago

From @ikegami

On Thu\, Jul 17\, 2014 at 4​:53 PM\, Mike L via RT \perlbug\-followup@&#8203;perl\.org wrote​:

On Thu Jul 17 13​:08​:45 2014\, demerphq wrote​:

That may be. But the behaviour of pack doesn't make any sense to anyone who uses pack for what it is intended for.

My point exactly. If pack('a8') puts 8 characters (but 9 bytes because they're UTF8 characters) what is msgsend supposed to do about it? Pack is supposed to let you control the layout of data.

The string only has 8 bytes. Your code even prints them out. msgsend is suppose to send those bytes. It's currently sending some of the internal structure of a scalar instead of the string. That's the bug.

p5pRT commented 10 years ago

From @ap

* Mike L via RT \perlbug\-followup@&#8203;perl\.org [2014-07-18 07​:35]​:

On Thu Jul 17 18​:30​:49 2014\, aristotle wrote​:

Same thing everything is supposed to do about it. Decode the buffer and send the character code points as byte values; *if* they fit into bytes\, that is\, otherwise warn and send the UTF8-encoded raw data – just like e.g. print does.

Yes\, and that's what it does now.

You are welcome to think so. The source code says otherwise.

That's why I don't think msgsnd is the problem - pack is. pack is supposed to pack things so the user can lay out the bytes the way they want them. If pack is going to produce a UTF8 string as it's output\, which will then have its bytes manipulated\, then pack() is not doing what it was designed to do - layout the bytes the way the caller specified.

The only catch being that Perl doesn’t have byte strings.

(It ought to\, I’m inclined to think. But status quo is it doesn’t.)

Regards\, -- Aristotle Pagaltzis // \<http​://plasmasturm.org/>

p5pRT commented 10 years ago

From @demerphq

On 18 July 2014 03​:30\, Aristotle Pagaltzis \pagaltzis@&#8203;gmx\.de wrote​:

* Mike L via RT \perlbug\-followup@&#8203;perl\.org [2014-07-18 01​:05]​:

My point exactly. If pack('a8') puts 8 characters (but 9 bytes because they're UTF8 characters) what is msgsend supposed to do about it? Pack is supposed to let you control the layout of data.

Same thing everything is supposed to do about it. Decode the buffer and send the character code points as byte values; *if* they fit into bytes\, that is\, otherwise warn and send the UTF8-encoded raw data – just like e.g. print does.

Look\, there is a serious bug if by packing a string it *upgrades* the rest of the packed structure. Suddenly a perfectly fine 4 byte integer turns into some UTF8 corrupted crap.

I mean\, what do you expect pack to do when you give it a UTF8=on string?

Exactly what used to happen before we introduced this backwards incompatible behavior *which makes no sense anyway*\, concat the octets in the string into the pack buffer and assume the person calling pack knows what they are doing.

If you start with data like that and you intend to end in I/O\, *someone* is going to have to ask for bytes from it eventually\, and that someone is going to do exactly what I outlined.

I wonder if you even understand the problem the OP is talking about\, and if you ever actually *use* pack for interoperability between discrete systems. Because what you say makes me think you don't really understand the problem here at all.

That’s how strings work in Perl (since 5.8 anyhow).

No. Its not how pack worked for a long time. The 5.8 behavior was sane.

And if msgsend doesn’t do that then it’s clearly broken\, since the data it’s fed need not come from pack. It can always receive UTF8=on strings from somewhere\, and in that situation it needs to DTRT regardless.

There’s an argument to be had about whether pack ought to do the same with UTF8=on strings\, but the bug in msgsend is incontrovertible.

This bug is about pack doing the wrong thing. Please dont try to turn it into a different subject.

$ perl -MEncode=encode_utf8 -we'my $str="\x{100}" x 128; my $packed= pack "V/a"\, $str; print encode_utf8($packed)' | od -x 0000000 80c2 0000 c400 c480 c480 c480 c480 c480 0000020 c480 c480 c480 c480 c480 c480 c480 c480 * 0000400 c480 c480 0080 0000405

Why did the length get corrupted?!!! Why was there no warning when it was corrupted.

This does not work right\, and has not worked right ever since we introduce this stupid utf8 support into pack\, which has I bet NEVER done anything good for people using pack for interoperability. It might have made people using pack for internal perlish stuff easier/safer\, but it broke the primary purpose of pack.

Yves

p5pRT commented 10 years ago

From @demerphq

On 18 July 2014 09​:47\, Aristotle Pagaltzis \pagaltzis@&#8203;gmx\.de wrote​:

* Mike L via RT \perlbug\-followup@&#8203;perl\.org [2014-07-18 07​:35]​:

On Thu Jul 17 18​:30​:49 2014\, aristotle wrote​:

Same thing everything is supposed to do about it. Decode the buffer and send the character code points as byte values; *if* they fit into bytes\, that is\, otherwise warn and send the UTF8-encoded raw data – just like e.g. print does.

Yes\, and that's what it does now.

You are welcome to think so. The source code says otherwise.

That's why I don't think msgsnd is the problem - pack is. pack is supposed to pack things so the user can lay out the bytes the way they want them. If pack is going to produce a UTF8 string as it's output\, which will then have its bytes manipulated\, then pack() is not doing what it was designed to do - layout the bytes the way the caller specified.

The only catch being that Perl doesn’t have byte strings.

(It ought to\, I’m inclined to think. But status quo is it doesn’t.)

Nonsense. You and others have tried to force this broken view on the rest of us\, but that doesn't make it a status-quo\, it means the people who know you are wrong gave up arguing with you.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 10 years ago

From @demerphq

On 18 July 2014 10​:28\, demerphq \demerphq@&#8203;gmail\.com wrote​:

On 18 July 2014 09​:47\, Aristotle Pagaltzis \pagaltzis@&#8203;gmx\.de wrote​:

* Mike L via RT \perlbug\-followup@&#8203;perl\.org [2014-07-18 07​:35]​:

On Thu Jul 17 18​:30​:49 2014\, aristotle wrote​:

Same thing everything is supposed to do about it. Decode the buffer and send the character code points as byte values; *if* they fit into bytes\, that is\, otherwise warn and send the UTF8-encoded raw data – just like e.g. print does.

Yes\, and that's what it does now.

You are welcome to think so. The source code says otherwise.

That's why I don't think msgsnd is the problem - pack is. pack is supposed to pack things so the user can lay out the bytes the way they want them. If pack is going to produce a UTF8 string as it's output\, which will then have its bytes manipulated\, then pack() is not doing what it was designed to do - layout the bytes the way the caller specified.

The only catch being that Perl doesn’t have byte strings.

(It ought to\, I’m inclined to think. But status quo is it doesn’t.)

Nonsense. [elided]

My apologies. This particular subject makes me testier than usual and I spoke rudely and inappropriately.

I feel very strongly there is a bug here\, which I have seen catch multiple clever people coming from other languages using pack. Even experienced hands get bitten by this let alone newbies from other languages.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 10 years ago

From @ikegami

On Fri\, Jul 18\, 2014 at 4​:27 AM\, demerphq \demerphq@&#8203;gmail\.com wrote​:

$ perl -MEncode=encode_utf8 -we'my $str="\x{100}" x 128; my $packed= pack "V/a"\, $str; print encode_utf8($packed)' | od -x 0000000 80c2 0000 c400 c480 c480 c480 c480 c480 0000020 c480 c480 c480 c480 c480 c480 c480 c480 * 0000400 c480 c480 0080 0000405

Why did the length get corrupted?!!!

Because you encoded it using encode_utf8. Why would you do that?!

Why was there no warning when it was corrupted.

How can encode_utf8 know that you're encoding something that isn't meant to be used as text?

p5pRT commented 10 years ago

From @ap

* demerphq \demerphq@&#8203;gmail\.com [2014-07-18 15​:25]​:

On 18 July 2014 10​:28\, demerphq \demerphq@&#8203;gmail\.com wrote​:

Nonsense. [elided]

My apologies. This particular subject makes me testier than usual and I spoke rudely and inappropriately.

No apologies necessary.

p5pRT commented 10 years ago

From @rgs

On 16 July 2014 22​:18\, Mike L \perlbug\-followup@&#8203;perl\.org wrote​:

If you pass a UTF8 encoded string to pack() the output of pack() is now a UTF8 encoded string.

This is a serious bug. It's even possible to generate nonsense output due to that\, which perl should never allow​:

$ bleadperl -MDevel​::Peek -wE 'Dump pack "U0v/a"\,"foo\x{1f0}"' Character(s) in 'a' format wrapped in pack at -e line 1. SV = PV(0x7f82618040a0) at 0x7f8262003b30   REFCNT = 1   FLAGS = (PADTMP\,POK\,pPOK\,UTF8)   PV = 0x7f8261509950 "\4\0foo\360"\0Malformed UTF-8 character (1 byte\, need 4\, after start byte 0xf0) in Dump at -e line 1. [UTF8 "\x{4}\x{0}foo\x{0}"]   CUR = 6   LEN = 10

In my opinion the output of pack should never have the UTF8 flag on.

p5pRT commented 10 years ago

From @demerphq

On 25 July 2014 13​:25\, Rafael Garcia-Suarez \rgarciasuarez@&#8203;gmail\.com wrote​:

On 16 July 2014 22​:18\, Mike L \perlbug\-followup@&#8203;perl\.org wrote​:

If you pass a UTF8 encoded string to pack() the output of pack() is now a UTF8 encoded string.

This is a serious bug. It's even possible to generate nonsense output due to that\, which perl should never allow​:

$ bleadperl -MDevel​::Peek -wE 'Dump pack "U0v/a"\,"foo\x{1f0}"' Character(s) in 'a' format wrapped in pack at -e line 1. SV = PV(0x7f82618040a0) at 0x7f8262003b30 REFCNT = 1 FLAGS = (PADTMP\,POK\,pPOK\,UTF8) PV = 0x7f8261509950 "\4\0foo\360"\0Malformed UTF-8 character (1 byte\, need 4\, after start byte 0xf0) in Dump at -e line 1. [UTF8 "\x{4}\x{0}foo\x{0}"] CUR = 6 LEN = 10

In my opinion the output of pack should never have the UTF8 flag on.

Exactly. Never ever ever.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 10 years ago

From @ikegami

On Fri\, Jul 25\, 2014 at 7​:25 AM\, Rafael Garcia-Suarez \< rgarciasuarez@​gmail.com> wrote​:

On 16 July 2014 22​:18\, Mike L \perlbug\-followup@&#8203;perl\.org wrote​:

If you pass a UTF8 encoded string to pack() the output of pack() is now a UTF8 encoded string.

This is a serious bug. It's even possible to generate nonsense output due to that\, which perl should never allow​:

$ bleadperl -MDevel​::Peek -wE 'Dump pack "U0v/a"\,"foo\x{1f0}"' Character(s) in 'a' format wrapped in pack at -e line 1. SV = PV(0x7f82618040a0) at 0x7f8262003b30 REFCNT = 1 FLAGS = (PADTMP\,POK\,pPOK\,UTF8) PV = 0x7f8261509950 "\4\0foo\360"\0Malformed UTF-8 character (1 byte\, need 4\, after start byte 0xf0) in Dump at -e line 1. [UTF8 "\x{4}\x{0}foo\x{0}"] CUR = 6 LEN = 10

What you describe is a problem with U0 specifically. U0 is broken.

In my opinion the output of pack should never have the UTF8 flag on.

You'd break C\<\< pack("n/a*"\, "\x{2660}"); >>