Open p5pRT opened 13 years ago
Hello,
I'm using ActivePerl 5.12.2 on Windows 7.
Perl for Win32 has a feature to convert a single "LF" (without a preceding "CR") to "CRLF", but my perl seems to determine what "LF" is in UTF-16 incorrectly. In ANSI and UTF-8 files, LF is the single byte "0A"; in UTF-16, LF should be "00 0A" (big-endian) or "0A 00" (little-endian), but my perl seems to regard a lone "0A" byte as LF too! Thus she will do the wrong thing, which is adding a "0D" before every "0A" (my perl also regards a lone "0D" byte as CR; the right CR in UTF-16BE would be "00 0D").
Here's the test program:
open FH_IN, "<:encoding(utf16be)", "src.txt" or die;
open FH_OUT, ">:encoding(utf16be)", "output.txt" or die;
while (<FH_IN>) { print FH_OUT $_; }
I think "src.txt" and "output.txt" should be identical, but they are not.
1) If "src.txt" is only two CRLFs, its bytes are "FE FF 00 0D 00 0A 00 0D 00 0A"; "output.txt" becomes "FE FF 00 0D 00 0D 0A 00 0D 00 0D 0A": each "0A" gets an unnecessary and wrong preceding "0D".
2) If "src.txt" is only the one Chinese character "上", whose Unicode code point and UTF-16BE encoding are "4E 0A", then with the BOM the file's whole contents are "FE FF 4E 0A"; "output.txt" becomes "FE FF 4E 0D 0A".
Modifying the program:
while (<FH_IN>) { chomp; print FH_OUT $_; }
1) "src.txt" containing only two CRLFs, "FE FF 00 0D 00 0A 00 0D 00 0A", becomes "FE FF 00 0D 00 0D". So chomp only gets rid of the LF (00 0A); it should erase all four bytes "00 0D 00 0A".
That's what I found when operating on UTF-16 files. I'd appreciate your efforts to improve Unicode support. Many thanks!
 Joey
On Wed Dec 01 04:38:26 2010, mezmerik@gmail.com wrote:
This is a bug report for perl from mezmerik@gmail.com, generated with the help of perlbug 1.39 running under perl 5.12.2.
Hello,
I'm using ActivePerl 5.12.2 on Windows 7.
Perl for Win32 has a feature to convert a single "LF" (without a preceding "CR") to "CRLF", but my perl seems to determine what "LF" is in UTF-16 incorrectly. In ANSI and UTF-8 files, LF is the single byte "0A"; in UTF-16, LF should be "00 0A" (big-endian) or "0A 00" (little-endian), but my perl seems to regard a lone "0A" byte as LF too! Thus she will do the wrong thing, which is adding a "0D" before every "0A" (my perl also regards a lone "0D" byte as CR; the right CR in UTF-16BE would be "00 0D").
I believe this is a known problem with the way the default :crlf layer works on Win32.
Since the layer is immediately on top of the :unix layer, it's working at a byte level, adding CRs to the bytes *after* translation from characters.
This means you get other broken behaviour, such as inserting a 0d byte before characters in the U+0Axx range:
C:\Users\tony>perl -e "open my $fh, '>:encoding(utf16be)', 'foo.txt' or die $!; print $fh qq(\x{a90}hello\n)"
C:\Users\tony>perl -e "binmode STDIN; $/ = \16; while (<>) { print unpack('H*', $_), qq'\n' }" <foo.txt
0d0a9000680065006c006c006f000d0a
The workaround (or perhaps the only real fix) is to clear the :crlf layer and add it back on above your unicode layer:
C:\Users\tony>perl -e "open my $fh, '>:raw:encoding(utf16be):crlf', 'foo.txt' or die $!; print $fh qq(\x{a90}hello\n)"
C:\Users\tony>perl -e "binmode STDIN; $/ = \16; while (<>) { print unpack('H*', $_), qq'\n' }" <foo.txt
0a9000680065006c006c006f000d000a
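For readers who want to verify the workaround on their own machine, here is a self-contained sketch of the same fix (the temp file is incidental; only the layer string matters):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# :raw pops the default :crlf layer; :crlf is then pushed back *above*
# :encoding, so newline translation happens on characters before they
# are widened to UTF-16.
my ($tmp, $path) = tempfile();
close $tmp;

open my $out, '>:raw:encoding(UTF-16BE):crlf', $path or die $!;
print {$out} "A\n";
close $out or die $!;

open my $in, '<:raw', $path or die $!;
my $bytes = do { local $/; <$in> };
close $in;

# "A\n" comes out as 00 41 00 0D 00 0A: a well-formed UTF-16BE CRLF.
print unpack('H*', $bytes), "\n";
```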
The only way I can see to fix this would be to make :crlf special, so it always remains on top, but I suspect that's going to be fairly ugly from an implementation point of view - do we make other layers special too?
(/me avoids going wild with speculation)
Tony
The RT System itself - Status changed from 'new' to 'open'
On Thu, Sep 19, 2013 at 10:27:32PM -0700, Tony Cook via RT wrote:
I believe this is a known problem with the way the default :crlf layer works on Win32.
Since the layer is immediately on top of the :unix layer, it's working at a byte level, adding CRs to the bytes *after* translation from characters.
The only way I can see to fix this would be to make :crlf special, so it always remains on top, but I suspect that's going to be fairly ugly from an implementation point of view - do we make other layers special too?
I guess it's a kind of (emergent) flaw with the whole design of layers. In that layers can meaningfully be anything of
    text -> binary   (eg Unicode -> UTF-8)
    binary -> binary (eg gzip)
    binary -> text   (eg uuencode, or these days Base64)
    text -> text     (pedantically rot13)
(in terms of output) (I think I've read that Python 3 went too far the other way on this by banning all but the first)
I think on that categorisation LF -> CRLF would be text -> text.
(Certainly for output it's not binary -> text or binary -> binary, as it corrupts binary data.)
(/me avoids going wild with speculation)
So the design of layers ought to categorise their feed and not-feed* sides as text or binary, and forbid plugging the wrong sorts together.
If we had that, then attempting to push UTF-16 atop CRLF would be an error immediately. But, of course, the DWIM approach is that pushing UTF-16 onto CRLF would cause UTF-16 to burrow under CRLF, given that issuing an error of the form of "I can see what you're trying to do, but I'm not going to help you" isn't very nice.
I believe that the second half of Jarkko's quote applies:
Documenting bugs before they're found is kinda hard. Can I borrow your time machine? Mine won't start.
Failing that, we wrestle with the punchline to the "Irishman giving directions" joke, and wonder how to retrofit sanity.
Gah. It keeps coming back to this - to handle Unicode properly, you need a type system. Or at least enough of a type system to distinguish "text" from "binary".
Nicholas Clark
* feed is a word visibly distinct from input. I'm failing to find a suitable antonym for this meaning of feed.
On Fri, Sep 20, 2013 at 8:58 AM, Nicholas Clark <nick@ccl4.org> wrote:
Gah. It keeps coming back to this - to handle Unicode properly, you need a type system. Or at least enough of a type system to distinguish "text" from "binary".
I fully agree, but I don't see how that would help here. Wouldn't that prevent :crlf (a text processor) from being placed on a binary handle as Perl does?
:crlf is a special case. It would therefore make sense for :encoding to handle it specially and "burrow under" it, as you called it. This is independent of the existence of a text/binary string semantic system.
On Fri, Sep 20, 2013 at 10:18:33AM -0400, Eric Brine wrote:
On Fri, Sep 20, 2013 at 8:58 AM, Nicholas Clark <nick@ccl4.org> wrote:
Gah. It keeps coming back to this - to handle Unicode properly, you need a type system. Or at least enough of a type system to distinguish "text" from "binary".
I fully agree, but I don't see how that would help here. Wouldn't that prevent :crlf (a text processor) from being placed on a binary handle as Perl does?
Yes, it would, if handles default to binary. But I think that is then part of the mess: in a Unicode world, every platform needs to care about whether a handle is binary or text, and the old convenience of "just" opening a file, without (at that point) caring whether it's a CSV or a JPEG, goes out of the window.
:crlf is a special case. It would therefore make sense for :encoding to handle it specially and "burrow under" it, as you called it. This is independent of the existence of a text/binary string semantic system.
Problem is that it's no more of a special case than a layer that converts all Unicode line endings to LF. Or a layer that does NFD. It's not unique. That's what's bugging me.
You sort of need some sort of "apply layer" logic\, which assumes
FILE -> [binary -> binary] {0,*} -> [binary -> text] -> [text -> text] {0,*}
at which point, applying any binary->binary or text->text layer *stacks* it at the right point, and applying any binary->text layer swaps out the previous.
(And I might be missing one in that diagram - maybe FILE should be [source -> binary], which would solve the :stdio vs :unix mess)
And if you want to apply a [text -> binary], or build something funkier than the above model permits, or (for some reason) remove a layer, you use a second API which does "build the entire stack", or "splice".
Nicholas Clark
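As a rough illustration of the apply-layer rule Nicholas describes (purely a toy model, not PerlIO code; all names here are made up):

```perl
use strict;
use warnings;

# Toy model: binary->binary and text->text layers stack at their own
# level, while applying a binary->text layer swaps out the previous one.
sub apply_layer {
    my ($stack, $kind, $name) = @_;
    if    ($kind eq 'bin2bin')   { push @{ $stack->{binary} }, $name }
    elsif ($kind eq 'text2text') { push @{ $stack->{text} },   $name }
    elsif ($kind eq 'bin2text')  { $stack->{conv} = $name }    # replace, don't stack
    else                         { die "this API cannot apply a $kind layer" }
}

my $stack = { binary => ['unix'], conv => undef, text => [] };
apply_layer($stack, 'bin2text',  'encoding(UTF-8)');
apply_layer($stack, 'text2text', 'crlf');
apply_layer($stack, 'bin2text',  'encoding(UTF-16BE)');   # swaps out UTF-8

# The resulting pipeline always assembles in the fixed order.
print join(' -> ', 'FILE', @{ $stack->{binary} }, $stack->{conv}, @{ $stack->{text} }), "\n";
```

Under this rule, pushing an encoding can never land above :crlf, which is the DWIM behaviour discussed above.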
On Fri, 20 Sep 2013 15:29:16 +0100, Nicholas Clark <nick@ccl4.org> wrote:
On Fri, Sep 20, 2013 at 10:18:33AM -0400, Eric Brine wrote:
On Fri, Sep 20, 2013 at 8:58 AM, Nicholas Clark <nick@ccl4.org> wrote:
Gah. It keeps coming back to this - to handle Unicode properly, you need a type system. Or at least enough of a type system to distinguish "text" from "binary".
I fully agree, but I don't see how that would help here. Wouldn't that prevent :crlf (a text processor) from being placed on a binary handle as Perl does?
Yes, it would, if handles default to binary. But I think that is then part of the mess: in a Unicode world, every platform needs to care about whether a handle is binary or text, and the old convenience of "just" opening a file, without (at that point) caring whether it's a CSV or a JPEG, goes out of the window.
Not that it happens a lot, but in CSV there is no overall encoding. The CSV format allows you to pass every line/record in a different encoding, or even every field within a line/record. Not that that would be a sane thing to do, but the definition allows it :(
What *does* happen (all too often) is that lines are exported in CSV as iso-8859-1 when every character falls in that range, and as UTF-8 when the record contains a field with a character outside the iso range. The decoder now has to check twice for validity. These generators suck, but I have to deal with their output on a daily basis.
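A minimal sketch of that double check, assuming the common mixed UTF-8/Latin-1 case (decode_lenient is a made-up helper name, not an Encode API):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Try strict UTF-8 first; if the octets are not well-formed UTF-8, fall
# back to iso-8859-1, which accepts any byte sequence.
sub decode_lenient {
    my ($octets) = @_;
    my $copy = $octets;   # FB_CROAK may consume its input, so use a copy
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return defined $text ? $text : decode('iso-8859-1', $octets);
}

my $utf8   = decode_lenient("caf\xc3\xa9");   # valid UTF-8
my $latin1 = decode_lenient("caf\xe9");       # not UTF-8, falls back
```

Both calls come back as the same five-character string, which is exactly the "check twice" dance such generators force on the consumer.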
:crlf is a special case. It would therefore make sense for :encoding to handle it specially and "burrow under" it, as you called it. This is independent of the existence of a text/binary string semantic system.
Problem is that it's no more of a special case than a layer that converts all Unicode line endings to LF. Or a layer that does NFD. It's not unique. That's what's bugging me.
You sort of need some sort of "apply layer" logic, which assumes
FILE -> [binary -> binary] {0,*} -> [binary -> text] -> [text -> text] {0,*}
at which point, applying any binary->binary or text->text layer *stacks* it at the right point, and applying any binary->text layer swaps out the previous.
(And I might be missing one in that diagram - maybe FILE should be [source -> binary], which would solve the :stdio vs :unix mess)
And if you want to apply a [text -> binary], or build something funkier than the above model permits, or (for some reason) remove a layer, you use a second API which does "build the entire stack", or "splice".
Nicholas Clark
-- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.19 porting perl5 on HP-UX, AIX, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
Nicholas Clark wrote:
at which point, applying any binary->binary or text->text layer *stacks* it at the right point, and applying any binary->text layer swaps out the previous.
It sounds like the concept of "applying" a layer has been overloaded beyond usefulness. Inserting layers at different positions amounts to different operations, and replacing a layer (or group of layers) is different again.
-zefram
On Fri, Sep 20, 2013 at 04:00:25PM +0100, Zefram wrote:
Nicholas Clark wrote:
at which point, applying any binary->binary or text->text layer *stacks* it at the right point, and applying any binary->text layer swaps out the previous.
It sounds like the concept of "applying" a layer has been overloaded beyond usefulness. Inserting layers at different positions amounts to different operations, and replacing a layer (or group of layers) is different again.
Yes, half the time I agree with you here. It's too complex to be useful.
But it's bugging me that the only two frequent operations a programmer does are
1) State that the file is binary
2) State that the file is text in a particular encoding
with the bothersome problem that the default is text, with platform-specific line ending post-processing, which should be retained on a text file even if the encoding is changed from the default.
And that (1) and (2) above ought to be easy to do, without needing to resort to a more flexible syntax.
Nicholas Clark
On Fri Sep 20 05:59:34 2013, nicholas wrote:
I believe that the second half of Jarkko's quote applies:
Documenting bugs before they're found is kinda hard. Can I borrow your time machine? Mine won't start.
Failing that, we wrestle with the punchline to the "Irishman giving directions" joke, and wonder how to retrofit sanity.
Gah. It keeps coming back to this - to handle Unicode properly, you need a type system. Or at least enough of a type system to distinguish "text" from "binary".
I usually just work around the whole issue with explicit encode/decode. Also, where possible, I avoid UTF-16 and Windows. Life is so much simpler that way!
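A sketch of that explicit encode/decode workaround (the temp file is incidental): keep the handle in :raw and convert by hand, so no layer can second-guess which 0A bytes are newlines.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile();
close $tmp;

# Write: encode explicitly, hand raw octets to a :raw handle.
my $text = "one\ntwo\n";
open my $out, '>:raw', $path or die $!;
print {$out} encode('UTF-16BE', $text);
close $out or die $!;

# Read: slurp raw octets, decode explicitly.
open my $in, '<:raw', $path or die $!;
my $octets = do { local $/; <$in> };
close $in;
my $roundtrip = decode('UTF-16BE', $octets);

die "round trip failed" unless $roundtrip eq $text;
```

Because no :crlf layer ever sees the encoded bytes, each "\n" stays a clean 00 0A on disk on every platform.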
Nicholas Clark
* feed is a word visibly distinct from input. I'm failing to find a suitable antonym for this meaning of feed.
The opposite of feed is usually starve. But that doesn't work here. Maybe spew? Extort?
--
Father Chrysostomos
On Fri, Sep 20, 2013 at 7:27 AM, Tony Cook via RT <perlbug-followup@perl.org> wrote:
Hello,
I'm using ActivePerl 5.12.2 on Windows 7.
Perl for Win32 has a feature to convert a single "LF" (without a preceding "CR") to "CRLF", but my perl seems to determine what "LF" is in UTF-16 incorrectly. In ANSI and UTF-8 files, LF is the single byte "0A"; in UTF-16, LF should be "00 0A" (big-endian) or "0A 00" (little-endian), but my perl seems to regard a lone "0A" byte as LF too! Thus she will do the wrong thing, which is adding a "0D" before every "0A" (my perl also regards a lone "0D" byte as CR; the right CR in UTF-16BE would be "00 0D").
I believe this is a known problem with the way the default :crlf layer works on Win32.
Since the layer is immediately on top of the :unix layer, it's working at a byte level, adding CRs to the bytes *after* translation from characters.
This means you get other broken behaviour, such as inserting a 0d byte before characters in the U+0Axx range:
C:\Users\tony>perl -e "open my $fh, '>:encoding(utf16be)', 'foo.txt' or die $!; print $fh qq(\x{a90}hello\n)"
C:\Users\tony>perl -e "binmode STDIN; $/ = \16; while (<>) { print unpack('H*', $_), qq'\n' }" <foo.txt
0d0a9000680065006c006c006f000d0a
The workaround (or perhaps the only real fix) is to clear the :crlf layer and add it back on above your unicode layer:
C:\Users\tony>perl -e "open my $fh, '>:raw:encoding(utf16be):crlf', 'foo.txt' or die $!; print $fh qq(\x{a90}hello\n)"
All correct. I wrote the ":text" pseudo-layer that shortens that to ':text(utf-16be)', except that it will not do that whole dance on unix systems.
Also note that before 5.14, binmode $fh, ':raw:encoding(utf-16be):crlf' did not work correctly.
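For reference, the binmode form in question looks like this (a sketch; on perl 5.14 or later it produces proper UTF-16BE CRLFs):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($fh, $path) = tempfile();

# Apply the fixed layer stack to an already-open handle.
binmode $fh, ':raw:encoding(UTF-16BE):crlf' or die $!;
print {$fh} "hi\n";
close $fh or die $!;

open my $in, '<:raw', $path or die $!;
my $bytes = do { local $/; <$in> };
close $in;

# "hi\n" encodes as 00 68 00 69 00 0D 00 0A.
print unpack('H*', $bytes), "\n";
```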
The only way I can see to fix this would be to make :crlf special, so it always remains on top, but I suspect that's going to be fairly ugly from an implementation point of view - do we make other layers special too?
(/me avoids going wild with speculation)
The way to correct this is to make open be sensible. That is not a trivial problem.
Leon
On 20 September 2013 16:24, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:
* feed is a word visibly distinct from input. I'm failing to find a suitable antonym for this meaning of feed.
The opposite of feed is usually starve. But that doesn't work here. Maybe spew? Extort?
drain?
* Nicholas Clark <nick@ccl4.org> [2013-09-20 15:00]:
I guess it's a kind of (emergent) flaw with the whole design of layers. In that layers can meaningfully be anything of
    text -> binary   (eg Unicode -> UTF-8)
    binary -> binary (eg gzip)
    binary -> text   (eg uuencode, or these days Base64)
    text -> text     (pedantically rot13)
(in terms of output) (I think I've read that Python 3 went too far the other way on this by banning all but the first)
I think on that categorisation LF -> CRLF would be text -> text.
(Certainly for output it's not binary -> text or binary -> binary, as it corrupts binary data.)
So the design of layers ought to categorise their feed and not-feed* sides as text or binary, and forbid plugging the wrong sorts together.
If we had that, then attempting to push UTF-16 atop CRLF would be an error immediately. But, of course, the DWIM approach is that pushing UTF-16 onto CRLF would cause UTF-16 to burrow under CRLF, given that issuing an error of the form of "I can see what you're trying to do, but I'm not going to help you" isn't very nice.
I believe that the second half of Jarkko's quote applies:
Documenting bugs before they're found is kinda hard. Can I borrow your time machine? Mine won't start.
Failing that, we wrestle with the punchline to the "Irishman giving directions" joke, and wonder how to retrofit sanity.
Do we want to provide first-class support for layer cakes like this?
(text → binary) (binary → text) (text → binary)
Because if we don't, then the solution would seem to be very easy, at least at the conceptual level: make each handle have two stacks, one for (text → text) layers and another one for (binary → binary), plus a single slot for (text → binary). And then you *set* this single slot (no pushing/popping there), plus you push/pop the other types of layers on their respective stacks.
In that design, we also have a (text → binary) layer (named "derp"? :-)) whose output direction implements the current behaviour of `print` and friends, wherein they try to downgrade a string for output but warn and output the utf8 buffer as bytes if they can't.
(As far as I can see there is no reason to have (binary → text) layers if layers can go in both directions depending on whether they're applied to input or output (as is currently the case - you push :encoding(UTF-8) no matter whether it's an input or output handle).)
"Derp" is then the default (text → binary) layer for handles on which nothing else has been set. This solves the question of "how do I set :crlf on an otherwise unconfigured handle if layers are typed?"
Note that this solves the problem with pushing UTF-16 onto CRLF, because in this design you don't do that - you *set* UTF-16 for the conversion slot, and in any case the CRLF layer is in a stack by itself, so if you push any (binary → binary) layers, they will push "under" the CRLF layer automatically. So by this design PerlIO will DTRT automatically.
I'd suggest a migration in which (text → text), (binary → binary) and (text → binary) layers all move into different namespaces, so that once completed it becomes impossible to even *say* the wrong thing.
Note that even layer cakes can be supported as a second-class construct, by providing a reverse-direction pseudo-layer that implements a nested layer stack, which you can then push onto a layer stack as a unit.
(This even neatly solves the question of how code is supposed to keep track of the relative ordering of layers in really complex situations. If you build layer cakes out of (possibly recursively) nested stacks, then each participating bit of code only needs to care about the nested stack it is managing itself, and by virtue of the fixed order of layers within a handle or a pseudo-layer, the overall resulting pipeline is guaranteed to assemble into something that makes sense.)
Now - how we turn what we have now into the system I outlined is quite another matter…
… or maybe it ain't? How hard do the PerlIO people here think this might be? (Leon?)
* feed is a word visibly distinct from input. I'm failing to find a suitable antonym for this meaning of feed.
Spew? :-)
-- No trees were harmed in the transmission of this email but trillions of electrons were excited to participate.
* Robert May <robertmay@cpan.org> [2013-09-20 17:45]:
On 20 September 2013 16:24, Father Chrysostomos wrote:
The opposite of feed is usually starve. But that doesn't work here. Maybe spew? Extort?
drain?
Ah, y'all jogged my memory. Nicholas is looking for "source" and "sink", I think - cf. <https://en.wikipedia.org/wiki/Sink_%28computing%29>.
-- Aristotle Pagaltzis // <http://plasmasturm.org/>
On Fri, Sep 20, 2013 at 2:58 PM, Nicholas Clark <nick@ccl4.org> wrote:
On Thu, Sep 19, 2013 at 10:27:32PM -0700, Tony Cook via RT wrote:
I believe this is a known problem with the way the default :crlf layer works on Win32.
Since the layer is immediately on top of the :unix layer, it's working at a byte level, adding CRs to the bytes *after* translation from characters.
The only way I can see to fix this would be to make :crlf special, so it always remains on top, but I suspect that's going to be fairly ugly from an implementation point of view - do we make other layers special too?
I guess it's a kind of (emergent) flaw with the whole design of layers. In that layers can meaningfully be anything of
    text -> binary   (eg Unicode -> UTF-8)
    binary -> binary (eg gzip)
    binary -> text   (eg uuencode, or these days Base64)
    text -> text     (pedantically rot13)
(in terms of output) (I think I've read that Python 3 went too far the other way on this by banning all but the first)
I think on that categorisation LF -> CRLF would be text -> text.
(Certainly for output it's not binary -> text or binary -> binary, as it corrupts binary data.)
Except that PerlIO internally doesn't work in terms of text or binary, but in terms of latin-1/binary octets versus utf8 octets :-/.
So for example, if you opened with ":encoding(utf-16be):bytes", you'd read UTF-16 converted to UTF-8 and then interpreted as Latin-1. Obviously not something someone would deliberately do, but the fact that you can do it accidentally is bad enough.
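That accident can be simulated outside PerlIO with Encode (a sketch of the same misinterpretation, using a Euro sign as the sample character):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One Euro sign survives the UTF-16BE round trip, but re-reading its
# UTF-8 octets as Latin-1 explodes it into three mojibake characters.
my $chars   = decode('UTF-16BE', encode('UTF-16BE', "\x{20AC}"));
my $mangled = decode('iso-8859-1', encode('UTF-8', $chars));

printf "%d character(s): %s\n", length $mangled,
    join ' ', map { sprintf 'U+%04X', ord } split //, $mangled;
```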
Another dimension of this is that some layers only make sense at the bottom (e.g. :unix), and others only above such a bottom layer (e.g. most layers).
So the design of layers ought to categorise their feed and not-feed* sides as text or binary, and forbid plugging the wrong sorts together.
If we had that, then attempting to push UTF-16 atop CRLF would be an error immediately. But, of course, the DWIM approach is that pushing UTF-16 onto CRLF would cause UTF-16 to burrow under CRLF, given that issuing an error of the form of "I can see what you're trying to do, but I'm not going to help you" isn't very nice.
Yes\, absolutely!
I believe that the second half of Jarkko's quote applies:
Documenting bugs before they're found is kinda hard. Can I borrow your time machine? Mine won't start.
Failing that, we wrestle with the punchline to the "Irishman giving directions" joke, and wonder how to retrofit sanity.
Gah. It keeps coming back to this - to handle Unicode properly, you need a type system. Or at least enough of a type system to distinguish "text" from "binary".
I'm wondering if it's really too late, given how much brokenness there is in this area…
* feed is a word visibly distinct from input. I'm failing to find a suitable antonym for this meaning of feed.
I prefer to just call them top and bottom.
Leon
On Fri, Sep 20, 2013 at 6:26 PM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
Ah, y'all jogged my memory. Nicholas is looking for "source" and "sink", I think - cf. <https://en.wikipedia.org/wiki/Sink_%28computing%29>.
IO goes in both directions; so one side will be the source for input but the sink for output and vice versa.
Source and sink would be *terribly* confusing terms.
Leon
Migrated from rt.perl.org#80058 (status was 'open')
Searchable as RT80058$