Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

[Bug Report] Bad \n convert, using UTF-16 on Win32 #10869

Open p5pRT opened 13 years ago

p5pRT commented 13 years ago

Migrated from rt.perl.org#80058 (status was 'open')

Searchable as RT80058$

p5pRT commented 13 years ago

From mezmerik@gmail.com

Created by mezmerik@gmail.com

Hello,

I'm using ActivePerl 5.12.2 on Windows 7.

Perl for Win32 has a feature to convert a single "LF" (without a preceding "CR") to "CRLF", but my perl seems to determine what "LF" is in UTF-16 incorrectly. In ANSI and UTF-8 files, LF's byte code is "0A"; in UTF-16, LF should be "00 0A" (Big Endian) or "0A 00" (Little Endian), but my perl seems to regard a single "0A" as LF too! Thus, it will do the wrong thing, which is adding a "0D" before the "0A" (my perl also regards a single "0D" as CR; the right CR in UTF-16 should be "00 0D").

Here's the test program:

    open FH_IN,  "<:encoding(utf16be)", "src.txt"    or die;
    open FH_OUT, ">:encoding(utf16be)", "output.txt" or die;

    while (<FH_IN>) {
        print FH_OUT $_;
    }

I think "src.txt" and "output.txt" should be identical, but they are not.

1) if "src.txt" is only two CRLFs\, its bytecodes are "FE FF 00 0D 00 0A 00 0D 00 0A"; the "output.txt" becomes "FE FF 00 0D 00 0D 0A 00 0D 00 0D 0A"\, each "0A" gets a unnecessary and wrong preceding "0D".

2) if "src.txt" is only one chinese charater "上"\, whose unicode and UTF-16BE bytecode is "4E 0A"\, with BOM\, the file's whole bytes are "FE FF 4E 0A"; the "output.txt" becomes "FE FF 4E 0D 0A".

Modify the program code:

    while (<FH_IN>) {
        chomp;
        print FH_OUT $_;
    }

1) With "src.txt" again being only two CRLFs, "FE FF 00 0D 00 0A 00 0D 00 0A", the output becomes "FE FF 00 0D 00 0D". So chomp only got rid of the LF (00 0A); it should have erased the 4 bytes "00 0D 00 0A".

That's what I found when operating on UTF-16 files. I'd appreciate your efforts to improve Unicode support. Many thanks!

 Joey

Perl Info

```
Flags:
    category=core
    severity=low

Site configuration information for perl 5.12.2:

Configured by SYSTEM at Mon Sep  6 23:12:49 2010.

Summary of my perl5 (revision 5 version 12 subversion 2) configuration:

  Platform:
    osname=MSWin32, osvers=5.00, archname=MSWin32-x86-multi-thread
    uname=''
    config_args='undef'
    hint=recommended, useposix=true, d_sigaction=undef
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cl', ccflags ='-nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32 -D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -D_USE_32BIT_TIME_T -DPERL_MSVCRT_READFIX',
    optimize='-MD -Zi -DNDEBUG -O1',
    cppflags='-DWIN32'
    ccversion='12.00.8804', gccversion='', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=8
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='__int64', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='link', ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf -libpath:"C:\Perl\lib\CORE"  -machine:x86'
    libpth=\lib
    libs=  oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib comctl32.lib msvcrt.lib
    perllibs=  oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib comctl32.lib msvcrt.lib
    libc=msvcrt.lib, so=dll, useshrplib=true, libperl=perl512.lib
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug -opt:ref,icf -libpath:"C:\Perl\lib\CORE"  -machine:x86'

Locally applied patches:
    ACTIVEPERL_LOCAL_PATCHES_ENTRY
    1fd8fa4 Add Wolfram Humann to AUTHORS
    f120055 make string-append on win32 100 times faster
    a2a8d15 Define _USE_32BIT_TIME_T for VC6 and VC7
    007cfe1 Don't pretend to support really old VC++ compilers
    6d8f7c9 Get rid of obsolete PerlCRT.dll support
    d956618 Make Term::ReadLine::findConsole fall back to STDIN if /dev/tty can't be opened
    321e50c Escape patch strings before embedding them in patchlevel.h

@INC for perl 5.12.2:
    C:/Perl/site/lib
    C:/Perl/lib
    .

Environment for perl 5.12.2:
    HOME (unset)
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=C:\Program Files\ActiveState Komodo IDE 6\;C:\Perl\site\bin;C:\Perl\bin;C:\Program Files\NVIDIA Corporation\PhysX\Common;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\
    PERL_BADLANG (unset)
    SHELL (unset)
```
p5pRT commented 13 years ago

From mezmerik@gmail.com

þÿ

p5pRT commented 13 years ago

From mezmerik@gmail.com

þÿN

p5pRT commented 10 years ago

From @tonycoz

On Wed Dec 01 04:38:26 2010, mezmerik@gmail.com wrote:

This is a bug report for perl from mezmerik@gmail.com, generated with the help of perlbug 1.39 running under perl 5.12.2.

Hello,

I'm using ActivePerl 5.12.2 on Windows 7.

Perl for Win32 has a feature to convert a single "LF" (without a preceding "CR") to "CRLF", but my perl seems to determine what "LF" is in UTF-16 incorrectly. In ANSI and UTF-8 files, LF's byte code is "0A"; in UTF-16, LF should be "00 0A" (Big Endian) or "0A 00" (Little Endian), but my perl seems to regard a single "0A" as LF too! Thus, it will do the wrong thing, which is adding a "0D" before the "0A" (my perl also regards a single "0D" as CR; the right CR in UTF-16 should be "00 0D").

I believe this is a known problem with the way the default :crlf layer works on Win32.

Since the layer is immediately on top of the :unix layer, it's working at a byte level, adding CRs to the bytes *after* translation from characters.

This means you get other broken behaviour, such as inserting a 0d byte before characters in the U+0Axx range:

    C:\Users\tony>perl -e "open my $fh, '>:encoding(utf16be)', 'foo.txt' or die $!; print $fh qq(\x{a90}hello\n)"

    C:\Users\tony>perl -e "binmode STDIN; $/ = \16; while (<>) { print unpack('H*', $_), qq'\n' }" <foo.txt
    0d0a9000680065006c006c006f000d0a

The workaround (or perhaps the only real fix) is to clear the :crlf layer and add it back on above your unicode layer:

    C:\Users\tony>perl -e "open my $fh, '>:raw:encoding(utf16be):crlf', 'foo.txt' or die $!; print $fh qq(\x{a90}hello\n)"

    C:\Users\tony>perl -e "binmode STDIN; $/ = \16; while (<>) { print unpack('H*', $_), qq'\n' }" <foo.txt
    0a9000680065006c006c006f000d000a
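Applied to the test program from the report, that workaround would look roughly like this (a sketch only; file names as in the report):

```perl
# Pop the default :crlf with :raw, push the encoding, then put :crlf back on
# top so CRLF translation happens on characters rather than on encoded bytes.
use strict;
use warnings;

open my $in,  '<:raw:encoding(UTF-16BE):crlf', 'src.txt'    or die $!;
open my $out, '>:raw:encoding(UTF-16BE):crlf', 'output.txt' or die $!;

while (my $line = <$in>) {
    print {$out} $line;
}
```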

The only way I can see to fix this would be to make :crlf special, so it always remains on top, but I suspect that's going to be fairly ugly from an implementation point of view - do we make other layers special too?

(/me avoids going wild with speculation)

Tony

p5pRT commented 10 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 10 years ago

From @nwc10

On Thu, Sep 19, 2013 at 10:27:32PM -0700, Tony Cook via RT wrote:

I believe this is a known problem with the way the default :crlf layer works on Win32.

Since the layer is immediately on top of the :unix layer, it's working at a byte level, adding CRs to the bytes *after* translation from characters.

The only way I can see to fix this would be to make :crlf special, so it always remains on top, but I suspect that's going to be fairly ugly from an implementation point of view - do we make other layers special too?

I guess it's a kind of (emergent) flaw with the whole design of layers. In that layers can meaningfully be any of

    text -> binary    (eg Unicode -> UTF-8)
    binary -> binary  (eg gzip)
    binary -> text    (eg uuencode, or these days Base64)
    text -> text      (pedantically rot13)

(in terms of output). (I think I've read that Python 3 went too far the other way on this by banning all but the first.)

I think on that categorisation LF -> CRLF would be text -> text.

(Certainly for output it's not binary -> text or binary -> binary, as it corrupts binary data.)

(/me avoids going wild with speculation)

So the design of layers ought to categorise their feed and not-feed* sides as text or binary, and forbid plugging the wrong sorts together.

If we had that, then attempting to push UTF-16 atop CRLF would be an error immediately. But, of course, the DWIM approach is that pushing UTF-16 onto CRLF would cause UTF-16 to burrow under CRLF, given that issuing an error of the form of "I can see what you're trying to do, but I'm not going to help you" isn't very nice.
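For what it's worth, the stack order that open() actually builds can be inspected with PerlIO::get_layers (illustrative; the exact layer names reported vary by platform and build):

```perl
use strict;
use warnings;

open my $fh, '>:encoding(UTF-16BE)', 'foo.txt' or die $!;
print join(' ', PerlIO::get_layers($fh)), "\n";
# On Win32 this reports something like: unix crlf encoding(utf-16be)
# i.e. :encoding sits *above* :crlf, so the CRLF translation operates on
# already-encoded bytes underneath it.
```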

I believe that the second half of Jarkko's quote applies:

    Documenting bugs before they're found is kinda hard.
    Can I borrow your time machine? Mine won't start.

Failing that, we wrestle with the punchline to the "Irishman giving directions" joke, and wonder how to retrofit sanity.

Gah. It keeps coming back to this - to handle Unicode properly, you need a type system. Or at least enough of a type system to distinguish "text" from "binary".

Nicholas Clark

* feed is a word visibly distinct from input. I'm failing to find a suitable antonym for this meaning of feed.

p5pRT commented 10 years ago

From @ikegami

On Fri, Sep 20, 2013 at 8:58 AM, Nicholas Clark <nick@ccl4.org> wrote:

Gah. It keeps coming back to this - to handle Unicode properly, you need a type system. Or at least enough of a type system to distinguish "text" from "binary".

I fully agree, but I don't see how that would help here. Wouldn't that prevent :crlf (a text processor) from being placed on a binary handle as Perl does?

:crlf is a special case. It would therefore make sense for :encoding to handle it specially and "burrow under" it, as you called it. This is independent of the existence of a text/binary string semantic system.

p5pRT commented 10 years ago

From @nwc10

On Fri, Sep 20, 2013 at 10:18:33AM -0400, Eric Brine wrote:

On Fri, Sep 20, 2013 at 8:58 AM, Nicholas Clark <nick@ccl4.org> wrote:

Gah. It keeps coming back to this - to handle Unicode properly, you need a type system. Or at least enough of a type system to distinguish "text" from "binary".

I fully agree, but I don't see how that would help here. Wouldn't that prevent :crlf (a text processor) from being placed on a binary handle as Perl does?

Yes, it would, if handles default to binary. But I think that then that's part of the mess. In that in a Unicode world, every platform needs to care about whether a handle is binary or text. And the old convenience of "just" opening a file, without (at that point) caring whether it's a CSV or a JPEG, goes out of the window.

:crlf is a special case. It would therefore make sense for :encoding to handle it specially and "burrow under" it, as you called it. This is independent of the existence of a text/binary string semantic system.

Problem is that it's no more of a special case than a layer that converts all Unicode line endings to LF. Or a layer that does NFD. It's not unique. That's what's bugging me.

You sort of need some sort of "apply layer" logic, which assumes

    FILE -> [binary -> binary] {0,*} -> [binary -> text] -> [text -> text] {0,*}

at which point, applying any binary->binary or text->text layer *stacks* it at the right point, and applying any binary->text layer swaps out the previous one.

(And I might be missing one in that diagram - maybe FILE should be [source -> binary], which would solve the :stdio vs :unix mess.)

And if you want to apply a [text -> binary] layer, or build something more funky than the above model permits, or (for some reason) remove a layer, you use a second API which does "build the entire stack", or "splice".

Nicholas Clark

p5pRT commented 10 years ago

From @tux

On Fri, 20 Sep 2013 15:29:16 +0100, Nicholas Clark <nick@ccl4.org> wrote:

On Fri, Sep 20, 2013 at 10:18:33AM -0400, Eric Brine wrote:

On Fri, Sep 20, 2013 at 8:58 AM, Nicholas Clark <nick@ccl4.org> wrote:

Gah. It keeps coming back to this - to handle Unicode properly, you need a type system. Or at least enough of a type system to distinguish "text" from "binary".

I fully agree, but I don't see how that would help here. Wouldn't that prevent :crlf (a text processor) from being placed on a binary handle as Perl does?

Yes, it would, if handles default to binary. But I think that then that's part of the mess. In that in a Unicode world, every platform needs to care about whether a handle is binary or text. And the old convenience of "just" opening a file, without (at that point) caring whether it's a CSV or a JPEG, goes out of the window.

Not that it happens a lot, but in CSV there is no overall encoding. The CSV format allows you to pass every line/record in a different encoding, or even every field within a line/record. Not that that would be a sane thing to do, but the definition allows it :(

What *does* happen (quite too often) is that the lines are exported in CSV as iso-8859-1 when every character falls in that range, and as UTF-8 when the record contains a field with a character outside of the iso range. The decoder now has to check twice for validity. These generators suck, but I have to deal with their output on a daily basis.
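The usual coping strategy for that kind of mixed output is to try a strict UTF-8 decode first and fall back to Latin-1 (a minimal sketch using the core Encode module; not part of the original mail):

```perl
use strict;
use warnings;
use Encode ();

# Decode a record that may be UTF-8 or ISO-8859-1, as produced by the sloppy
# generators described above. Try strict UTF-8; on failure assume Latin-1.
sub decode_lenient {
    my ($octets) = @_;
    my $text = eval { Encode::decode('UTF-8', my $copy = $octets, Encode::FB_CROAK) };
    return defined $text ? $text : Encode::decode('ISO-8859-1', $octets);
}
```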

:crlf is a special case. It would therefore make sense for :encoding to handle it specially and "burrow under" it, as you called it. This is independent of the existence of a text/binary string semantic system.

Problem is that it's no more of a special case than a layer that converts all Unicode line endings to LF. Or a layer that does NFD. It's not unique. That's what's bugging me.

You sort of need some sort of "apply layer" logic, which assumes

    FILE -> [binary -> binary] {0,*} -> [binary -> text] -> [text -> text] {0,*}

at which point, applying any binary->binary or text->text layer *stacks* it at the right point, and applying any binary->text layer swaps out the previous one.

(And I might be missing one in that diagram - maybe FILE should be [source -> binary], which would solve the :stdio vs :unix mess.)

And if you want to apply a [text -> binary] layer, or build something more funky than the above model permits, or (for some reason) remove a layer, you use a second API which does "build the entire stack", or "splice".

Nicholas Clark

--
H.Merijn Brand  http://tux.nl  Perl Monger  http://amsterdam.pm.org/
using perl5.00307 .. 5.19  porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/  http://www.test-smoke.org/
http://qa.perl.org  http://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented 10 years ago

From zefram@fysh.org

Nicholas Clark wrote:

at which point, applying any binary->binary or text->text layer *stacks* it at the right point, and applying any binary->text layer swaps out the previous one.

It sounds like the concept of "applying" a layer has been overloaded beyond usefulness. Inserting a layer at different positions amounts to different operations, and replacing a layer (or group of layers) is different again.

-zefram

p5pRT commented 10 years ago

From @nwc10

On Fri, Sep 20, 2013 at 04:00:25PM +0100, Zefram wrote:

Nicholas Clark wrote:

at which point, applying any binary->binary or text->text layer *stacks* it at the right point, and applying any binary->text layer swaps out the previous one.

It sounds like the concept of "applying" a layer has been overloaded beyond usefulness. Inserting a layer at different positions amounts to different operations, and replacing a layer (or group of layers) is different again.

Yes, half the time I agree with you here. It's too complex to be useful.

But it's bugging me that the only two frequent operations a programmer does are

1) state that the file is binary
2) state that the file is text in a particular encoding

with the bothersome problem that the default is text, with platform-specific line-ending post-processing, which should be retained on a text file even if the encoding is changed from the default.

And that (1) and (2) above ought to be easy to do, without needing to resort to a more flexible syntax.
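In terms of today's open(), those two operations come down to something like this (a sketch; file names are placeholders):

```perl
# (1) the file is binary: no CRLF translation, no decoding
open my $bin, '<:raw', 'image.jpg' or die $!;

# (2) the file is text in a particular encoding, keeping the platform's
#     line-ending handling on top of the decoding layer
open my $txt, '<:raw:encoding(UTF-16BE):crlf', 'src.txt' or die $!;
```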

Nicholas Clark

p5pRT commented 10 years ago

From @cpansprout

On Fri Sep 20 05:59:34 2013, nicholas wrote:

I believe that the second half of Jarkko's quote applies:

    Documenting bugs before they're found is kinda hard.
    Can I borrow your time machine? Mine won't start.

Failing that, we wrestle with the punchline to the "Irishman giving directions" joke, and wonder how to retrofit sanity.

Gah. It keeps coming back to this - to handle Unicode properly, you need a type system. Or at least enough of a type system to distinguish "text" from "binary".

I usually just work around the whole issue with explicit encode/decode. Also, where possible, I avoid UTF-16 and Windows. Life is so much simpler that way!
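Spelled out, that explicit encode/decode pattern looks something like this (a sketch, not code from the thread; it sidesteps PerlIO layers entirely):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Read raw bytes, decode by hand, and normalise line endings ourselves.
open my $in, '<:raw', 'src.txt' or die $!;
my $text = decode('UTF-16BE', do { local $/; <$in> });
$text =~ s/\r\n/\n/g;

# Encode by hand on the way out, re-adding CRLF explicitly if wanted.
$text =~ s/\n/\r\n/g;
open my $out, '>:raw', 'output.txt' or die $!;
print {$out} encode('UTF-16BE', $text);
```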

Nicholas Clark

* feed is a word visibly distinct from input. I'm failing to find a suitable antonym for this meaning of feed.

The opposite of feed is usually starve. But that doesn’t work here. Maybe spew? Extort?

--

Father Chrysostomos

p5pRT commented 10 years ago

From @Leont

On Fri, Sep 20, 2013 at 7:27 AM, Tony Cook via RT <perlbug-followup@perl.org> wrote:

Hello,

I'm using ActivePerl 5.12.2 on Windows 7.

Perl for Win32 has a feature to convert a single "LF" (without a preceding "CR") to "CRLF", but my perl seems to determine what "LF" is in UTF-16 incorrectly. In ANSI and UTF-8 files, LF's byte code is "0A"; in UTF-16, LF should be "00 0A" (Big Endian) or "0A 00" (Little Endian), but my perl seems to regard a single "0A" as LF too! Thus, it will do the wrong thing, which is adding a "0D" before the "0A" (my perl also regards a single "0D" as CR; the right CR in UTF-16 should be "00 0D").

I believe this is a known problem with the way the default :crlf layer works on Win32.

Since the layer is immediately on top of the :unix layer, it's working at a byte level, adding CRs to the bytes *after* translation from characters.

This means you get other broken behaviour, such as inserting a 0d byte before characters in the U+0Axx range:

    C:\Users\tony>perl -e "open my $fh, '>:encoding(utf16be)', 'foo.txt' or die $!; print $fh qq(\x{a90}hello\n)"

    C:\Users\tony>perl -e "binmode STDIN; $/ = \16; while (<>) { print unpack('H*', $_), qq'\n' }" <foo.txt
    0d0a9000680065006c006c006f000d0a

The workaround (or perhaps the only real fix) is to clear the :crlf layer and add it back on above your unicode layer:

    C:\Users\tony>perl -e "open my $fh, '>:raw:encoding(utf16be):crlf', 'foo.txt' or die $!; print $fh qq(\x{a90}hello\n)"

All correct. I wrote the ":text" pseudo-layer that shortens that to ':text(utf-16be)', except that it will not do that whole dance on unix systems.

Also note that before 5.14, binmode $fh, ':raw:encoding(utf-16be):crlf' did not work correctly.
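Where code still has to run on perls older than 5.14, a crude guard along these lines is one option (a sketch and an assumption on my part, not from the thread; the fallback simply omits the re-pushed :crlf and so writes bare LFs rather than risking corrupted output):

```perl
# Only re-push :crlf above :encoding on perls where that is known to work.
my $layers = $] >= 5.014
    ? ':raw:encoding(UTF-16BE):crlf'
    : ':raw:encoding(UTF-16BE)';      # older perls: no CRLF translation at all
open my $fh, ">$layers", 'output.txt' or die $!;
```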

The only way I can see to fix this would be to make :crlf special, so it always remains on top, but I suspect that's going to be fairly ugly from an implementation point of view - do we make other layers special too?

(/me avoids going wild with speculation)

The way to correct this is to make open be sensible. That is not a trivial problem.

Leon

p5pRT commented 10 years ago

From robertmay@cpan.org

On 20 September 2013 16:24, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:

* feed is a word visibly distinct from input. I'm failing to find a suitable antonym for this meaning of feed.

The opposite of feed is usually starve. But that doesn’t work here. Maybe spew? Extort?

drain?

p5pRT commented 10 years ago

From @ap

* Nicholas Clark <nick@ccl4.org> [2013-09-20 15:00]:

I guess it's a kind of (emergent) flaw with the whole design of layers. In that layers can meaningfully be any of

    text -> binary    (eg Unicode -> UTF-8)
    binary -> binary  (eg gzip)
    binary -> text    (eg uuencode, or these days Base64)
    text -> text      (pedantically rot13)

(in terms of output). (I think I've read that Python 3 went too far the other way on this by banning all but the first.)

I think on that categorisation LF -> CRLF would be text -> text.

(Certainly for output it's not binary -> text or binary -> binary, as it corrupts binary data.)

So the design of layers ought to categorise their feed and not-feed* sides as text or binary, and forbid plugging the wrong sorts together.

If we had that, then attempting to push UTF-16 atop CRLF would be an error immediately. But, of course, the DWIM approach is that pushing UTF-16 onto CRLF would cause UTF-16 to burrow under CRLF, given that issuing an error of the form of "I can see what you're trying to do, but I'm not going to help you" isn't very nice.

I believe that the second half of Jarkko's quote applies:

    Documenting bugs before they're found is kinda hard.
    Can I borrow your time machine? Mine won't start.

Failing that, we wrestle with the punchline to the "Irishman giving directions" joke, and wonder how to retrofit sanity.

Do we want to provide first-class support for layer cakes like this?

  (text → binary) (binary → text) (text → binary)

Because if we don’t, then the solution would seem to be very easy, at least at the conceptual level: make each handle have two stacks, one for (text → text) layers and another one for (binary → binary), plus a single slot for (text ↔ binary). And then you *set* this single slot (no pushing/popping there), plus you push/pop the other types of layers on their respective stacks.

In that design, we also have a (text ↔ binary) layer (named “derp”? :-)) whose output direction implements the current behaviour of `print` and friends, wherein they try to downgrade a string for output but warn and output the utf8 buffer as bytes if they can’t.

(As far as I can see there is no reason to have (binary → text) layers if layers can go in both directions depending on whether they’re applied to input or output (as is currently the case – you push :encoding(UTF-8) no matter whether it’s an input or output handle).)

“Derp” is then the default (text ↔ binary) layer for handles on which nothing else has been set. This solves the question of “how do I set :crlf on an otherwise unconfigured handle if layers are typed?”

Note that this solves the problem with pushing UTF-16 onto CRLF, because in this design you don’t do that – you *set* UTF-16 for the conversion slot, and in any case the CRLF layer is in a stack by itself, so if you push any (binary → binary) layers, they will push “under” the CRLF layer automatically. So by this design PerlIO will DTRT automatically.

I’d suggest a migration in which (text → text), (binary → binary) and (text ↔ binary) layers all move into different namespaces, so that once completed it becomes impossible to even *say* the wrong thing.

Note that even layer cakes can be supported as a second-class construct, by providing a reverse-direction pseudo-layer that implements a nested layer stack, which you can then push onto a layer stack as a unit.

(This even neatly solves the question of how code is supposed to keep track of the relative ordering of layers in really complex situations. If you build layer cakes out of (possibly recursively) nested stacks, then each participating bit of code only needs to care about the nested stack it is managing itself, and by virtue of the fixed order of layers within a handle or a pseudo-layer, the overall resulting pipeline is guaranteed to assemble into something that makes sense.)
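Purely to make the shape of that proposal concrete, the per-handle state might be pictured like this (all layer names and fields here are hypothetical; nothing like this exists in PerlIO today):

```perl
# Hypothetical data model for the proposal: one slot for the text <-> binary
# conversion, plus separate stacks for the two kinds of typed layers.
my %handle = (
    binary_stack => [ ':gzip' ],            # binary -> binary, nearest the file
    conversion   => ':encoding(UTF-16BE)',  # the single text <-> binary slot (set, not pushed)
    text_stack   => [ ':crlf' ],            # text -> text, nearest the program
);

# Pushing a typed layer always lands on its own stack, so :crlf can never end
# up underneath the encoding by accident.
push @{ $handle{text_stack} }, ':nfd';      # hypothetical text -> text layer
```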

Now – how we turn what we have now into the system I outlined is quite another matter…

… or maybe it ain’t? How hard do the PerlIO people here think this might be? (Leon?)

* feed is a word visibly distinct from input. I'm failing to find a suitable antonym for this meaning of feed.

Spew? :-)

-- No trees were harmed in the transmission of this email but trillions of electrons were excited to participate.

p5pRT commented 10 years ago

From @ap

* Robert May <robertmay@cpan.org> [2013-09-20 17:45]:

On 20 September 2013 16:24, Father Chrysostomos wrote:

The opposite of feed is usually starve. But that doesn’t work here. Maybe spew? Extort?

drain?

Ah, y’all jogged my memory. Nicholas is looking for “source” and “sink” I think – cf. https://en.wikipedia.org/wiki/Sink_%28computing%29.

-- Aristotle Pagaltzis // <http://plasmasturm.org/>

p5pRT commented 10 years ago

From @Leont

On Fri, Sep 20, 2013 at 2:58 PM, Nicholas Clark <nick@ccl4.org> wrote:

On Thu, Sep 19, 2013 at 10:27:32PM -0700, Tony Cook via RT wrote:

I believe this is a known problem with the way the default :crlf layer works on Win32.

Since the layer is immediately on top of the :unix layer, it's working at a byte level, adding CRs to the bytes *after* translation from characters.

The only way I can see to fix this would be to make :crlf special, so it always remains on top, but I suspect that's going to be fairly ugly from an implementation point of view - do we make other layers special too?

I guess it's a kind of (emergent) flaw with the whole design of layers. In that layers can meaningfully be any of

    text -> binary    (eg Unicode -> UTF-8)
    binary -> binary  (eg gzip)
    binary -> text    (eg uuencode, or these days Base64)
    text -> text      (pedantically rot13)

(in terms of output). (I think I've read that Python 3 went too far the other way on this by banning all but the first.)

I think on that categorisation LF -> CRLF would be text -> text.

(Certainly for output it's not binary -> text or binary -> binary, as it corrupts binary data.)

Except that PerlIO internally doesn't work in terms of text or binary, but in terms of latin-1/binary octets versus utf8 octets :-/.

So, for example, you could open with ":encoding(utf-16be):bytes": you'd read UTF-16 converted to UTF-8 and then interpreted as Latin-1. Obviously not something someone would deliberately do, but the fact that you can do it accidentally is bad enough.
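A sketch of that accident (assuming a UTF-16BE input file; the string you get back is the internal UTF-8 octets reinterpreted as Latin-1):

```perl
use strict;
use warnings;

# Write "é" (U+00E9) as UTF-16BE: the two bytes 00 E9.
open my $out, '>:raw', 'accident.txt' or die $!;
print {$out} "\x00\xE9";
close $out;

# :encoding decodes to Perl's internal UTF-8; :bytes then drops the "this is
# text" flag, so the UTF-8 octets show through as Latin-1 characters.
open my $in, '<:encoding(UTF-16BE):bytes', 'accident.txt' or die $!;
my $str = <$in>;
printf "length=%d, bytes=%vX\n", length $str, $str;   # length=2, bytes=C3.A9
```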

Another dimension of this is that some layers only make sense at the bottom (e.g. :unix), and others only above such a bottom layer (e.g. most layers).

So the design of layers ought to categorise their feed and not-feed* sides as text or binary, and forbid plugging the wrong sorts together.

If we had that, then attempting to push UTF-16 atop CRLF would be an error immediately. But, of course, the DWIM approach is that pushing UTF-16 onto CRLF would cause UTF-16 to burrow under CRLF, given that issuing an error of the form of "I can see what you're trying to do, but I'm not going to help you" isn't very nice.

Yes, absolutely!

I believe that the second half of Jarkko's quote applies:

    Documenting bugs before they're found is kinda hard.
    Can I borrow your time machine? Mine won't start.

Failing that, we wrestle with the punchline to the "Irishman giving directions" joke, and wonder how to retrofit sanity.

Gah. It keeps coming back to this - to handle Unicode properly, you need a type system. Or at least enough of a type system to distinguish "text" from "binary".

I'm wondering if it's really too late. Given how much brokenness there is in this area…

* feed is a word visibly distinct from input. I'm failing to find a suitable antonym for this meaning of feed.

I prefer to just call them top and bottom.

Leon

p5pRT commented 10 years ago

From @Leont

On Fri, Sep 20, 2013 at 6:26 PM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:

Ah, y’all jogged my memory. Nicholas is looking for “source” and “sink” I think – cf. https://en.wikipedia.org/wiki/Sink_%28computing%29.

IO goes in both directions; so one side will be the source for input but the sink for output and vice versa.

Source and sink would be *terribly* confusing terms.

Leon