Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.99k stars 557 forks source link

tell / getline problems on Win32 with unix-delimited files opened with encoding(UTF-8) #13471

Open p5pRT opened 10 years ago

p5pRT commented 10 years ago

Migrated from rt.perl.org#120797 (status was 'open')

Searchable as RT120797$

p5pRT commented 10 years ago

From cm.perl@abtela.com

This is a bug report for perl from cm.perl@​abtela.com\, generated with the help of perlbug 1.39 running under perl 5.18.1.

When you open a unix-delimited file (i.e.\, lines end in LF\, not CRLF) on Win32 with   my $io = new IO​::File($filename\, "\<​:encoding(...)") a call to   tell $io; seem to corrupt the handle / layers state to the point that the next call to $io->getline does not return the next line as expected.

This is a serious problem as it precludes any use of   $io->input_line_number (which makes a call to tell) for unix-delimited files opened this way on Win32.

This seems to be the reason why Pod-Eventual-0.094001 fails tests on Win32. It calls input_line_number on handles opened by Mixin-Linewise-0.102 with a default encoding of "​:encoding(UTF-8)" (the introduction of this default encoding was apparently the rationale for the latest versions of these two modules).

Dist-Zilla\, Pod-Weaver\, Config-INI and other important CPAN distribs depend on these.

The attached test file io_tell_encoding.t illustrates the problem (it has been coded in the style of dist/IO/t/io_linenum.t in the hope that it would facilitate its integration... If Test​::More can be used I will be happy to provide a more explicit version).

What this test program does is first establish a "reference" version of the list of lines to be read (using a 'traditional' open) and then reads again the same file using various "extensions" :

  my $io = IO​::File->new($File\, "\<​:$encoding") or die $!;

for the following values of $encoding :

  "encoding(UTF-8)"\, "encoding(iso-8859-1)"\, ""\, "raw"\, "crlf"\, "utf8"

any of which should be able to read without problem a pure ASCII\, unix-delimited file. Each line read (with $io->getline) is compared with the reference.

In a first batch of tests there is a call to tell($io) after each $io->getline. This call is omitted in a second batch.

The test file is the test program itself (the comments at the end of the program text were crafted to make it easier to see the problem).

When stored as a unix (LF delimited) file\, this program yields

Taisha​:\~/devbin/tmp $ perl io_tell_encoding.t 1..12 # Running under perl version 5.018001 for MSWin32 # Current time local​: Mon Dec 16 01​:45​:11 2013 # Current time GMT​: Mon Dec 16 00​:45​:11 2013 # Using Test.pm version 1.26 not ok 1 # Test 1 got​: "line 1\, expected 'my $File;\n'\, got '5a6a7a8a9\n'" (io_tell_encoding.t at line 40) # Expected​: "OK" (encoding = encoding(UTF-8)\, tell = 1) # io_tell_encoding.t line 40 is​: ok(test($encoding\, $tell)\, "OK"\, "encoding = $encoding\, tell = $tell"); not ok 2 # Test 2 got​: "line 1\, expected 'my $File;\n'\, got '5a6a7a8a9\n'" (io_tell_encoding.t at line 40 fail #2) # Expected​: "OK" (encoding = encoding(iso-8859-1)\, tell = 1) ok 3 ok 4 ok 5 ok 6 ok 7 ok 8 ok 9 ok 10 ok 11 ok 12 Taisha​:\~/devbin/tmp $

We see that the test fails only for "encoding(...)" when tell($io) is called.

If the program is stored as a CRLF delimited file it yields instead

Taisha​:\~/devbin/tmp $ perl io_tell_encoding.t 1..12 # Running under perl version 5.018001 for MSWin32 # Current time local​: Mon Dec 16 01​:47​:14 2013 # Current time GMT​: Mon Dec 16 00​:47​:14 2013 # Using Test.pm version 1.26 ok 1 ok 2 ok 3 not ok 4 # Test 4 got​: "line 0\, expected '#!./perl\n'\, got '#!./perl\r\n'" (io_tell_encoding.t at line 40 fail #4) # Expected​: "OK" (encoding = raw\, tell = 1) # io_tell_encoding.t line 40 is​: ok(test($encoding\, $tell)\, "OK"\, "encoding = $encoding\, tell = $tell"); ok 5 ok 6 ok 7 ok 8 ok 9 not ok 10 # Test 10 got​: "line 0\, expected '#!./perl\n'\, got '#!./perl\r\n'" (io_tell_encoding.t at line 40 fail #10) # Expected​: "OK" (encoding = raw\, tell = 0) ok 11 ok 12 Taisha​:\~/devbin/tmp $

now the only encoding that fails is '​:raw'\, which is normal and unrelated to this ticket.

I have tried to investigate further but after a few hours concluded that this problem was way over my head :(

Thank you for your time and attention.


Flags​:   category=core   severity=critical


Site configuration information for perl 5.18.1​:

Configured by strawberry-perl at Tue Aug 13 19​:21​:46 2013.

Summary of my perl5 (revision 5 version 18 subversion 1) configuration​:

  Platform​:   osname=MSWin32\, osvers=4.0\, archname=MSWin32-x86-multi-thread-64int   uname='Win32 strawberry-perl 5.18.1.1 #1 Tue Aug 13 19​:20​:13 2013 i386'   config_args='undef'   hint=recommended\, useposix=true\, d_sigaction=undef   useithreads=define\, usemultiplicity=define   useperlio=define\, d_sfio=undef\, uselargefiles=define\, usesocks=undef   use64bitint=define\, use64bitall=undef\, uselongdouble=undef   usemymalloc=n\, bincompat5005=undef   Compiler​:   cc='gcc'\, ccflags =' -s -O2 -DWIN32 -DPERL_TEXTMODE_SCRIPTS -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -fno-strict-aliasing -mms-bitfields'\,   optimize='-s -O2'\,   cppflags='-DWIN32'   ccversion=''\, gccversion='4.7.3'\, gccosandvers=''   intsize=4\, longsize=4\, ptrsize=4\, doublesize=8\, byteorder=12345678   d_longlong=define\, longlongsize=8\, d_longdbl=define\, longdblsize=12   ivtype='long long'\, ivsize=8\, nvtype='double'\, nvsize=8\, Off_t='long long'\, lseeksize=8   alignbytes=8\, prototype=define   Linker and Libraries​:   ld='g++.exe'\, ldflags ='-s -L"E​:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\perl\lib\CORE" -L"E​:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\c\lib"'   libpth=E​:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\c\lib E​:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\c\i686-w64-mingw32\lib   libs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32   perllibs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32   libc=\, so=dll\, useshrplib=true\, libperl=libperl518.a   gnulibc_version=''   Dynamic Linking​:   dlsrc=dl_win32.xs\, dlext=dll\, d_dlsymun=undef\, ccdlflags=' '   cccdlflags=' '\, lddlflags='-mdll -s -L"E​:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\perl\lib\CORE" -L"E​:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\c\lib"'

Locally applied patches​:


@​INC for perl 5.18.1​:   E​:/cm/devbin/strawberry-perl-5.18.1.1-32bit-portable/perl/site/lib   E​:/cm/devbin/strawberry-perl-5.18.1.1-32bit-portable/perl/vendor/lib   E​:/cm/devbin/strawberry-perl-5.18.1.1-32bit-portable/perl/lib   .


Environment for perl 5.18.1​:   CYGWIN=nodosfilewarning   HOME=e​:/cm   LANG (unset)   LANGUAGE (unset)   LD_LIBRARY_PATH (unset)   LOGDIR (unset)

PATH=E​:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\perl\site\bin;E​:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\perl\bin;E​:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\c\bin;C​:\Windows\system32;C​:\Windows;C​:\Windows\System32\Wbem;C​:\Windows\System32\WindowsPowerShell\v1.0\;C​:\Program Files (x86)\PuTTY;C​:\Program Files (x86)\OpenOffice.org 3\program;C​:\Program Files (x86)\QT Lite\QTSystem;c​:\Program Files\WinRAR;C​:\Program Files (x86)\Calibre2\   PERL_BADLANG (unset)   SHELL (unset)

p5pRT commented 10 years ago

From cm.perl@abtela.com

#!./perl my $File; my @​lines; my @​encodings;

BEGIN {   $File = __FILE__;   require strict; import strict;   @​lines = do {   open(my $f\, "\<"\, $File) or die $!;   \<$f>;   };

  @​encodings = ("encoding(UTF-8)"\, "encoding(iso-8859-1)"\, ""\, "raw"\, "crlf"\, "utf8"); }

use Test;

BEGIN { plan tests => 2*@​encodings }

use IO​::File;

sub test {   my ($encoding\, $tell) = @​_;   my $io = IO​::File->new($File\, "\<​:$encoding") or die $!;   my $cnt = 0;   while (defined (my $line = $io->getline)) {   $line eq $lines[$cnt]   or return "line $cnt\, expected '$lines[$cnt]'\, got '$line'";   if ($tell) {   tell $io;   }   ++$cnt;   }   return "OK"; }

for my $tell (1\, 0) {   for my $encoding (@​encodings) {   ok(test($encoding\, $tell)\, "OK"\, "encoding = $encoding\, tell = $tell");   } } #a0a1a2a3a4a5a6a7a8a9 #b0b1b2b3b4b5b6b7b8b9

p5pRT commented 10 years ago

From cm.perl@abtela.com

Le 16/12/2013 02​:36\, Christian Millour (via RT) a écrit :

When you open a unix-delimited file (i.e.\, lines end in LF\, not CRLF) on Win32 with my $io = new IO​::File($filename\, "\<​:encoding(...)") a call to tell $io; seem to corrupt the handle / layers state to the point that the next call to $io->getline does not return the next line as expected.

Note that the problem is not limited to Win32. You get the same misbehavior on unix when using "\<​:crlf​:encoding(whatever)" on a LF-delimited file.

What is really needed here is a "permissive" :crlf layer\, i.e. one that will allow reading either LF- or CRLF-delimited files. This is more or less already the case in practice\, unless you use an encoding layer with :crlf. In that latter case\, PerlIOEncode_flush() calls PerlIO_unread()\, which resolves to PerlIOCrlf_unread()\, which currently always translate back '\n' as a CR LF pair irrespective of the original content\, potentially thrashing the buffer in the process.

The attached tentative patch implements a form of autodetection of the delimiter actually used in the stream. It involves a new PerlIO flag\, currently named (this is negotiable :)) PERLIO_F_CRLFSAWLF. This flag is set by PerlIOCrlf_get_cnt() on finding a LF (actually a NATIVE_0xd). PerlIOCrlf_unread() then does its specific work only if the flag got set\, and otherwise does a regulard PerlIOBuf_unread(). All bets are off though if the file being read uses both LF and CRLF.

This patch seems to work with blead on linux and Win32\, at least as a proof of concept. Dedicating an PerlIO flag for this might look as a stiff price to pay but it keeps things simple (I thought for a time that playing with (PerlIOSelf(f\, PerlIOCrlf))->nl might be enough but have not been able to build a conviction that it would work in all cases).

The second patch contains a modified version of io_tell_encoding.t to showcase the problem and test solutions on unix as well as Win32.

Opinions / corrections / tests / smokes / alternatives welcome :)

Regards\,

--Christian

p5pRT commented 10 years ago

From cm.perl@abtela.com

0001-add-test-file-for-permissive-crlf-layer-see-120797.patch ```diff From b769bca8110f69fbfbda61cfc9cf210e3cbc0b80 Mon Sep 17 00:00:00 2001 From: Christian Millour Date: Tue, 17 Dec 2013 20:34:45 +0100 Subject: [PATCH] add test file for permissive crlf layer (see #120797) MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="------------1.7.4" This is a multi-part message in MIME format. --------------1.7.4 Content-Type: text/plain; charset=UTF-8; format=fixed Content-Transfer-Encoding: 8bit --- dist/IO/t/io_tell_encoding.t | 49 ++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 49 insertions(+), 0 deletions(-) create mode 100755 dist/IO/t/io_tell_encoding.t --------------1.7.4 Content-Type: text/x-patch; name="0001-add-test-file-for-permissive-crlf-layer-see-120797.patch" Content-Transfer-Encoding: 8bit Content-Disposition: attachment; filename="0001-add-test-file-for-permissive-crlf-layer-see-120797.patch" diff --git a/dist/IO/t/io_tell_encoding.t b/dist/IO/t/io_tell_encoding.t new file mode 100755 index 0000000..5283e26 --- /dev/null +++ b/dist/IO/t/io_tell_encoding.t @@ -0,0 +1,49 @@ +#!./perl +# Test permissive read of LF-delimited files on Win32 with encoding +# layers (see #120797). +# This file should contain only ASCII and be stored LF-delimited to +# exhibit the potential problems. +my @encodings; + +BEGIN { + require strict; import strict; + @encodings = (":crlf:encoding(UTF-8)", ":crlf:encoding(iso-8859-1)"); +} + +use Test; + +BEGIN { plan tests => 2*@encodings } + +use IO::File; + +my $File = __FILE__; +my @lines = do { + open(my $f, "<", $File) or die $!; + <$f>; +}; + +sub test { + my ($encoding, $tell, $actualref) = @_; + study; + my $io = IO::File->new($File, "<$encoding") or die $!; + $$actualref = join ":", q{}, PerlIO::get_layers($io); + my $cnt = 0; + while (defined (my $line = $io->getline)) { + $line eq $lines[$cnt] + or return "line $cnt, expected '$lines[$cnt]', got '$line'"; + if ($tell) { + () = tell $io; + } + ++$cnt; + } + return "OK"; +} + +for my $tell (1, 0) { + for my $encoding (@encodings) { + my $actual; + ok(test($encoding, $tell, \$actual), "OK", "encoding = $encoding, actual = $actual, tell = $tell"); + } +} +#a0a1a2a3a4a5a6a7a8a9 +#b0b1b2b3b4b5b6b7b8b9 --------------1.7.4-- ```
p5pRT commented 10 years ago

From cm.perl@abtela.com

0001-implement-permissive-crlf-layer-see-120797.patch ```diff From 6d5e12cf2c83280bfea291f35006279dedf14496 Mon Sep 17 00:00:00 2001 From: Christian Millour Date: Tue, 17 Dec 2013 20:25:28 +0100 Subject: [PATCH] implement permissive :crlf layer (see #120797) MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="------------1.7.4" This is a multi-part message in MIME format. --------------1.7.4 Content-Type: text/plain; charset=UTF-8; format=fixed Content-Transfer-Encoding: 8bit --- perlio.c | 14 +++++++++++++- perliol.h | 1 + 2 files changed, 14 insertions(+), 1 deletions(-) --------------1.7.4 Content-Type: text/x-patch; name="0001-implement-permissive-crlf-layer-see-120797.patch" Content-Transfer-Encoding: 8bit Content-Disposition: attachment; filename="0001-implement-permissive-crlf-layer-see-120797.patch" diff --git a/perlio.c b/perlio.c index d89b9bb..5f028af 100644 --- a/perlio.c +++ b/perlio.c @@ -4477,9 +4477,18 @@ PERLIO_FUNCS_DECL(PerlIO_pending) = { * replaced by LF, or to the last CR of the buffer. In the former case * the caller thinks that the buffer ends at c->nl + 1, in the latter * that it ends at c->nl; these two cases can be distinguished by - * *c->nl. c->nl is set during _getcnt() call, and unset during + * *c->nl. c->nl is set during _get_cnt() call, and unset during * _unread() and _flush() calls. + * * It only matters for read operations. + * + * The flag PERLIO_F_CRLFSAWCR is used to allow "permissive" CRLF, for + * e.g. reading LF (unix) delimited files on Win32, the issue being to + * prevent _unread() from translating back indiscriminately '\n' as + * a CR LF pair in that case (this is an issue because _unread() may + * be used quite a lot if there is an encoding(xxx) layer upstream). + * So the latter translation will be done only if this flag got set, + * which is done by _get_cnt() on finding a NATIVE_0xd. */ typedef struct { @@ -4538,6 +4547,8 @@ SSize_t PerlIOCrlf_unread(pTHX_ PerlIO *f, const void *vbuf, Size_t count) { PerlIOCrlf * const c = PerlIOSelf(f, PerlIOCrlf); + if (!(PerlIOBase(f)->flags & PERLIO_F_CRLF) || !(PerlIOBase(f)->flags & PERLIO_F_CRLFSAWCR)) + return PerlIOBuf_unread(aTHX_ f, vbuf, count); if (c->nl) { /* XXXX Shouldn't it be done only if b->ptr > c->nl? */ *(c->nl) = NATIVE_0xd; c->nl = NULL; @@ -4603,6 +4614,7 @@ PerlIOCrlf_get_cnt(pTHX_ PerlIO *f) while (nl < b->end && *nl != NATIVE_0xd) nl++; if (nl < b->end && *nl == NATIVE_0xd) { + PerlIOBase(f)->flags |= PERLIO_F_CRLFSAWCR; test: if (nl + 1 < b->end) { if (nl[1] == NATIVE_0xa) { diff --git a/perliol.h b/perliol.h index 82a3840..e26a331 100644 --- a/perliol.h +++ b/perliol.h @@ -91,6 +91,7 @@ struct _PerlIO { #define PERLIO_F_TTY 0x00800000 #define PERLIO_F_NOTREG 0x01000000 #define PERLIO_F_CLEARED 0x02000000 /* layer cleared but not freed */ +#define PERLIO_F_CRLFSAWCR 0x04000000 #define PerlIOBase(f) (*(f)) #define PerlIOSelf(f,type) ((type *)PerlIOBase(f)) --------------1.7.4-- ```
p5pRT commented 10 years ago

From @Leont

On Tue\, Dec 17\, 2013 at 9​:32 PM\, Christian Millour \cm\.perl@&#8203;abtela\.comwrote​:

Note that the problem is not limited to Win32. You get the same misbehavior on unix when using "\<​:crlf​:encoding(whatever)" on a LF-delimited file.

What is really needed here is a "permissive" :crlf layer\, i.e. one that will allow reading either LF- or CRLF-delimited files. This is more or less already the case in practice\, unless you use an encoding layer with :crlf. In that latter case\, PerlIOEncode_flush() calls PerlIO_unread()\, which resolves to PerlIOCrlf_unread()\, which currently always translate back '\n' as a CR LF pair irrespective of the original content\, potentially thrashing the buffer in the process.

The attached tentative patch implements a form of autodetection of the delimiter actually used in the stream. It involves a new PerlIO flag\, currently named (this is negotiable :)) PERLIO_F_CRLFSAWLF. This flag is set by PerlIOCrlf_get_cnt() on finding a LF (actually a NATIVE_0xd). PerlIOCrlf_unread() then does its specific work only if the flag got set\, and otherwise does a regulard PerlIOBuf_unread(). All bets are off though if the file being read uses both LF and CRLF.

This patch seems to work with blead on linux and Win32\, at least as a proof of concept. Dedicating an PerlIO flag for this might look as a stiff price to pay but it keeps things simple (I thought for a time that playing with (PerlIOSelf(f\, PerlIOCrlf))->nl might be enough but have not been able to build a conviction that it would work in all cases).

The second patch contains a modified version of io_tell_encoding.t to showcase the problem and test solutions on unix as well as Win32.

Opinions / corrections / tests / smokes / alternatives welcome :)

That whole method is an optimization anyway. I'm wondering if getting rid of it wouldn't be a better solution. It makes ungetc less efficient though\, I'm not sure how often it gets used (I thought I had previously applied a patch to make eof not use it by default\, but it appears not). May want to reduce the usage before applying this though.

Leon

p5pRT commented 10 years ago

From @Leont

0001-Use-more-naive-unreading-for-crlf.patch ```diff From 509698a69731983ff66ac081beed45186e31b7f3 Mon Sep 17 00:00:00 2001 From: Leon Timmermans Date: Thu, 19 Dec 2013 01:01:09 +0100 Subject: [PATCH] Use more naive unreading for crlf --- perlio.c | 57 +-------------------------------------------------------- 1 files changed, 1 insertions(+), 56 deletions(-) diff --git a/perlio.c b/perlio.c index d89b9bb..718b2b5 100644 --- a/perlio.c +++ b/perlio.c @@ -4533,61 +4533,6 @@ PerlIOCrlf_pushed(pTHX_ PerlIO *f, const char *mode, SV *arg, PerlIO_funcs *tab) return code; } - -SSize_t -PerlIOCrlf_unread(pTHX_ PerlIO *f, const void *vbuf, Size_t count) -{ - PerlIOCrlf * const c = PerlIOSelf(f, PerlIOCrlf); - if (c->nl) { /* XXXX Shouldn't it be done only if b->ptr > c->nl? */ - *(c->nl) = NATIVE_0xd; - c->nl = NULL; - } - if (!(PerlIOBase(f)->flags & PERLIO_F_CRLF)) - return PerlIOBuf_unread(aTHX_ f, vbuf, count); - else { - const STDCHAR *buf = (const STDCHAR *) vbuf + count; - PerlIOBuf *b = PerlIOSelf(f, PerlIOBuf); - SSize_t unread = 0; - if (PerlIOBase(f)->flags & PERLIO_F_WRBUF) - PerlIO_flush(f); - if (!b->buf) - PerlIO_get_base(f); - if (b->buf) { - if (!(PerlIOBase(f)->flags & PERLIO_F_RDBUF)) { - b->end = b->ptr = b->buf + b->bufsiz; - PerlIOBase(f)->flags |= PERLIO_F_RDBUF; - b->posn -= b->bufsiz; - } - while (count > 0 && b->ptr > b->buf) { - const int ch = *--buf; - if (ch == '\n') { - if (b->ptr - 2 >= b->buf) { - *--(b->ptr) = NATIVE_0xa; - *--(b->ptr) = NATIVE_0xd; - unread++; - count--; - } - else { - /* If b->ptr - 1 == b->buf, we are undoing reading 0xa */ - *--(b->ptr) = NATIVE_0xa; /* Works even if 0xa == - '\r' */ - unread++; - count--; - } - } - else { - *--(b->ptr) = ch; - unread++; - count--; - } - } - } - if (count > 0) - unread += PerlIOBase_unread(aTHX_ f, (const STDCHAR *) vbuf + unread, count); - return unread; - } -} - /* XXXX This code assumes that buffer size >=2, but does not check it... */ SSize_t PerlIOCrlf_get_cnt(pTHX_ PerlIO *f) @@ -4804,7 +4749,7 @@ PERLIO_FUNCS_DECL(PerlIO_crlf) = { PerlIOBase_fileno, PerlIOBuf_dup, PerlIOBuf_read, /* generic read works with ptr/cnt lies */ - PerlIOCrlf_unread, /* Put CR,LF in buffer for each '\n' */ + PerlIOBase_unread, PerlIOCrlf_write, /* Put CR,LF in buffer for each '\n' */ PerlIOBuf_seek, PerlIOBuf_tell, -- 1.7.1 ```
p5pRT commented 10 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 10 years ago

From @nwc10

On Thu\, Dec 19\, 2013 at 01​:58​:05AM +0100\, Leon Timmermans wrote​:

On Tue\, Dec 17\, 2013 at 9​:32 PM\, Christian Millour \cm\.perl@&#8203;abtela\.comwrote​:

Given this​:

All bets are off though if the file being read uses both LF and CRLF.

then​:

That whole method is an optimization anyway. I'm wondering if getting rid

it's not much of an optimsiation if it breaks things.

of it wouldn't be a better solution. It makes ungetc less efficient though\, I'm not sure how often it gets used (I thought I had previously applied a patch to make eof not use it by default\, but it appears not). May want to reduce the usage before applying this though.

Is there any way to gauge how often ungetc() is called?

[snip patch which removes a chunk of code]

I like the direction that your suggested patch is taking the PerlIO codebase.

Nicholas Clark

p5pRT commented 7 years ago

From @jkeenan

On Tue\, 17 Dec 2013 20​:33​:33 GMT\, cm.perl@​abtela.com wrote​:

Le 16/12/2013 02​:36\, Christian Millour (via RT) a écrit :

When you open a unix-delimited file (i.e.\, lines end in LF\, not CRLF) on Win32 with my $io = new IO​::File($filename\, "\<​:encoding(...)") a call to tell $io; seem to corrupt the handle / layers state to the point that the next call to $io->getline does not return the next line as expected.

Note that the problem is not limited to Win32. You get the same misbehavior on unix when using "\<​:crlf​:encoding(whatever)" on a LF-delimited file.

What is really needed here is a "permissive" :crlf layer\, i.e. one that will allow reading either LF- or CRLF-delimited files. This is more or less already the case in practice\, unless you use an encoding layer with :crlf. In that latter case\, PerlIOEncode_flush() calls PerlIO_unread()\, which resolves to PerlIOCrlf_unread()\, which currently always translate back '\n' as a CR LF pair irrespective of the original content\, potentially thrashing the buffer in the process.

The attached tentative patch implements a form of autodetection of the delimiter actually used in the stream. It involves a new PerlIO flag\, currently named (this is negotiable :)) PERLIO_F_CRLFSAWLF. This flag is set by PerlIOCrlf_get_cnt() on finding a LF (actually a NATIVE_0xd). PerlIOCrlf_unread() then does its specific work only if the flag got set\, and otherwise does a regulard PerlIOBuf_unread(). All bets are off though if the file being read uses both LF and CRLF.

This patch seems to work with blead on linux and Win32\, at least as a proof of concept. Dedicating an PerlIO flag for this might look as a stiff price to pay but it keeps things simple (I thought for a time that playing with (PerlIOSelf(f\, PerlIOCrlf))->nl might be enough but have not been able to build a conviction that it would work in all cases).

The second patch contains a modified version of io_tell_encoding.t to showcase the problem and test solutions on unix as well as Win32.

Opinions / corrections / tests / smokes / alternatives welcome :)

To make this discussion more visible\, I have created the following smoke branch​:

smoke-me/jkeenan/120797-perlio

Regards\,

--Christian

-- James E Keenan (jkeenan@​cpan.org)

jkeenan commented 4 years ago

Mentioned on list recently, e.g., https://www.nntp.perl.org/group/perl.perl5.porters/2020/07/msg257916.html

Leont commented 4 years ago

Has anyone tried to run my patch? That would be helpful