This is a bug report for perl from cm.perl@abtela.com, generated with the help of perlbug 1.39 running under perl 5.18.1.
When you open a unix-delimited file (i.e., lines end in LF, not CRLF) on Win32 with my $io = new IO::File($filename, "<:encoding(...)"), a call to tell $io; seems to corrupt the handle / layers state to the point that the next call to $io->getline does not return the next line as expected.
This is a serious problem because it precludes any use of $io->input_line_number (which calls tell) for unix-delimited files opened this way on Win32.
This seems to be the reason why Pod-Eventual-0.094001 fails tests on Win32. It calls input_line_number on handles opened by Mixin-Linewise-0.102 with a default encoding of ":encoding(UTF-8)" (the introduction of this default encoding was apparently the rationale for the latest versions of these two modules).
Dist-Zilla, Pod-Weaver, Config-INI and other important CPAN distributions depend on these.
The attached test file io_tell_encoding.t illustrates the problem (it has been coded in the style of dist/IO/t/io_linenum.t in the hope that this would ease its integration. If Test::More can be used, I will be happy to provide a more explicit version).
The test program first establishes a "reference" list of the lines to be read (using a 'traditional' open) and then reads the same file again using various "extensions":

    my $io = IO::File->new($File, "<:$encoding") or die $!;

for the following values of $encoding:

    "encoding(UTF-8)", "encoding(iso-8859-1)", "", "raw", "crlf", "utf8"

any of which should be able to read a pure ASCII, unix-delimited file without problem. Each line read (with $io->getline) is compared with the reference.
In a first batch of tests there is a call to tell($io) after each $io->getline. This call is omitted in a second batch.
The test file is the test program itself (the comments at the end of the program text were crafted to make it easier to see the problem).
When stored as a unix (LF-delimited) file, this program yields

    Taisha:~/devbin/tmp $ perl io_tell_encoding.t
    1..12
    # Running under perl version 5.018001 for MSWin32
    # Current time local: Mon Dec 16 01:45:11 2013
    # Current time GMT: Mon Dec 16 00:45:11 2013
    # Using Test.pm version 1.26
    not ok 1
    # Test 1 got: "line 1, expected 'my $File;\n', got '5a6a7a8a9\n'" (io_tell_encoding.t at line 40)
    # Expected: "OK" (encoding = encoding(UTF-8), tell = 1)
    # io_tell_encoding.t line 40 is: ok(test($encoding, $tell), "OK", "encoding = $encoding, tell = $tell");
    not ok 2
    # Test 2 got: "line 1, expected 'my $File;\n', got '5a6a7a8a9\n'" (io_tell_encoding.t at line 40 fail #2)
    # Expected: "OK" (encoding = encoding(iso-8859-1), tell = 1)
    ok 3
    ok 4
    ok 5
    ok 6
    ok 7
    ok 8
    ok 9
    ok 10
    ok 11
    ok 12
    Taisha:~/devbin/tmp $

We see that the test fails only for "encoding(...)" when tell($io) is called.
If the program is stored as a CRLF-delimited file, it yields instead

    Taisha:~/devbin/tmp $ perl io_tell_encoding.t
    1..12
    # Running under perl version 5.018001 for MSWin32
    # Current time local: Mon Dec 16 01:47:14 2013
    # Current time GMT: Mon Dec 16 00:47:14 2013
    # Using Test.pm version 1.26
    ok 1
    ok 2
    ok 3
    not ok 4
    # Test 4 got: "line 0, expected '#!./perl\n', got '#!./perl\r\n'" (io_tell_encoding.t at line 40 fail #4)
    # Expected: "OK" (encoding = raw, tell = 1)
    # io_tell_encoding.t line 40 is: ok(test($encoding, $tell), "OK", "encoding = $encoding, tell = $tell");
    ok 5
    ok 6
    ok 7
    ok 8
    ok 9
    not ok 10
    # Test 10 got: "line 0, expected '#!./perl\n', got '#!./perl\r\n'" (io_tell_encoding.t at line 40 fail #10)
    # Expected: "OK" (encoding = raw, tell = 0)
    ok 11
    ok 12
    Taisha:~/devbin/tmp $

Now the only layer that fails is ':raw', which is expected (it leaves the CR in place) and unrelated to this ticket.
I have tried to investigate further but after a few hours concluded that this problem was way over my head :(
Thank you for your time and attention.
Flags: category=core severity=critical
Site configuration information for perl 5.18.1:
Configured by strawberry-perl at Tue Aug 13 19:21:46 2013.
Summary of my perl5 (revision 5 version 18 subversion 1) configuration:

    Platform:
      osname=MSWin32, osvers=4.0, archname=MSWin32-x86-multi-thread-64int
      uname='Win32 strawberry-perl 5.18.1.1 #1 Tue Aug 13 19:20:13 2013 i386'
      config_args='undef'
      hint=recommended, useposix=true, d_sigaction=undef
      useithreads=define, usemultiplicity=define
      useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
      use64bitint=define, use64bitall=undef, uselongdouble=undef
      usemymalloc=n, bincompat5005=undef
    Compiler:
      cc='gcc', ccflags =' -s -O2 -DWIN32 -DPERL_TEXTMODE_SCRIPTS -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -fno-strict-aliasing -mms-bitfields',
      optimize='-s -O2',
      cppflags='-DWIN32'
      ccversion='', gccversion='4.7.3', gccosandvers=''
      intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
      d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
      ivtype='long long', ivsize=8, nvtype='double', nvsize=8, Off_t='long long', lseeksize=8
      alignbytes=8, prototype=define
    Linker and Libraries:
      ld='g++.exe', ldflags ='-s -L"E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\perl\lib\CORE" -L"E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\c\lib"'
      libpth=E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\c\lib E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\c\i686-w64-mingw32\lib
      libs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
      perllibs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
      libc=, so=dll, useshrplib=true, libperl=libperl518.a
      gnulibc_version=''
    Dynamic Linking:
      dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
      cccdlflags=' ', lddlflags='-mdll -s -L"E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\perl\lib\CORE" -L"E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\c\lib"'

Locally applied patches:
@INC for perl 5.18.1:

    E:/cm/devbin/strawberry-perl-5.18.1.1-32bit-portable/perl/site/lib
    E:/cm/devbin/strawberry-perl-5.18.1.1-32bit-portable/perl/vendor/lib
    E:/cm/devbin/strawberry-perl-5.18.1.1-32bit-portable/perl/lib
    .

Environment for perl 5.18.1:

    CYGWIN=nodosfilewarning
    HOME=e:/cm
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\perl\site\bin;E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\perl\bin;E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\c\bin;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\PuTTY;C:\Program Files (x86)\OpenOffice.org 3\program;C:\Program Files (x86)\QT Lite\QTSystem;c:\Program Files\WinRAR;C:\Program Files (x86)\Calibre2\
    PERL_BADLANG (unset)
    SHELL (unset)

io_tell_encoding.t:

    #!./perl
    my $File;
    my @lines;
    my @encodings;

    BEGIN {
        $File = __FILE__;
        require strict; import strict;
        @lines = do {
            open(my $f, "<", $File) or die $!;
            <$f>;
        };

        @encodings = ("encoding(UTF-8)", "encoding(iso-8859-1)", "", "raw", "crlf", "utf8");
    }

    use Test;

    BEGIN { plan tests => 2*@encodings }

    use IO::File;

    sub test {
        my ($encoding, $tell) = @_;
        my $io = IO::File->new($File, "<:$encoding") or die $!;
        my $cnt = 0;
        while (defined (my $line = $io->getline)) {
            $line eq $lines[$cnt]
                or return "line $cnt, expected '$lines[$cnt]', got '$line'";
            if ($tell) { tell $io; }
            ++$cnt;
        }
        return "OK";
    }

    for my $tell (1, 0) {
        for my $encoding (@encodings) {
            ok(test($encoding, $tell), "OK", "encoding = $encoding, tell = $tell");
        }
    }
    #a0a1a2a3a4a5a6a7a8a9
    #b0b1b2b3b4b5b6b7b8b9

On 16/12/2013 02:36, Christian Millour (via RT) wrote:
> When you open a unix-delimited file (i.e., lines end in LF, not CRLF) on Win32 with my $io = new IO::File($filename, "<:encoding(...)") a call to tell $io; seem to corrupt the handle / layers state to the point that the next call to $io->getline does not return the next line as expected.

Note that the problem is not limited to Win32. You get the same misbehavior on unix when using "<:crlf:encoding(whatever)" on an LF-delimited file.
What is really needed here is a "permissive" :crlf layer, i.e. one that will allow reading either LF- or CRLF-delimited files. This is more or less already the case in practice, unless you use an encoding layer with :crlf. In that latter case, PerlIOEncode_flush() calls PerlIO_unread(), which resolves to PerlIOCrlf_unread(), which currently always translates '\n' back as a CR LF pair irrespective of the original content, potentially trashing the buffer in the process.
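
To make the buffer damage described above concrete, here is a toy model in C. It is only a sketch under stated assumptions, not the perlio.c code: ToyBuf, toy_init and toy_unread_always_crlf are hypothetical names, and the buffer is a plain fixed-size array. It shows what happens when characters read from LF-only input are pushed back with every '\n' written as CR LF.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy read buffer (hypothetical, not a PerlIO structure):
 * data[0..len) holds raw file bytes, pos indexes the next unread byte. */
typedef struct {
    char   data[64];
    size_t len;
    size_t pos;
} ToyBuf;

/* Load raw bytes and pretend the first `consumed` bytes were already read. */
static void toy_init(ToyBuf *b, const char *raw, size_t consumed)
{
    size_t len = strlen(raw);
    assert(len <= sizeof b->data && consumed <= len);
    memcpy(b->data, raw, len);
    b->len = len;
    b->pos = consumed;
}

/* Push `count` characters back in front of pos, last character first,
 * always writing '\n' as CR LF whatever the original input contained.
 * Returns how many characters fit into the space before pos. */
static size_t toy_unread_always_crlf(ToyBuf *b, const char *chars, size_t count)
{
    size_t pushed = 0;
    while (pushed < count) {
        char   ch   = chars[count - 1 - pushed];
        size_t need = (ch == '\n') ? 2 : 1;
        if (b->pos < need)
            break;                      /* no room left before pos */
        if (ch == '\n') {
            b->data[--b->pos] = '\n';
            b->data[--b->pos] = '\r';   /* one extra byte per newline */
        } else {
            b->data[--b->pos] = ch;
        }
        pushed++;
    }
    return pushed;
}
```

With the LF-only input "ab\ncd\nef\n" and six bytes consumed, pushing the six characters "ab\ncd\n" back needs eight bytes, so only four of them fit. The buffer then starts with "\r\ncd\r\n" and the leading "ab" is gone, which is the kind of mismatch between buffer and file that the paragraph above describes.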
The attached tentative patch implements a form of autodetection of the delimiter actually used in the stream. It involves a new PerlIO flag, currently named (this is negotiable :)) PERLIO_F_CRLFSAWLF. This flag is set by PerlIOCrlf_get_cnt() on finding a LF (actually a NATIVE_0xd). PerlIOCrlf_unread() then does its specific work only if the flag got set, and otherwise does a regular PerlIOBuf_unread(). All bets are off, though, if the file being read uses both LF and CRLF.
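
The flag-gated behaviour can be sketched with the same kind of toy model in C. This is not the attached patch: the saw_crlf field is a hypothetical stand-in for the proposed flag, all toy_* names are invented for illustration, and the sketch assumes the flag is raised when the reader collapses a CR LF pair (the exact trigger in the patch may differ).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy read buffer; saw_crlf is a hypothetical stand-in for the proposed flag. */
typedef struct {
    char   data[64];
    size_t len;
    size_t pos;
    bool   saw_crlf;
} ToyBuf;

/* Load raw bytes and reset the read position and the flag. */
static void toy_fill(ToyBuf *b, const char *raw)
{
    size_t len = strlen(raw);
    assert(len <= sizeof b->data);
    memcpy(b->data, raw, len);
    b->len = len;
    b->pos = 0;
    b->saw_crlf = false;
}

/* Read one translated character, or -1 at end of buffer.
 * A CR LF pair is returned as '\n' and raises the flag (assumed trigger). */
static int toy_getc(ToyBuf *b)
{
    if (b->pos >= b->len)
        return -1;
    char ch = b->data[b->pos++];
    if (ch == '\r' && b->pos < b->len && b->data[b->pos] == '\n') {
        b->pos++;
        b->saw_crlf = true;
        return '\n';
    }
    return (unsigned char)ch;
}

/* Read up to `want` translated characters into out; returns the count. */
static size_t toy_read(ToyBuf *b, char *out, size_t want)
{
    size_t n = 0;
    int ch;
    while (n < want && (ch = toy_getc(b)) != -1)
        out[n++] = (char)ch;
    return n;
}

/* Push characters back, expanding '\n' to CR LF only if the flag is set. */
static size_t toy_unread(ToyBuf *b, const char *chars, size_t count)
{
    size_t pushed = 0;
    while (pushed < count) {
        char   ch     = chars[count - 1 - pushed];
        bool   expand = (ch == '\n') && b->saw_crlf;
        size_t need   = expand ? 2 : 1;
        if (b->pos < need)
            break;                      /* no room left before pos */
        b->data[--b->pos] = ch;
        if (expand)
            b->data[--b->pos] = '\r';
        pushed++;
    }
    return pushed;
}
```

In this model, reading and then unreading "ab\ncd\n" (LF only) or "ab\r\ncd\r\n" (CRLF only) restores the buffer byte for byte. The mixed input "ab\ncd\r\n" still goes wrong: only five of the six characters fit back, which illustrates the "all bets are off" caveat for files that use both delimiters.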
This patch seems to work with blead on linux and Win32, at least as a proof of concept. Dedicating a PerlIO flag to this might look like a stiff price to pay, but it keeps things simple (I thought for a time that playing with (PerlIOSelf(f, PerlIOCrlf))->nl might be enough, but I have not been able to convince myself that it would work in all cases).
The second patch contains a modified version of io_tell_encoding.t to showcase the problem and test solutions on unix as well as Win32.
Opinions / corrections / tests / smokes / alternatives welcome :)
Regards,
--Christian
On Tue, Dec 17, 2013 at 9:32 PM, Christian Millour <cm.perl@abtela.com> wrote:
[snip quoted message]
That whole method is an optimization anyway. I'm wondering if getting rid of it wouldn't be a better solution. It makes ungetc less efficient though, and I'm not sure how often it gets used (I thought I had previously applied a patch to make eof not use it by default, but it appears not). We may want to reduce the usage before applying this though.
Leon
The RT System itself - Status changed from 'new' to 'open'
On Thu, Dec 19, 2013 at 01:58:05AM +0100, Leon Timmermans wrote:
> On Tue, Dec 17, 2013 at 9:32 PM, Christian Millour <cm.perl@abtela.com> wrote:

Given this:

> > All bets are off though if the file being read uses both LF and CRLF.

then:

> That whole method is an optimization anyway. I'm wondering if getting rid

it's not much of an optimisation if it breaks things.

> of it wouldn't be a better solution. It makes ungetc less efficient though, I'm not sure how often it gets used (I thought I had previously applied a patch to make eof not use it by default, but it appears not). May want to reduce the usage before applying this though.

Is there any way to gauge how often ungetc() is called?
[snip patch which removes a chunk of code]
I like the direction that your suggested patch is taking the PerlIO codebase.
Nicholas Clark
On Tue, 17 Dec 2013 20:33:33 GMT, cm.perl@abtela.com wrote:
[snip quoted message]
To make this discussion more visible, I have created the following smoke branch:
smoke-me/jkeenan/120797-perlio
-- James E Keenan (jkeenan@cpan.org)
Mentioned on list recently, e.g., https://www.nntp.perl.org/group/perl.perl5.porters/2020/07/msg257916.html
Has anyone tried to run my patch? That would be helpful.
Migrated from rt.perl.org#120797 (status was 'open')
Searchable as RT120797$