Open p5pRT opened 17 years ago
If the DATA filehandle is read from a UTF16LE encoded source file (on Windows at least), it appears the initial offset is out by 1. That is, it can be fixed by doing a C< seek DATA, 1, 1 >.
# %<
# Create a UTF16LE encoded test file:
my $test_file = "DATA_test.tmp";
#END { unlink $test_file };
open FOUT, ">:raw:perlio:encoding(utf16le)", $test_file or die;
print FOUT <<'EOT';
binmode DATA, ':encoding(utf16le)';
if (0) { # XXX this is needed
    seek DATA, 1, 1;
}
while(<DATA>) {
    print " # $_";
}
__DATA__
1
2
3
EOT
print "START\n";
system "perl $test_file";
print "\nEND\n";
# >%
The above outputs:
START
UTF-16LE:Partial character at DATA_test.tmp line 7.
UTF-16LE:Partial character at DATA_test.tmp line 7.
Wide character in print at DATA_test.tmp line 8, <DATA> line 1.
UTF-16LE:Partial character at DATA_test.tmp line 8, <DATA> line 1.
 # ㄀਀㈀਀㌀਀
END
i.e. the multibyte data is being read out of sequence. But if the "if (0)"'d block is turned on, then the expected output of:
START
 # 1
 # 2
 # 3
END
is output.
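The garbled characters are consistent with the handle starting one byte before the first data character, which is also what a one-byte forward seek would correct. The following is a minimal sketch of that reading, not a reproduction of the bug itself: it builds the tail of the test file in memory with Encode and decodes the data once from the aligned offset and once from one byte earlier. The variable names are illustrative, and it assumes no BOM and LF newlines, as produced by the `:raw` layer in the test script.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Tail of the generated test file as UTF-16LE bytes (assumed: no BOM, LF newlines).
my $marker = "__DATA__\n";
my $data   = "1\n2\n3\n";
my $bytes  = encode('UTF-16LE', $marker . $data);

# Byte offset of the first data character, and the data length in bytes.
my $data_start = 2 * length($marker);
my $data_len   = 2 * length($data);

# Take the same number of bytes from the aligned offset and from one byte earlier.
my $aligned_bytes = substr($bytes, $data_start,     $data_len);
my $shifted_bytes = substr($bytes, $data_start - 1, $data_len);

my $aligned = decode('UTF-16LE', $aligned_bytes);
my $shifted = decode('UTF-16LE', $shifted_bytes);

# Show the decoded code points rather than the characters themselves.
my @aligned_cp = map { sprintf('U+%04X', ord($_)) } split //, $aligned;
my @shifted_cp = map { sprintf('U+%04X', ord($_)) } split //, $shifted;

print "aligned: @aligned_cp\n";
print "shifted: @shifted_cp\n";
```

In the shifted case every code unit pairs the high byte of one character with the low byte of the next, so "1" followed by a newline comes out as U+3100 and U+0A00, matching the characters in the reported output.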
Does your editor start the file with a BOM which you later skip with seek()?
-----Original Message-----
From: "Davies, Alex" (via RT) [mailto:perlbug-followup@perl.org]
Sent: Monday, 29 January 2007 13:07
To: bugs-bitbucket@netlabs.develooper.com
Subject: [perl #41368] DATA filehandle out by 1 on UTF16 source
# New Ticket Created by "Davies, Alex"
# Please include the string: [perl #41368]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=41368 >
This is a bug report for perl from adavies@ADAVIES13D.ptcnet.ptc.com,
generated with the help of perlbug 1.35 running under perl v5.9.4.
-----------------------------------------------------------------
[Please enter your report here]
If the DATA filehandle is read from a UTF16LE
encoded source file (on Windows at least), it appears
the initial offset is out by 1. That is, it can be fixed
by doing a C< seek DATA, 1, 1 >.
# %<
# Create a UTF16LE encoded test file:
my $test_file = "DATA_test.tmp";
#END { unlink $test_file };
open FOUT, ">:raw:perlio:encoding(utf16le)", $test_file or die;
print FOUT <<'EOT';
binmode DATA, ':encoding(utf16le)';
if (0) { # XXX this is needed
    seek DATA, 1, 1;
}
while(<DATA>) {
    print " # $_";
}
__DATA__
1
2
3
EOT
print "START\n";
system "perl $test_file";
print "\nEND\n";
# >%
The above outputs:
START
UTF-16LE:Partial character at DATA_test.tmp line 7.
UTF-16LE:Partial character at DATA_test.tmp line 7.
Wide character in print at DATA_test.tmp line 8, <DATA> line 1.
UTF-16LE:Partial character at DATA_test.tmp line 8, <DATA> line 1.
# ㄀਀㈀਀㌀਀
END
i.e. the multibyte data is being read out of sequence.
But if the "if (0)"'d block is turned on, then the expected
output of:
START
# 1
# 2
# 3
END
is output.
[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
category=core severity=low
---
This perlbug was built using Perl v5.8.7 - Thu Aug 11 14:02:10 2005
It is being executed now by Perl v5.9.4 - Mon Oct 2 14:06:34 2006.
Site configuration information for perl v5.9.4:
Configured by adavies at Mon Oct 2 14:06:34 2006.
Summary of my perl5 (revision 5 version 9 subversion 4) configuration:
Platform:
osname=MSWin32, osvers=5.1, archname=MSWin32-x86-multi-thread
uname=''
config_args='undef'
hint=recommended, useposix=true, d_sigaction=undef
useithreads=define, usemultiplicity=define
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=undef, use64bitall=undef, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cl', ccflags ='-nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32 -D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -DPERL_MSVCRT_READFIX',
optimize='-MD -Zi -DNDEBUG -O1',
cppflags='-DWIN32'
ccversion='12.00.8804', gccversion='', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=10
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='__int64', lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries:
ld='link', ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf -libpath:"c:\perl\lib\CORE" -machine:x86'
libpth=C:\PROGRA~1\MICROS~4\VC98\lib
libs= oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib msvcrt.lib
perllibs= oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib msvcrt.lib
libc=msvcrt.lib, so=dll, useshrplib=yes, libperl=perl59.lib
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug -opt:ref,icf -libpath:"c:\perl\lib\CORE" -machine:x86'
Locally applied patches:
---
@INC for perl v5.9.4:
D:/alex/src/perl/perl-5.9.4.tar/perl-5.9.4/lib
.
---
Environment for perl v5.9.4:
HOME=C:\alex
LANG (unset)
LANGUAGE (unset)
LD_LIBRARY_PATH (unset)
LOGDIR (unset)
PATH=C:\Program Files\Tcl-8.5.0.0b5\bin;C:\WINNT\system32;C:\WINNT;C:\WINNT\System32\Wbem;C:\Program Files\QuickTime\QTSystem\;C:\perl3\bin;D:\alex\bin;C:\cygwin\bin;C:\Program Files\Perforce;C:\Program Files\Microsoft Visual Studio\VC98\Bin;C:\Program Files\Microsoft Visual Studio\Common\MSDev98\Bin
PERL_BADLANG (unset)
SHELL (unset)
The RT System itself - Status changed from 'new' to 'open'
On Mon Jan 29 07:14:34 2007, dint wrote:
Does your editor start the file with a BOM which you later skip with seek()?
Yes it does - the standard 2 byte BOM.
I've tried the test without the BOM in the file, and get the same out-by-1 behaviour. I've also tried it using just LF newlines (as opposed to Windows' usual CRLF newlines) - again, same behaviour.
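For anyone repeating these checks, a small helper can confirm whether a given source file actually starts with the 2-byte UTF-16LE BOM before running the test. This is a sketch only: `has_utf16le_bom` is a hypothetical name, not something from the ticket or from Perl itself, and it looks only at the first two bytes (a UTF-32LE BOM begins with the same two bytes, so it would also match).

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the file begins with the bytes FF FE
# (the UTF-16LE byte order mark), 0 otherwise.
sub has_utf16le_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $count = read $fh, my $head, 2;    # raw bytes, no decoding layer
    close $fh;
    return (defined $count && $count == 2 && $head eq "\xFF\xFE") ? 1 : 0;
}

# Usage: check the test file from the report, if it exists in the current directory.
my $file = 'DATA_test.tmp';
if (-e $file) {
    printf "%s %s a UTF-16LE BOM\n", $file,
        has_utf16le_bom($file) ? 'starts with' : 'does not start with';
}
```

The file is opened with `:raw` so that no CRLF translation or decoding layer changes the bytes being inspected.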
Does your editor start the file with a BOM which you later skip with seek()?
Sorry, I missed that you create your UTF-16LE source "yourself".
On Mon Jan 29 04:07:18 2007, adavies@ptc.com wrote:
This is a bug report for perl from adavies@ADAVIES13D.ptcnet.ptc.com, generated with the help of perlbug 1.35 running under perl v5.9.4.
-----------------------------------------------------------------
[Please enter your report here]
If the DATA filehandle is read from a UTF16LE encoded source file (on Windows at least), it appears the initial offset is out by 1. That is, it can be fixed by doing a C< seek DATA, 1, 1 >.
Running (my line numbers differ from those in the quoted report)
______________________________________________________________
# Create a UTF16LE encoded test file:
my $test_file = "DATA_test.tmp";
#END { unlink $test_file };
open FOUT, ">:raw:perlio:encoding(utf16le)", $test_file or die;
print FOUT <<'EOT';
binmode DATA, ':encoding(utf16le)';
if (0) { # XXX this is needed
    seek DATA, 1, 1;
}
while(<DATA>) {
    print " # $_";
}
__DATA__
1
2
3
EOT
print "START\n";
system "perl $test_file";
print "\nEND\n";
__________________________________________________________________
on win32 Perl 5.10
__________________________________________________________________
C:\Documents and Settings\Owner\Desktop>perl 41368.pl
START
UTF-16LE:Partial character at DATA_test.tmp line 5.
UTF-16LE:Partial character at DATA_test.tmp line 5.
Wide character in print at DATA_test.tmp line 6, <DATA> line 1.
UTF-16LE:Partial character at DATA_test.tmp line 6, <DATA> line 1.
 # ㄀਀㈀਀㌀਀
END

C:\Documents and Settings\Owner\Desktop>
__________________________________________________________________
on win32 Perl 5.12
__________________________________________________________________
C:\Documents and Settings\Owner\Desktop>perl 41368.pl
START

END

C:\Documents and Settings\Owner\Desktop>
___________________________________________________________________
on win32 Perl 5.17.6
___________________________________________________________________
C:\p517\perl\win32>perl "C:\Documents and Settings\Owner\Desktop\41368.pl"
START

END

C:\p517\perl\win32>
___________________________________________________________________
I suggest that someone who knows more about perlio/layers/encoding/unicode comment in this ticket on whether there was a bug in the past, and whether there is still a bug in blead now.
-- bulk88 ~ bulk88 at hotmail.com
On Wed, Dec 26, 2012 at 4:48 PM, bulk88 via RT <perlbug-followup@perl.org> wrote:
I suggest that someone who knows more about perlio/layers/encoding/unicode comment in this ticket on whether there was a bug in the past, and whether there is still a bug in blead now.
If there was, there still is. Except now it's not off by one, it's off by a few hundred.
-----BEGIN UPDATED CODE-----
# Create a UTF16LE encoded test file:
my $test_file = "DATA_test.tmp";
#END { unlink $test_file };
open FOUT, ">:raw:perlio:encoding(utf16le)", $test_file or die;
print FOUT <<'EOT';
binmode DATA, ':encoding(utf16le)';
while(<DATA>) {
    print " # $_";
}
__DATA__
EOT
print FOUT "$_\n" for 1..50;
print "START\n";
system "perl $test_file";
print "\nEND\n";
-----END UPDATED CODE-----
-----BEGIN OUTPUT-----
START
 # 27
 # 28
 # 29
 # 30
 # 31
 # 32
 # 33
 # 34
 # 35
 # 36
 # 37
 # 38
 # 39
 # 40
 # 41
 # 42
 # 43
 # 44
 # 45
 # 46
 # 47
 # 48
 # 49
 # 50

END
-----END OUTPUT-----
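To put a number on how far off the read position is in this run, the amount of skipped data can be computed from the test itself. The sketch below assumes that the first line printed (27) is read whole, so exactly lines 1 through 26 were skipped, and that newlines are plain LF as written through the `:raw` layer. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# First line that appears in the output above.
my $first_printed = 27;

# Everything written to the DATA section before that line: "1\n2\n...26\n".
my $skipped = join("\n", 1 .. $first_printed - 1) . "\n";

my $skipped_chars = length($skipped);
my $skipped_bytes = length(encode('UTF-16LE', $skipped));

printf "skipped %d characters (%d bytes as UTF-16LE)\n",
    $skipped_chars, $skipped_bytes;
```

Under those assumptions the handle starts 69 characters, or 138 bytes, past the start of the data.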
When Perl detects that the file is in UTF-16, it adds a source filter inside toke.c and delivers the file as UTF-8 to the rest of the parser. I suspect this is a units issue: the offset passed to PerlIO to indicate where reading should begin is in terms of two-byte units, while PerlIO thinks it is in terms of single bytes and so multiplies by 2, producing an offset further into the file than intended. @leont, does this idea point you to where to check whether it's true?
Migrated from rt.perl.org#41368 (status was 'open')
Searchable as RT41368