Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

DATA filehandle off on UTF16 source #8754

Open p5pRT opened 17 years ago

p5pRT commented 17 years ago

Migrated from rt.perl.org#41368 (status was 'open')

Searchable as RT41368$

p5pRT commented 17 years ago

From adavies@ptc.com

Created by adavies@ADAVIES13D.ptcnet.ptc.com

If the DATA filehandle is read from a UTF16LE encoded source file (on Windows at least), it appears the initial offset is out by 1. That is, it can be fixed by doing a C< seek DATA, 1, 1 >.

    # %<
    # Create a UTF16LE encoded test file:
    my $test_file = "DATA_test.tmp";
    #END { unlink $test_file };
    open FOUT, ">:raw:perlio:encoding(utf16le)", $test_file or die;

    print FOUT <<'EOT';
    binmode DATA, ':encoding(utf16le)';
    if (0) { # XXX this is needed
      seek DATA, 1, 1;
    }
    while(<DATA>) {
      print " # $_";
    }
    __DATA__
    1
    2
    3
    EOT

    print "START\n";
    system "perl $test_file";
    print "\nEND\n";
    # >%

The above outputs:

    START
    UTF-16LE:Partial character at DATA_test.tmp line 7.
    UTF-16LE:Partial character at DATA_test.tmp line 7.
    Wide character in print at DATA_test.tmp line 8, <DATA> line 1.
    UTF-16LE:Partial character at DATA_test.tmp line 8, <DATA> line 1.
     # ㄀਀㈀਀㌀਀
    END

i.e. the multibyte data is being read out of sequence. But if the "if (0)"'d block is turned on, then the expected output of:

    START
     # 1
     # 2
     # 3

    END

is output.
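The garbled characters above are consistent with decoding that starts one byte before the first data character. The following is a minimal sketch, not part of the original report, that reproduces the same code points from an in-memory string; the helper name `decode_from` is hypothetical, and the script text is reduced to the `__DATA__` marker plus the three data lines.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: decode UTF-16LE bytes starting at a given byte offset.
# A trailing odd byte is dropped so that only whole 2-byte units are decoded.
sub decode_from {
    my ($bytes, $offset) = @_;
    my $tail = substr($bytes, $offset);
    chop $tail if length($tail) % 2;
    return decode('UTF-16LE', $tail);
}

# Assumed stand-in for the tail of the test file, encoded without a BOM.
my $bytes = encode('UTF-16LE', "__DATA__\n1\n2\n3\n");
my $start = 2 * length("__DATA__\n");    # byte offset of the character "1"

# Aligned read: yields the expected data.
my $good = decode_from($bytes, $start);

# Read starting one byte early: each 2-byte unit straddles two characters.
my $bad = decode_from($bytes, $start - 1);

printf "aligned:    %s\n",
    join ' ', map { sprintf('U+%04X', ord $_) } split //, $good;
printf "misaligned: %s\n",
    join ' ', map { sprintf('U+%04X', ord $_) } split //, $bad;
```

The misaligned read yields U+3100 U+0A00 U+3200 U+0A00 U+3300 U+0A00, the same six characters shown in the output above, which fits the observation that seeking forward by one byte restores the expected data.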

Perl Info

```
---
Flags:
    category=core
    severity=low
---
This perlbug was built using Perl v5.8.7 - Thu Aug 11 14:02:10 2005
It is being executed now by Perl v5.9.4 - Mon Oct 2 14:06:34 2006.

Site configuration information for perl v5.9.4:

Configured by adavies at Mon Oct 2 14:06:34 2006.

Summary of my perl5 (revision 5 version 9 subversion 4) configuration:
  Platform:
    osname=MSWin32, osvers=5.1, archname=MSWin32-x86-multi-thread
    uname=''
    config_args='undef'
    hint=recommended, useposix=true, d_sigaction=undef
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cl', ccflags ='-nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32 -D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -DPERL_MSVCRT_READFIX',
    optimize='-MD -Zi -DNDEBUG -O1',
    cppflags='-DWIN32'
    ccversion='12.00.8804', gccversion='', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=10
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='__int64', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='link', ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf -libpath:"c:\perl\lib\CORE" -machine:x86'
    libpth=C:\PROGRA~1\MICROS~4\VC98\lib
    libs= oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib msvcrt.lib
    perllibs= oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib msvcrt.lib
    libc=msvcrt.lib, so=dll, useshrplib=yes, libperl=perl59.lib
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug -opt:ref,icf -libpath:"c:\perl\lib\CORE" -machine:x86'

Locally applied patches:

---
@INC for perl v5.9.4:
    D:/alex/src/perl/perl-5.9.4.tar/perl-5.9.4/lib
    .
---
Environment for perl v5.9.4:
    HOME=C:\alex
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=C:\Program Files\Tcl-8.5.0.0b5\bin;C:\WINNT\system32;C:\WINNT;C:\WINNT\System32\Wbem;C:\Program Files\QuickTime\QTSystem\;C:\perl3\bin;D:\alex\bin;C:\cygwin\bin;C:\Program Files\Perforce;C:\Program Files\Microsoft Visual Studio\VC98\Bin;C:\Program Files\Microsoft Visual Studio\Common\MSDev98\Bin
    PERL_BADLANG (unset)
    SHELL (unset)
```

p5pRT commented 17 years ago

From Peter.Dintelmann@dresdner-bank.com

Does your editor start the file with a BOM which you later skip with seek()?

p5pRT commented 17 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 17 years ago

From guest@guest.guest.xxxxxxxx

On Mon Jan 29 07:14:34 2007, dint wrote:

> Does your editor start the file with a BOM which you later skip with seek()?

Yes it does - the standard 2 byte BOM.

I've tried the test without the BOM in the file, and get the same out-by-1 behaviour. I've also tried it using just LF newlines (as opposed to Windows' usual CRLF newlines) - again same behaviour.
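For reference, a UTF-16LE BOM is the byte pair FF FE at the very start of the file. The following is a minimal sketch of how one might check for it before running the test; the helper name `has_utf16le_bom` is hypothetical and not part of the original test script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns true if the file starts with the UTF-16LE
# byte order mark (bytes 0xFF 0xFE). A UTF-32LE BOM also begins with these
# two bytes, so this check alone does not tell the two apart.
sub has_utf16le_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $got = read $fh, my $head, 2;
    close $fh;
    return (defined $got && $got == 2 && $head eq "\xFF\xFE") ? 1 : 0;
}
```

Calling `has_utf16le_bom("DATA_test.tmp")` after the test file has been written shows which of the two cases (with or without BOM) is being exercised.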

p5pRT commented 17 years ago

From Peter.Dintelmann@dresdner-bank.com

> Does your editor start the file with a BOM which you later skip with seek()?

Sorry, I missed that you create your UTF-16LE source "yourself".

p5pRT commented 11 years ago

From @bulk88

On Mon Jan 29 04:07:18 2007, adavies@ptc.com wrote:

> This is a bug report for perl from adavies@ADAVIES13D.ptcnet.ptc.com, generated with the help of perlbug 1.35 running under perl v5.9.4.
>
> [Please enter your report here]
>
> If the DATA filehandle is read from a UTF16LE encoded source file (on Windows at least), it appears the initial offset is out by 1. That is, it can be fixed by doing a C< seek DATA, 1, 1 >.

Running (my line numbers are different than the quote below)

______________________________________________________________

    # Create a UTF16LE encoded test file:
    my $test_file = "DATA_test.tmp";
    #END { unlink $test_file };
    open FOUT, ">:raw:perlio:encoding(utf16le)", $test_file or die;

    print FOUT <<'EOT';
    binmode DATA, ':encoding(utf16le)';
    if (0) { # XXX this is needed
      seek DATA, 1, 1;
    }
    while(<DATA>) {
      print " # $_";
    }
    __DATA__
    1
    2
    3
    EOT

    print "START\n";
    system "perl $test_file";
    print "\nEND\n";

______________________________________________________________

on win32 Perl 5.10

______________________________________________________________

    C:\Documents and Settings\Owner\Desktop>perl 41368.pl
    START
    UTF-16LE:Partial character at DATA_test.tmp line 5.
    UTF-16LE:Partial character at DATA_test.tmp line 5.
    Wide character in print at DATA_test.tmp line 6, <DATA> line 1.
    UTF-16LE:Partial character at DATA_test.tmp line 6, <DATA> line 1.
     # ㄀਀㈀਀㌀਀
    END

    C:\Documents and Settings\Owner\Desktop>

______________________________________________________________

on win32 Perl 5.12

______________________________________________________________

    C:\Documents and Settings\Owner\Desktop>perl 41368.pl
    START

    END

    C:\Documents and Settings\Owner\Desktop>

______________________________________________________________

on win32 Perl 5.17.6

______________________________________________________________

    C:\p517\perl\win32>perl "C:\Documents and Settings\Owner\Desktop\41368.pl"
    START

    END

    C:\p517\perl\win32>

______________________________________________________________

I suggest that someone who knows more about perlio/layers/encoding/unicode comment in this ticket on whether there was a bug here in the past, and whether there is still a bug in blead at present.

-- bulk88 ~ bulk88 at hotmail.com

p5pRT commented 11 years ago

From @ikegami

On Wed, Dec 26, 2012 at 4:48 PM, bulk88 via RT <perlbug-followup@perl.org> wrote:

> I suggest for someone who knows more about perlio/layers/encoding/unicode to comment in this ticket on whether there was a bug in the past in this ticket, and is there still a bug in blead in the present or not.

If there was, there still is. Except now it's not off by one, it's off by a few hundred.

-----BEGIN UPDATED CODE-----

    # Create a UTF16LE encoded test file:
    my $test_file = "DATA_test.tmp";
    #END { unlink $test_file };
    open FOUT, ">:raw:perlio:encoding(utf16le)", $test_file or die;

    print FOUT <<'EOT';
    binmode DATA, ':encoding(utf16le)';
    while(<DATA>) {
      print " # $_";
    }
    __DATA__
    EOT

    print FOUT "$_\n" for 1..50;

    print "START\n";
    system "perl $test_file";
    print "\nEND\n";

-----END UPDATED CODE-----

-----BEGIN OUTPUT-----

    START
     # 27
     # 28
     # 29
     # 30
     # 31
     # 32
     # 33
     # 34
     # 35
     # 36
     # 37
     # 38
     # 39
     # 40
     # 41
     # 42
     # 43
     # 44
     # 45
     # 46
     # 47
     # 48
     # 49
     # 50

    END

-----END OUTPUT-----

khwilliamson commented 2 years ago

When Perl detects that the file is in UTF-16, it adds a source filter inside toke.c and delivers the file as UTF-8 to the rest of the parser. I suspect this is a units issue: the offset that is passed to PerlIO to indicate where to begin reading is in terms of two-byte units, and PerlIO thinks it is in terms of single bytes, so it multiplies by 2, creating an offset further into the file at which to start reading. @leont does this idea lead you to where to check if it's true?
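To make the units question concrete, here is a minimal sketch (not taken from toke.c or PerlIO; `data_offsets` is a hypothetical helper and the sample source string is made up) that measures the position just after `__DATA__` in the same source text in characters, in UTF-8 bytes, and in UTF-16LE bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: position of the first character after "__DATA__\n",
# expressed in characters, UTF-8 bytes, and UTF-16LE bytes.
sub data_offsets {
    my ($source) = @_;
    my $marker = "__DATA__\n";
    my $pos    = index($source, $marker);
    die "no __DATA__ marker found" if $pos < 0;
    my $head = substr($source, 0, $pos + length($marker));
    return (
        chars   => length($head),
        utf8    => length(encode('UTF-8', $head)),
        utf16le => length(encode('UTF-16LE', $head)),
    );
}

# Assumed sample script text, ASCII only.
my $source = "while (<DATA>) { print }\n__DATA__\n1\n2\n3\n";
my %off    = data_offsets($source);
printf "chars=%d utf8=%d utf16le=%d\n", @off{qw(chars utf8 utf16le)};
```

For ASCII-only source the UTF-16LE byte offset is exactly twice the other two, so an offset computed in one unit and applied in another would land in the wrong place. This only illustrates the arithmetic; it does not show which unit the tokenizer or PerlIO actually uses.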