Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

New unicode support / bug or feature request #1985

Closed p5pRT closed 20 years ago

p5pRT commented 24 years ago

Migrated from rt.perl.org#3258 (status was 'resolved')

Searchable as RT3258$

p5pRT commented 24 years ago

From tim.burlowski@veritas.com

Wishing to try to understand the new Unicode features in Perl 5.6, I started playing.

Lacking a decent Unicode editor on my Solaris system, I foolishly fired up Notepad in Windows 2000.

I saved my happy Perl script as a UTF-8 file.

I got the following error:

Unrecognized character \xEF at D:\perlscripts\unicode.pl line 1.

What's that? I don't remember typing in \xEF ...

So I fired up od to look at the file. Say, there are a few suspicious bytes at the beginning of the file.

Turns out, after some annoying research, these characters are "Byte Order Marks".

OK, I didn't ask for them, but at least now I understand.

So should Perl ignore the BOM at the beginning of a Perl script? If so, I found a bug; if not, I would like to make a feature request.

The Unicode BOM is never a valid character so it would always be safe to ignore as far as I am concerned.
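
For illustration, a minimal sketch in C of what "ignoring the BOM" amounts to for a UTF-8 file: if the buffer starts with the three bytes EF BB BF (the UTF-8 encoding of U+FEFF), skip them. The helper name `skip_utf8_bom` is hypothetical and is not taken from the perl source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the perl source): return the number of
 * leading bytes to skip if buf starts with the UTF-8 encoding of U+FEFF
 * (EF BB BF), or 0 otherwise. The length check comes first so that short
 * buffers are never read past their end. */
static size_t skip_utf8_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
    return 0;
}
```

A caller would advance its read pointer by the returned count before looking for the `#!` line.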

The Perl script is attached. You may wish to use 'od' to examine the contents, especially if you are using Notepad from Microsoft.

Let me know if you have further questions. I would like to know what the outcome is.

thanks,

tim burlowski 5/18/19100

__DATA__

Output from 'perlbug -d' if that's useful.


---
Flags:
    category=
    severity=
---
Site configuration information for perl v5.6.0:

Configured by txb at Thu Apr 27 13:33:42 CDT 2000.

Summary of my perl5 (revision 5.0 version 6 subversion 0) configuration:
  Platform:
    osname=solaris, osvers=2.6, archname=sun4-solaris
    uname='sunos mum 5.6 generic_105181-05 sun4u sparc sunw,ultra-5_10 '
    config_args=''
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=undef d_sfio=undef uselargefiles=define
    use64bitint=undef use64bitall=undef uselongdouble=undef usesocks=undef
  Compiler:
    cc='cc', optimize='-O', gccversion=
    cppflags=''
    ccflags =' -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    stdchar='unsigned char', d_stdstdio=define, usevfork=false
    intsize=4, longsize=4, ptrsize=4, doublesize=8
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, usemymalloc=y, prototype=define
  Linker and Libraries:
    ld='cc', ldflags ='-L/opt/SUNWspro/SC4.2/lib '
    libpth=/usr/local/lib /opt/SUNWspro/SC4.2/lib /lib /usr/lib /usr/ccs/lib
    libs=-lsocket -lnsl -ldl -lm -lc -lcrypt -lsec
    libc=, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
    cccdlflags='-KPIC', lddlflags='-G -L/opt/SUNWspro/SC4.2/lib'

Locally applied patches:

---
@INC for perl v5.6.0:
    /opt/perl56/lib/5.6.0/sun4-solaris
    /opt/perl56/lib/5.6.0
    /opt/perl56/lib/site_perl/5.6.0/sun4-solaris
    /opt/perl56/lib/site_perl/5.6.0
    /opt/perl56/lib/site_perl
    .

---
Environment for perl v5.6.0:
    HOME=[snip]
    LANG=C
    LANGUAGE (unset)
    LD_LIBRARY_PATH= [snip]
    LOGDIR (unset)
    PATH= [snip]
    PERL_BADLANG (unset)
    SHELL=/bin/csh

p5pRT commented 24 years ago

From tim.burlowski@veritas.com

unicode.pl

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

------=_NextPart_000_0000_01BFC0EA.8A9DC960
Content-Type: application/octet-stream; name="unicode.pl"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename="unicode.pl"

=EF=BB=BF#!/usr/local/perl/bin

# Test of new unicode pragma in Perl 5.6

use utf8;
use strict;
use warnings;

while (<DATA>) {
    my $len =3D length;
    print "line is $len characters long\n";
}

__DATA__
TEST
=EF=BC=B4=EF=BC=A5=EF=BC=B3=EF=BC=B4
------=_NextPart_000_0000_01BFC0EA.8A9DC960--
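
For readers decoding the attachment by hand: the quoted-printable bytes EF BC B4, EF BC A5, EF BC B3, EF BC B4 in the data section are the UTF-8 encodings of U+FF34, U+FF25, U+FF33 and U+FF34 (fullwidth "TEST"), so that data is 12 bytes but 4 characters. The sketch below, in C, only illustrates that byte/character distinction; the function name `utf8_char_count` is hypothetical, the input is assumed to be well-formed UTF-8, and nothing here claims to reproduce what perl 5.6 printed for this script.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points in a UTF-8 byte string by
 * counting every byte that is not a continuation byte (10xxxxxx).
 * Assumes the input is well-formed UTF-8. */
static size_t utf8_char_count(const unsigned char *s, size_t len)
{
    size_t n = 0;
    size_t i;
    for (i = 0; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            n++;
    return n;
}
```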

p5pRT commented 24 years ago

From @gsar

On Wed, 07 Jun 2000 15:26:42 +0200, Richard Foley wrote:

So should Perl ignore the BOM at the beginning of a Perl script?

Yes, most definitely.

In case someone wants to fix this, the place to patch would be toke.c near the Perl_croak().

Sarathy gsar@ActiveState.com

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Useful information on Byte Order Marks can be found at http://www.unicode.org/unicode/faq/ in the sections toward the bottom under the heading BOM.

tim burlowski June 7, 2000

-----Original Message-----
From: Gurusamy Sarathy [mailto:gsar@activestate.com]
Sent: Wednesday, June 07, 2000 10:57 AM
To: Tim Burlowski
Cc: perl5-porters@perl.org
Subject: Re: [ID 20000518.005] New unicode support / bug or feature request

On Wed, 07 Jun 2000 15:26:42 +0200, Richard Foley wrote:

So should Perl ignore the BOM at the beginning of a Perl script?

Yes, most definitely.

In case someone wants to fix this, the place to patch would be toke.c near the Perl_croak().

Sarathy gsar@ActiveState.com

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Richard Foley (lists.p5p):

Lacking a decent unicode editor on my Solaris system I foolishly fired up Notepad in Windows 2000.

I saved my happy perl script as a utf8 file.

More likely as a UTF-16 file, since that's what Windows Notepad saves it as. (Here, at least, but I'm using the Japanese Windows so my mileage may vary.)

Skipping BOMs is trivial and I'll have a patch in a few hours; reading UTF-16 program files is a bigger problem. Do we want to tackle this yet?

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Gurusamy Sarathy (lists.p5p):

In case someone wants to fix this, the place to patch would be toke.c near the Perl_croak().

Are you sure? I'd have thought you'd want to process the BOM before the shebang line, since it has to be the first thing in the file. If you process it up near the croak, #! won't be the first thing on line one, and your shebangs won't work any more. Here's how I'd do it, anyway:

Inline Patch

```diff
--- toke.c~	Thu Jun 08 23:48:35 2000
+++ toke.c	Fri Jun 09 01:59:43 2000
@@ -2509,6 +2509,22 @@
     }
     PL_bufend = SvPVX(PL_linestr) + SvCUR(PL_linestr);
     if (CopLINE(PL_curcop) == 1) {
+        /* Detect BOMs */
+        if (*s == -1 && s[1] == -2) { /* UTF-16 little-endian */
+            s+=2;
+            if (*s == 0 && s[1] == 0) /* UTF-32 little-endian */
+                s+=2;
+        }
+
+        if (*s == 0 && s[1] == 0 && s[2] == -2 && s[3] == -1)
+            s+=4; /* UTF-32 big-endian */
+
+        if (*s == -2 && s[1] == -1) /* UTF-16 big-endian */
+            s+=2;
+
+        if (*s == -17 && s[1] == 187-256 && s[2] == 191-256)
+            s+=3; /* UTF-8 */
+
         while (s < PL_bufend && isSPACE(*s))
             s++;
         if (*s == ':' && s[1] != ':') /* for csh execing sh scripts */
```
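
The patch compares `*s` against negative constants (-1, -2, -17, 187-256, 191-256). That matches the bytes 0xFF, 0xFE, 0xEF, 0xBB and 0xBF only where plain `char` is a signed 8-bit type; where `char` is unsigned, none of those comparisons can succeed. A hedged sketch of a comparison that works either way is below. The helper name `byte_is` is hypothetical and is not part of toke.c.

```c
#include <assert.h>

/* Hypothetical helper: compare a plain char against a byte value in a way
 * that does not depend on whether char is signed. Converting to unsigned
 * char first maps the stored byte to the range 0..255. */
static int byte_is(char c, unsigned int value)
{
    return (unsigned char)c == value;
}
```

With such a helper the UTF-8 test would read `byte_is(s[0], 0xEF) && byte_is(s[1], 0xBB) && byte_is(s[2], 0xBF)`; the follow-up patch later in this thread gets the same effect by masking with `& 255`.
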
p5pRT commented 24 years ago

From @gsar

On 08 Jun 2000 04:35:46 GMT, Simon Cozens wrote:

More likely as a UTF-16 file, since that's what Windows Notepad saves it as. (here, at least, but I'm using the Japanese Windows so my mileage may vary.)

The notepad that comes with Windows 2000 can save files as UTF-8.

Skipping BOMs is trivial and I'll have a patch in a few hours; reading UTF-16 program files is a bigger problem. Do we want to tackle this yet?

It may be easier than you think--grep for utf16_textfilter(). All it probably needs is a filter_{add,del}() at a strategic location.

Sarathy gsar@ActiveState.com
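
As background on what a UTF-16 source filter has to do, here is a minimal standalone sketch in C that converts little-endian UTF-16 bytes to UTF-8. It is not perl's `utf16_textfilter()` or `utf16_to_utf8()`; the function name `utf16le_to_utf8`, its signature and its error convention are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative sketch only; not the perl implementation.
 * Converts little-endian UTF-16 bytes to UTF-8. `out` must have room for
 * at least (in_len / 2) * 3 bytes. Returns the number of bytes written,
 * or (size_t)-1 on an odd input length or an unpaired surrogate. */
static size_t utf16le_to_utf8(const unsigned char *in, size_t in_len,
                              unsigned char *out)
{
    size_t i = 0, o = 0;
    if (in_len % 2 != 0)
        return (size_t)-1;
    while (i < in_len) {
        unsigned long cp = in[i] | ((unsigned long)in[i + 1] << 8);
        i += 2;
        if (cp >= 0xD800 && cp <= 0xDBFF) {         /* high surrogate */
            unsigned long lo;
            if (i + 1 >= in_len)
                return (size_t)-1;                  /* nothing follows it */
            lo = in[i] | ((unsigned long)in[i + 1] << 8);
            if (lo < 0xDC00 || lo > 0xDFFF)
                return (size_t)-1;                  /* not a low surrogate */
            i += 2;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {  /* stray low surrogate */
            return (size_t)-1;
        }
        if (cp < 0x80) {                            /* 1-byte form */
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {                    /* 2-byte form */
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {                  /* 3-byte form */
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {                                    /* 4-byte form */
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

Each 16-bit unit expands to at most three UTF-8 bytes, and a surrogate pair (four input bytes) to four, which is why a buffer of 3/2 times the input size is enough.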

p5pRT commented 24 years ago

From @gsar

On 08 Jun 2000 05:16:33 GMT, Simon Cozens wrote:

Gurusamy Sarathy (lists.p5p):

In case someone wants to fix this, the place to patch would be toke.c near the Perl_croak().

Are you sure? I'd have thought you'd want to process the BOM before the shebang line, since it has to be the first thing in the file.

I thought about that, but you may be better off handling it as an exceptional case and jumping back to case 0 if you happen to be at the beginning of the file. IOW, pretend you're at the beginning again after you've consumed the BOM.

    --- toke.c~ Thu Jun 08 23:48:35 2000
    +++ toke.c Fri Jun 09 01:59:43 2000
    @@ -2509,6 +2509,22 @@
         }
         PL_bufend = SvPVX(PL_linestr) + SvCUR(PL_linestr);
         if (CopLINE(PL_curcop) == 1) {

I haven't checked, but this might be spoofed by:

  #line 1 "gotcha"
  BOMBOMBOMBOOOOOM

    +        /* Detect BOMs */
    +        if (*s == -1 && s[1] == -2) { /* UTF-16 little-endian */
    +            s+=2;
    +            if (*s == 0 && s[1] == 0) /* UTF-32 little-endian */
    +                s+=2;

If we're not translating UTF-16 and UTF-32, we should croak here. Or, better to move it where the other croak happens.

Or, try adding a filter_add() here and see what happens.

    +        }
    +
    +        if (*s == 0 && s[1] == 0 && s[2] == -2 && s[3] == -1)

This might need a strlen(s) > 3 check.

    +            s+=4; /* UTF-32 big-endian */
    +
    +        if (*s == -2 && s[1] == -1) /* UTF-16 big-endian */
    +            s+=2;
    +
    +        if (*s == -17 && s[1] == 187-256 && s[2] == 191-256)
    +            s+=3; /* UTF-8 */

Sarathy gsar@ActiveState.com

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Gurusamy Sarathy (lists.p5p):

I thought about that, but you may be better off handling it as an exceptional case and jumping back to case 0 if you happen to be at the beginning of the file.

If it isn't at the start of line one in a file, it isn't a BOM; it's just an ordinary ZWNBSP:

  In the absence of such protocols and when not at the beginning of a
  text stream, U+FEFF is given its normal interpretation, as ZERO WIDTH
  NON-BREAKING SPACE, and is part of the content of the file or string.

Hence, we do need to check to see if we're at the beginning of the first line of input before consuming a BOM.

I haven't checked, but this might be spoofed by: #line 1 "gotcha" BOMBOMBOMBOOOOOM

That's a feature. No, seriously, it is - it allows you to combine UTF encoded and non-encoded code. By declaring yourself to be at the beginning of file and dropping a BOM, you're saying "Everything below here is UTF encoded". The alternative explanation would be that it's a ZWNBSP on pseudo-"line 1" and should be ignored.

It may be easier than you think--grep for utf16_textfilter(). All it probably needs is a filter_{add,del}() at a strategic location.

Good golly. It works. We have to remember to convert the buffer we've already read from UTF16 down to UTF8, and then add the filter, but after that, it Just Works. I've made it a compile-time flag, but I've tested it and it works perfectly.

Inline Patch

```diff
--- toke.c~	Thu Jun 08 23:48:35 2000
+++ toke.c	Fri Jun 09 05:42:36 2000
@@ -163,6 +163,9 @@
 /* grandfather return to old style */
 #define OLDLOP(f) return(yylval.ival=f,PL_expect = XTERM,PL_bufptr = s,(int)LSTOP)
 
+/* Are we at the beginning of a file? (or pretending to be) */
+#define AT_BOF(x) (CopLINE(PL_curcop) == 1) && (x == PL_linestart)
+
 /*
  * S_ao
  *
@@ -326,7 +329,7 @@
 }
 #endif
 
-#if 0
+#ifdef PERL_UTF16_FILTER
 STATIC I32
 S_utf16_textfilter(pTHX_ int idx, SV *sv, int maxlen)
 {
@@ -2383,14 +2386,62 @@
 
   retry:
     switch (*s) {
+    /* Detect BOMs */
+    case -1:
+        if (AT_BOF(s) && (s[1] & 255) == 254) {
+            /* UTF-16 little-endian */
+            U8 *news;
+            s+=2;
+            if (*s == 0 && s[1] == 0) /* UTF-32 little-endian */
+                Perl_croak(aTHX_ "Unsupported script encoding");
+#ifdef PERL_UTF16_FILTER
+            filter_add(S_utf16rev_textfilter, NULL);
+            New(898, news, (PL_bufend - s) * 3 / 2 + 1, U8);
+            PL_bufend = utf16_to_utf8((U16*)s, news, PL_bufend - s);
+            s = news;
+            goto bof;
+#else
+            Perl_croak(aTHX_ "Unsupported script encoding");
+#endif
+        }
+        goto badchar;
+
+    case -2:
+        if ( AT_BOF(s) && (s[1] & 255) == 255) { /* UTF-16 big-endian */
+#ifdef PERL_UTF16_FILTER
+            U8 *news;
+            filter_add(S_utf16_textfilter, NULL);
+            New(898, news, (PL_bufend - s) * 3 / 2 + 1, U8);
+            PL_bufend = utf16_to_utf8((U16*)s, news, PL_bufend - s);
+            s = news;
+            goto bof;
+#else
+            Perl_croak(aTHX_ "Unsupported script encoding");
+#endif
+        }
+        goto badchar;
+
+    case -17:
+        if ( AT_BOF(s) && strlen(s)>2 && (s[1] & 255) == 187 && (s[2] & 255) == 191) {
+            s+=3; /* UTF-8 */
+            goto bof;
+        }
+        /* FALL THROUGH */
+
     default:
         if (isIDFIRST_lazy_if(s,UTF))
             goto keylookup;
+      badchar:
         Perl_croak(aTHX_ "Unrecognized character \\x%02X", *s & 255);
     case 4:
     case 26:
         goto fake_eof; /* emulate EOF on ^D or ^Z */
     case 0:
+        /* One last BOM check... */
+        if (strlen(s) > 3 && s[1] == 0 && /* UTF-32 big-endian */
+            s[2] & 255 == 254 && s[3] & 255 == 255)
+            Perl_croak(aTHX_ "Unsupported script encoding");
+
         if (!PL_rsfp) {
             PL_last_uni = 0;
             PL_last_lop = 0;
@@ -2509,6 +2560,7 @@
     }
     PL_bufend = SvPVX(PL_linestr) + SvCUR(PL_linestr);
     if (CopLINE(PL_curcop) == 1) {
+      bof:
         while (s < PL_bufend && isSPACE(*s))
             s++;
         if (*s == ':' && s[1] != ':') /* for csh execing sh scripts */
```

Inline Patch

```diff
--- pod/perldiag.pod~	Thu Jun  8 18:53:26 2000
+++ pod/perldiag.pod	Thu Jun  8 18:55:44 2000
@@ -3401,6 +3401,11 @@
 of Perl executables, some of which may support fork, some not. Try
 changing the name you call Perl by to C, C, and so on.
 
+=item Unsupported script encoding
+
+(F) Your program file begins with a Unicode Byte Order Mark (BOM) which
+declares it to be in a Unicode encoding that Perl cannot yet read.
+
 =item Unsupported socket function "%s" called
 
 (F) Your machine doesn't support the Berkeley socket mechanism, or at
```
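
A note on one expression in the toke.c patch in this comment: in the "One last BOM check" the tests are written `s[2] & 255 == 254` and `s[3] & 255 == 255`, without the parentheses used in the other cases. In C, `==` binds more tightly than `&`, so `s[2] & 255 == 254` is parsed as `s[2] & (255 == 254)`, which is always 0. The sketch below demonstrates the difference; both function names are hypothetical and exist only for this example.

```c
#include <assert.h>

/* How the compiler parses `c & 255 == 254`: the comparison 255 == 254 is
 * evaluated first and yields 0, so the result is c & 0, i.e. always 0. */
static int as_parsed(char c)
{
    return c & (255 == 254);
}

/* The presumably intended test, with explicit parentheses: mask the byte
 * to 0..255 first, then compare it with 254 (0xFE). */
static int parenthesized(char c)
{
    return (c & 255) == 254;
}
```

So for a byte of 0xFE the first form yields 0 and the second yields 1, which means the UTF-32 big-endian check as posted could never fire.
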
p5pRT commented 24 years ago

From @gsar

On 08 Jun 2000 09:57:19 GMT, Simon Cozens wrote:

Gurusamy Sarathy (lists.p5p):

I thought about that, but you may be better off handling it as an exceptional case and jumping back to case 0 if you happen to be at the beginning of the file.

If it isn't at the start of line one in a file, it isn't a BOM; it's just an ordinary ZWNBSP:

s/file/stream/

In the absence of such protocols and when not at the beginning of a text stream, U+FEFF is given its normal interpretation, as ZERO WIDTH NON-BREAKING SPACE, and is part of the content of the file or string.

Hence, we do need to check to see if we're at the beginning of the first line of input before consuming a BOM.

I haven't checked, but this might be spoofed by: #line 1 "gotcha" BOMBOMBOMBOOOOOM

That's a feature. No, seriously, it is - it allows you to combine UTF encoded and non-encoded code. By declaring yourself to be at the beginning of file and dropping a BOM, you're saying "Everything below here is UTF encoded". The alternative explanation would be that it's a ZWNBSP on pseudo-"line 1" and should be ignored.

This superficially sounds good, but we may be breaking the BOM rules here, and applications (including perl scripts) will probably croak on such files because they're not well-formed encodings. I don't think we want to be encouraging such breakage by claiming to "support" such files.

It may be easier than you think--grep for utf16_textfilter(). All it probably needs is a filter_{add,del}() at a strategic location.

Good golly. It works.

Cool.

    @@ -2383,14 +2386,62 @@

      retry:
        switch (*s) {
    +    /* Detect BOMs */
    +    case -1:
    +        if (AT_BOF(s) && (s[1] & 255) == 254) {
    +            /* UTF-16 little-endian */
    +            U8 *news;
    +            s+=2;
    +            if (*s == 0 && s[1] == 0) /* UTF-32 little-endian */

Needs a length check (e.g. an empty file).

    +                Perl_croak(aTHX_ "Unsupported script encoding");
    +#ifdef PERL_UTF16_FILTER
    +            filter_add(S_utf16rev_textfilter, NULL);

These filters may need to be filter_del()eted somewhere. Maybe it's ok without it too, I haven't checked.

    +        /* One last BOM check... */
    +        if (strlen(s) > 3 && s[1] == 0 && /* UTF-32 big-endian */

Those two will never be true at the same time.

These length checks should really be done via PL_bufend rather than strlen() (sorry for the misleading pseudocode earlier).

Thanks!

Sarathy gsar@ActiveState.com
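
To make the point about length checks concrete, here is a standalone sketch in C that classifies a BOM using an explicit end pointer (playing the role of PL_bufend) instead of strlen(). UTF-16 and UTF-32 text routinely contains NUL bytes, so strlen() stops early on exactly the inputs being tested. This is an illustration under stated assumptions, not the code that was committed to toke.c: the names `bom_kind` and `classify_bom` are hypothetical, and treating FF FE 00 00 as UTF-32 rather than as a UTF-16 BOM followed by U+0000 is a heuristic choice.

```c
#include <assert.h>
#include <stddef.h>

enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE,
                BOM_UTF32_LE, BOM_UTF32_BE };

/* Illustrative sketch, not the code from the perl source.
 * Classify a BOM at the start of the half-open range [s, end).
 * Every test checks the remaining length first, so an empty or very
 * short buffer is never read past its end, and embedded NUL bytes are
 * harmless. UTF-32 LE is tested before UTF-16 LE because FF FE 00 00
 * begins with the UTF-16 LE mark FF FE. */
static enum bom_kind classify_bom(const unsigned char *s,
                                  const unsigned char *end)
{
    size_t n = (size_t)(end - s);
    if (n >= 4 && s[0] == 0xFF && s[1] == 0xFE && s[2] == 0 && s[3] == 0)
        return BOM_UTF32_LE;
    if (n >= 4 && s[0] == 0 && s[1] == 0 && s[2] == 0xFE && s[3] == 0xFF)
        return BOM_UTF32_BE;
    if (n >= 3 && s[0] == 0xEF && s[1] == 0xBB && s[2] == 0xBF)
        return BOM_UTF8;
    if (n >= 2 && s[0] == 0xFF && s[1] == 0xFE)
        return BOM_UTF16_LE;
    if (n >= 2 && s[0] == 0xFE && s[1] == 0xFF)
        return BOM_UTF16_BE;
    return BOM_NONE;
}
```

As discussed earlier in the thread, a caller would apply this only at the very start of the input stream; anywhere else U+FEFF is an ordinary ZERO WIDTH NON-BREAKING SPACE.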

p5pRT commented 20 years ago

From The RT System itself

Fixed in the unstable development release perl 5.7.0; hopefully it will also be fixed in perl 5.6.1.