Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

Regex code won't work on non-ASCII platforms #10042

Closed p5pRT closed 9 years ago

p5pRT commented 14 years ago

Migrated from rt.perl.org#71728 (status was 'resolved')

Searchable as RT71728$

p5pRT commented 14 years ago

From @khwilliamson

This is a bug report for perl from khw@khw-desktop.nonet, generated with the help of perlbug 1.39 running under perl 5.11.3.


There are a number of hard-coded constants in the regex code that assume the platform is ASCII.



Flags:
  category=core
  severity=low

Site configuration information for perl 5.11.3:

Configured by khw at Tue Dec 29 12:45:43 MST 2009.

Summary of my perl5 (revision 5 version 11 subversion 3) configuration:
  Commit id: 9f815e241cf04d04fc645970753438216a0ed024
  Platform:
    osname=linux, osvers=2.6.27-16-generic, archname=i686-linux
    uname='linux khw-desktop 2.6.27-16-generic #1 smp tue dec 1 17:56:54 utc 2009 i686 gnulinux '
    config_args='-s -d -Dprefix=/home/khw/fastbleadperl -Dusedevel'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.3.2', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.8.90.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.8.90'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector'

Locally applied patches:


@INC for perl 5.11.3:
  /home/khw/fastbleadperl/lib/site_perl/5.11.3/i686-linux
  /home/khw/fastbleadperl/lib/site_perl/5.11.3
  /home/khw/fastbleadperl/lib/5.11.3/i686-linux
  /home/khw/fastbleadperl/lib/5.11.3
  .


Environment for perl 5.11.3:
  HOME=/home/khw
  LANG=en_US.UTF-8
  LANGUAGE (unset)
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=/home/khw/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/opt/real/RealPlayer:/home/khw/cxoffice/bin
  PERL_BADLANG (unset)
  SHELL=/bin/ksh

p5pRT commented 12 years ago

From @jkeenan

On Tue Dec 29 12:51:53 2009, public@khwilliamson.com wrote:

-----------------------------------------------------------------
There are a number of hard-coded constants in the regex code that assume the platform is ASCII.
-----------------------------------------------------------------

Could you cite specific instances so that people might be encouraged to start looking at them?

Thank you very much. Jim Keenan

p5pRT commented 12 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 12 years ago

From @khwilliamson

In looking at the code under EBCDIC, it appears, for example, that qr/\x{df}/ matches the character whose EBCDIC ordinal is 0xdf (LATIN SMALL LETTER Y WITH DIAERESIS). However, the double-quoted string "\x{df}" is the character whose Latin1 ordinal is 0xdf (LATIN SMALL LETTER SHARP S).
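
The two readings of 0xDF can be reproduced on an ASCII machine with a minimal sketch. It assumes code page 1047 as the EBCDIC variant (the thread does not name one) and uses the core Encode module to stand in for a native EBCDIC perl; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# Reading 1: 0xDF is a native EBCDIC ordinal. Decoding the byte from
# cp1047 (an assumed code page) yields the Unicode character it denotes.
my $native_reading = decode('cp1047', "\xDF");

# Reading 2: 0xDF is a Latin1/Unicode ordinal, which is what chr()
# produces when this script runs on an ASCII platform.
my $latin1_reading = chr(0xDF);

# Print character names rather than the characters themselves, so the
# output does not depend on the terminal encoding.
my $native_name = charnames::viacode(ord $native_reading);
my $latin1_name = charnames::viacode(ord $latin1_reading);

print "as EBCDIC ordinal: $native_name\n";
print "as Latin1 ordinal: $latin1_name\n";
```

Run on an ASCII platform, the two lines name different characters, which is the mismatch described above between the regex and the double-quoted string.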

There are advantages and disadvantages to each interpretation. The documentation indicates that the first one is correct, and I agree, because we should not expect someone whose language is EBCDIC to use a non-native tongue, any more than someone familiar with ASCII should have to think in terms of EBCDIC.

But it is crazy to have opposite interpretations depending on whether it is a double-quoted string or a regex.

I have been thinking some about EBCDIC lately. Anything we do to get it working is essentially useless unless we have long term access to a smoker, unless there is almost no special casing needed for it.

And, almost all of that special casing in the core would go away if the property tables generated by mktables were in EBCDIC instead of Latin1 (the tables being identical above these ranges). It is actually quite easy to change mktables to do this. I believe the only special cases that would remain, among those I am familiar with, would be: translating \N{U+DF} at parse time into the EBCDIC for LATIN SMALL LETTER SHARP S (which is 0x59); turning /[a-z]/ (and similarly tr/a-z/.../) into the proper sub-ranges, since these are not contiguous code points in EBCDIC; and the correct C compile-time generation of UTF-8 vs UTF-EBCDIC strings, which are mostly used in assertions. I think I have figured out some macros that will handle this last case.
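
To illustrate why /[a-z]/ has to be split into sub-ranges, here is a rough sketch that computes the contiguous runs occupied by the lowercase letters in an EBCDIC code page. It is meant to run on an ASCII machine, and the choice of cp1047 (via the core Encode module) is an assumption; it is not how the regex compiler itself does this.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Byte value of each lowercase letter in code page 1047 (assumed variant).
my @ords = map { ord(encode('cp1047', $_)) } 'a' .. 'z';

# Collapse the ordinals into contiguous [first, last] runs.
my @ranges;
for my $o (@ords) {
    if (@ranges && $o == $ranges[-1][1] + 1) {
        $ranges[-1][1] = $o;        # extends the current run
    }
    else {
        push @ranges, [$o, $o];     # gap found: start a new run
    }
}

my $summary = join ', ', map { sprintf '0x%02X-0x%02X', @$_ } @ranges;
print "$summary\n";   # 0x81-0x89, 0x91-0x99, 0xA2-0xA9
```

The three runs correspond to a-i, j-r and s-z. A naive native range from ord("a") to ord("z") would also take in the non-letter code points that sit in the gaps between them.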

I just grepped through the core for EBCDIC, and found some places where it is used. It looks like most of these would go away with a suitable macro.

I have not investigated the implications of making this change on CPAN modules, but I think that the benefits would outweigh any of those costs.

p5pRT commented 12 years ago

From @nwc10

On Sun, Jun 03, 2012 at 09:15:33PM -0600, Karl Williamson wrote:

I have been thinking some about EBCDIC lately. Anything we do to get it working is essentially useless unless we have long term access to a smoker, unless there is almost no special casing needed for it.

Yes. Even without full understanding or direct access, it's much easier to fix things and keep it working if we get prompt feedback when we inadvertently break something.

I have not investigated the implications of making this change on CPAN modules, but I think that the benefits would outweigh any of those costs.

My hunch is that even fewer modules on CPAN work on EBCDIC than work on VMS, an ASCII-based platform. I'd not worry about it.

[And that anyone who wants to cause consternation in authors playing the CPAN stats game should set up a CPAN smoker on VMS :-)]

Nicholas Clark

p5pRT commented 12 years ago

From @craigberry

On Mon, Jun 4, 2012 at 9:04 AM, Nicholas Clark <nick@ccl4.org> wrote:

On Sun, Jun 03, 2012 at 09:15:33PM -0600, Karl Williamson wrote:

I have been thinking some about EBCDIC lately. Anything we do to get it working is essentially useless unless we have long term access to a smoker, unless there is almost no special casing needed for it.

Yes. Even without full understanding or direct access, it's much easier to fix things and keep it working if we get prompt feedback when we inadvertently break something.

I have not investigated the implications of making this change on CPAN modules, but I think that the benefits would outweigh any of those costs.

My hunch is that even fewer modules on CPAN work on EBCDIC than work on VMS, an ASCII-based platform. I'd not worry about it.

[And that anyone who wants to cause consternation in authors playing the CPAN stats game should set up a CPAN smoker on VMS :-)]

Probably a lot of the CPAN failures on VMS are due to the use of Module::Install, which doesn't work. Sometimes they can be corrected by simply writing a proper Makefile.PL.

p5pRT commented 12 years ago

From @jhi

I have been thinking some about EBCDIC lately. Anything we do to get it working is essentially useless unless we have long term access to a smoker, unless there is almost no special casing needed for it.

Yes. Even without full understanding or direct access, it's much easier to fix things and keep it working if we get prompt feedback when we inadvertently break something.

Not saying that having regular EBCDIC smoke wouldn't be wonderful... but there's another angle for the drive (ok, I admit it, my mania) for being able to run in non-Intel-non-Linux: for 99% of the code in CPAN - the code SHOULD NOT BREAK even if it's running in an exotic environment. It's application level code. Perl and the standard libraries should insulate application level code from minutiae like character sets, encodings, word size, byte orders, and such. If the application code cannot do this without breaking, Perl has failed to provide adequate cushioning, abstraction, layering.

-- There is this special biologist word we use for 'stable'. It is 'dead'. -- Jack Cohen

p5pRT commented 12 years ago

From @ap

* Karl Williamson <public@khwilliamson.com> [2012-06-04 05:20]:

The documentation indicates that the first one is correct, and I agree, because we should not expect someone whose language is EBCDIC to use a non-native tongue, any more than someone familiar with ASCII should have to think in terms of EBCDIC.

If I upload code written on a Linux machine to CPAN and the code is then installed on a z/OS machine, what should the \x{df} I wrote in that code mean?

I think we have a problem.

Because of the conflation of bytes and characters, ultimately.

Thus,

But it is crazy to have opposite interpretations depending on whether it is a double-quoted string or a regex.

…it may be equally crazy to have only one interpretation as to have two.

Regards, -- Aristotle Pagaltzis // <http://plasmasturm.org/>

p5pRT commented 12 years ago

From @khwilliamson

On 06/05/2012 12:40 PM, Aristotle Pagaltzis wrote:

* Karl Williamson <public@khwilliamson.com> [2012-06-04 05:20]:

The documentation indicates that the first one is correct, and I agree, because we should not expect someone whose language is EBCDIC to use a non-native tongue, any more than someone familiar with ASCII should have to think in terms of EBCDIC.

If I upload code written on a Linux machine to CPAN and the code is then installed on a z/OS machine, what should the \x{df} I wrote in that code mean?

I think we have a problem.

A module writer who aims to have character set-independent code must take greater care than someone who is writing only for machines which have the same character set as s/he is accustomed to.

We offer \N{} to allow someone to specify a character machine-independently. \x{df} is much more obscure than the character name anyway, but we do offer \N{U+df} to allow you to specify the Unicode codepoint. On EBCDIC machines, that character has ordinal 0x59.
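
As a small sketch of the machine-independent spelling: both \N{} forms below name the same character on any platform, and the core function utf8::unicode_to_native() reports the native ordinal that character has locally. The variable names are illustrative.

```perl
use strict;
use warnings;
use charnames ':full';   # needed for \N{NAME} on perls older than 5.16

# Two platform-independent ways to write the same character.
my $by_name      = "\N{LATIN SMALL LETTER SHARP S}";
my $by_codepoint = "\N{U+DF}";

# Native ordinal of Unicode code point U+00DF on the current platform
# (0xDF on ASCII platforms; 0x59 on EBCDIC according to the comment above).
my $native_ord = utf8::unicode_to_native(0xDF);

# The same spelling also works inside a regex.
my $regex_matches = ($by_name =~ /\N{U+DF}/) ? 1 : 0;

printf "same character: %s\n", ($by_name eq $by_codepoint ? 'yes' : 'no');
printf "native ordinal: 0x%02X\n", $native_ord;
printf "regex matches:  %s\n", ($regex_matches ? 'yes' : 'no');
```

A bare "\x{df}" in the same position would instead be tied to whatever the native character set assigns to 0xDF.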

The downside of \N{} is that it always converts its containing string to UTF-8; otherwise there is a chance that strings containing ordinals 128-255 won't be interpreted correctly. This is a holdover from before the unicode_strings feature came along, but I believe it to be premature to remove it.
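
The UTF-8 conversion can be observed with utf8::is_utf8(), which reports a string's internal representation. This is only a sketch: it assumes an ASCII platform (so that \xE9 and U+00E9 are the same character), and the flag values it prints reflect the behaviour described above, which may differ on other perl versions.

```perl
use strict;
use warnings;

my $via_hex   = "caf\xE9";       # \x escape below 0x100
my $via_named = "caf\N{U+E9}";   # same character written with \N{}

# utf8::is_utf8() reports the internal representation only; it says
# nothing about which characters the string contains.
my $hex_flag   = utf8::is_utf8($via_hex)   ? 1 : 0;
my $named_flag = utf8::is_utf8($via_named) ? 1 : 0;

# On an ASCII platform the two strings hold the same characters,
# whatever their internal storage.
my $same = ($via_hex eq $via_named) ? 1 : 0;

print "UTF-8 flag with \\x:   $hex_flag\n";
print "UTF-8 flag with \\N{}: $named_flag\n";
print "strings compare equal: $same\n";
```

The comparison succeeds because eq works on characters, not on the internal encoding; the cost of \N{} is the storage change, not a change in the string's value.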

Because of the conflation of bytes and characters, ultimately.

Thus,

But it is crazy to have opposite interpretations depending on whether it is a double-quoted string or a regex.

…it may be equally crazy to have only one interpretation as to have two.

Much of our test code assumes that ord("A") is 193 on an EBCDIC machine. One would hope that chr(ord("A")) is "A" no matter what the platform, thus chr(193) should be "A", as should therefore chr(0xC1). Hence, "\xC1" should be "A", and /\xC1/ should also match "A". Otherwise, it's crazy.
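
The chain of reasoning above can be written as a platform-neutral check. In this sketch the \x escapes are built at run time from ord("A"), so the same source exercises \x41 on ASCII and \xC1 on EBCDIC; the string evals exist purely for that illustration.

```perl
use strict;
use warnings;

my $ord = ord('A');               # 65 on ASCII; 193 on EBCDIC per the text
my $hex = sprintf '%02X', $ord;   # "41" or "C1"

# chr() must invert ord() on every platform.
my $round_trip = (chr($ord) eq 'A') ? 1 : 0;

# Compile "\xNN" and /\xNN/ for the native ordinal of "A".
my $string_escape = eval qq{"\\x$hex"};
my $regex_matches = eval(qq{'A' =~ /\\x$hex/}) ? 1 : 0;

print "chr(ord('A')) eq 'A': $round_trip\n";
print "\"\\x$hex\" eq 'A':       ", ($string_escape eq 'A' ? 1 : 0), "\n";
print "'A' =~ /\\x$hex/:       $regex_matches\n";
```

All three checks agreeing is the consistency being argued for: the double-quoted string and the regex interpret \xNN the same way.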

p5pRT commented 9 years ago

From @khwilliamson

There are now no known EBCDIC-specific regex bugs -- Karl Williamson

p5pRT commented 9 years ago

@khwilliamson - Status changed from 'open' to 'pending release'

p5pRT commented 9 years ago

From friedberg@exs.esb.com

To Karl and the list,

I'm a lurker and user of OpenVMS. I've used IBM mainframes in the past, but not for many years.

I'm amazed by the effort you and a few others have put into bringing z/OS back to life as far as Perl is concerned.

Congratulations, and thanks to all of you for so much hard work!

Carl Friedberg
friedberg@esb.com
(212) 798-0718
www.esb.com
The Elias Book of Baseball Records 2015 Edition

-----Original Message-----
From: Karl Williamson via RT [mailto:perlbug-followup@perl.org]
Sent: Monday, March 23, 2015 9:08 AM
Cc: perl5-porters@perl.org
Subject: [perl #71728] Regex code won't work on non-ASCII platforms

There are now no known EBCDIC-specific regex bugs -- Karl Williamson


via perlbug: queue: perl5 status: open
https://rt-archive.perl.org/perl5/Ticket/Display.html?id=71728

p5pRT commented 9 years ago

From @khwilliamson

Thanks for submitting this ticket.

The issue should be resolved with the release today of Perl v5.22. If you find that the problem persists, feel free to reopen this ticket.

-- Karl Williamson for the Perl 5 porters team

p5pRT commented 9 years ago

@khwilliamson - Status changed from 'pending release' to 'resolved'