Perl / perl5


no utf8 breaks regexey #3532

Closed p5pRT closed 20 years ago

p5pRT commented 23 years ago

Migrated from rt.perl.org#5982 (status was 'resolved')

Searchable as RT5982$

p5pRT commented 23 years ago

From root@plan9.de

This:

  $x = "a\x{1234}"; no utf8; $x =~ m/\w/;

segfaults in regexec.c because it calls a swash function with PL_utf8_alnum, which is NULL at that point. It seems that regcomp.c contains some hacks that call Perl_is_utf8_alnum to initialize the swash, but these are only called when "use utf8" is in effect.

In the above example, it never gets called, hence the NULL pointer access.
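The failure pattern described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the actual perl source: `class_table`, `g_alnum`, `compile_pattern` and `match_word_char` are hypothetical names, with `g_alnum` playing the role of PL_utf8_alnum. The match step returns -1 at the point where the real code dereferenced the NULL pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a swash: a lookup table for one character class. */
typedef struct {
    int loaded;
} class_table;

static class_table alnum_storage;
static class_table *g_alnum = NULL;   /* plays the role of PL_utf8_alnum */

/* Compile step: sets up the table only when the utf8 pragma is in effect. */
static void compile_pattern(int utf8_pragma)
{
    if (utf8_pragma) {
        alnum_storage.loaded = 1;
        g_alnum = &alnum_storage;
    }
}

/* Match step: needs the table whenever the data is UTF-8, regardless of the
 * pragma. Returns 0 on the byte path, 1 when the table is usable, and -1
 * where the real code would have dereferenced NULL. */
static int match_word_char(int data_is_utf8)
{
    if (!data_is_utf8)
        return 0;
    if (g_alnum == NULL)
        return -1;
    return g_alnum->loaded;
}

/* Walks through the sequence from the report. */
static void demo(void)
{
    compile_pattern(0);                  /* "no utf8": table never set up */
    assert(match_word_char(1) == -1);    /* UTF-8 data still reaches the table */
    compile_pattern(1);                  /* "use utf8": table initialized */
    assert(match_word_char(1) == 1);
}
```

Calling `demo()` from a `main` function exercises both paths. The mismatch is that initialization depends on the pragma while use depends on the data.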

Perl Info

```
Flags:
    category=core
    severity=medium

Site configuration information for perl v5.7.0:

Configured by root at Tue Mar 6 16:58:04 CET 2001.

Summary of my perl5 (revision 5.0 version 7 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.2, archname=i686-linux
    uname='linux cerebro 2.2.17 #1 smp mon oct 16 00:47:15 cest 2000 i686 unknown '
    config_args=''
    hint=previous, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
  Compiler:
    cc='gcc', ccflags ='-I/usr/local/include -I/usr/app/include -I/opt/include',
    optimize='-g -O3 -fno-omit-frame-pointer -mpentiumpro',
    cppflags='-I/usr/local/include -I/usr/app/include -I/opt/include'
    ccversion='', gccversion='2.95.2.1 19991024 (release)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=4
    alignbytes=4, usemymalloc=y, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags ='-L/usr/local/lib -L/opt/lib'
    libpth=/usr/local/lib /lib /usr/lib /opt/lib
    libs=-lc -ldl -lm -lcrypt
    perllibs=-lc -ldl -lm -lcrypt
    libc=/lib/libc-2.1.3.so, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib -L/opt/lib'

Locally applied patches:
    DEVEL9021

@INC for perl v5.7.0:
    /usr/app/lib/perl5
    /usr/app/lib/perl5
    /usr/app/lib/perl5
    /usr/app/lib/perl5
    .

Environment for perl v5.7.0:
    HOME=/root
    LANG (unset)
    LANGUAGE (unset)
    LC_CTYPE=de_DE
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/root/s:/opt/qt/bin:/bin:/usr/bin:/usr/app/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/usr/app/bin:/usr/app/sbin:/usr/X11/bin:/opt/jdk118/bin:/opt/bin:/opt/sbin:.:/root/cc/dejagnu/bin
    PERLDB_OPTS=ornaments=0
    PERL_BADLANG (unset)
    SHELL=/bin/bash
```

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

pcg@goof.com writes:

> This is a bug report for perl from pcg@goof.com, generated with the help of perlbug 1.33 running under perl v5.7.0.
>
> [Please enter your report here]
>
> This:
>
>   $x = "a\x{1234}"; no utf8; $x =~ m/\w/;
>
> segfaults in regexec.c because it calls a swash function with

Ick. More to the point

  cd bleadperl
  ./perl -Ilib -e ' $x = "a\x{1234}";$x =~ m/\w/;'

Does it too.

Pondering why regexec.c is calling swash_fetch() rather than is_utf8_alnum(), one realizes that it is because swashes use modules and we need regexps to load them - oh dear ...

> PL_utf8_alnum, which is NULL at that point. It seems that regcomp.c contains some hacks that call Perl_is_utf8_alnum to initialize the swash, but these are only called when "use utf8" is in effect.

Which made sense once, but no longer. Until we re-design swashes it seems we always have to init them in regcomp.c

> In the above example, it never gets called, hence the NULL pointer access.


p5pRT commented 23 years ago

From @jhi

On Tue, Mar 06, 2001 at 09:15:18PM +0000, nick@ing-simmons.net wrote:

> pcg@goof.com writes:
>
> > This is a bug report for perl from pcg@goof.com, generated with the help of perlbug 1.33 running under perl v5.7.0.
> >
> > [Please enter your report here]
> >
> > This:
> >
> >   $x = "a\x{1234}"; no utf8; $x =~ m/\w/;
> >
> > segfaults in regexec.c because it calls a swash function with
>
> Ick. More to the point
>
>   cd bleadperl
>   ./perl -Ilib -e ' $x = "a\x{1234}";$x =~ m/\w/;'
>
> Does it too.
>
> Pondering why regexec.c is calling swash_fetch() rather than is_utf8_alnum(), one realizes that it is because swashes use modules and we need regexps to load them - oh dear ...

Yup. Using regexps to load regexps might be cute and easier to implement, but other than that I'm not that enthused about it.

I do actually hope that Ilya's recent couple of regex-Unicode patches on 5.6.1-to-be will work on bleadperl too. Since Ilya's patches are basically much, much smaller than my big 'polymorphic' patch, there's much less chance of old hacks like the swashes breaking.

> > PL_utf8_alnum, which is NULL at that point. It seems that regcomp.c contains some hacks that call Perl_is_utf8_alnum to initialize the swash, but these are only called when "use utf8" is in effect.
>
> Which made sense once, but no longer. Until we re-design swashes it seems we always have to init them in regcomp.c

p5pRT commented 23 years ago

From @jhi

> Which made sense once, but no longer. Until we re-design swashes it seems we always have to init them in regcomp.c

Not quite. I have now added a workaround to regexec.c (patch #9098): essentially, the PL_utf8_blah are now really loaded on demand, on the spot.
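The on-demand idea can be sketched as follows. This is a minimal illustration, assuming a hypothetical accessor `get_alnum()` and loader `load_alnum_table()`; it is not the content of patch #9098. The table is loaded at the point of use if it has not been loaded yet, so the match step no longer depends on what the compile step did.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a swash: a lookup table for one character class. */
typedef struct {
    int loaded;
} class_table;

static class_table alnum_storage;
static class_table *g_alnum = NULL;   /* plays the role of PL_utf8_alnum */
static int load_count = 0;            /* counts how often the loader runs */

/* Hypothetical loader; the real one would read and parse table files. */
static class_table *load_alnum_table(void)
{
    alnum_storage.loaded = 1;
    load_count++;
    return &alnum_storage;
}

/* Accessor: loads the table on first use, then reuses it. */
static class_table *get_alnum(void)
{
    if (g_alnum == NULL)
        g_alnum = load_alnum_table();
    return g_alnum;
}

/* Match step: always goes through the accessor, so it never sees NULL. */
static int match_word_char(void)
{
    return get_alnum()->loaded;
}

/* Two matches in a row trigger exactly one load. */
static void demo(void)
{
    assert(g_alnum == NULL);
    assert(match_word_char() == 1);
    assert(match_word_char() == 1);
    assert(load_count == 1);
}
```

The cost of this design is a NULL check on every access; in exchange, no caller has to remember to initialize the table first.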

Incidentally, I also took a look at using Ilya's recent simple regex utf8 patches instead of my humongous polymorphism patches. I would at least like to give the new patches a try -- but it looks damned hard to go back after 850 or so patches (about 30 directly touching reg*.[hc]), especially because of the other ongoing Unicode work: many of the patches touching reg*.[hc] also touch Unicode in general, and vice versa (extra fun supplied by the curious trick that was part of this very bug 20010306.008: that using things like \w under Unicode requires runtime loading of files and parsing their contents, using regexen...)

I just now tried for several hours to unpatch and patch backwards, forwards, and sideways, without getting anywhere near functional regex code (not even venturing into utf8 lands), so I gave up. I'm not saying there's no going back to the old pre-polymorphic (patch #8143) regex code, I'm just saying that getting there is more pain than I am prepared to face, especially since what we have now seems to work, for large enough values of "work".

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

Jarkko Hietaniemi jhi@iki.fi writes:

> Incidentally, I also took a look at using Ilya's recent simple regex utf8 patches instead of my humongous polymorphism patches. I would at least like to give the new patches a try -- but it looks damned hard to go back after 850 or so patches (about 30 directly touching reg*.[hc]), especially because of the other ongoing Unicode work: many of the patches touching reg*.[hc] also touch Unicode in general, and vice versa (extra fun supplied by the curious trick that was part of this very bug 20010306.008: that using things like \w under Unicode requires runtime loading of files and parsing their contents, using regexen...)

Do Ilya's patches work well in the 5.6 branch?

I ask because regexp/regcomp are a remaining area that needs attention in the EBCDIC work. I _think_ the regexp stuff is intended to work in Unicode space rather than warped space in the EBCDIC case. But I don't want to waste my efforts figuring out how it works if how it works is going to change.

> I just now tried for several hours to unpatch and patch backwards, forwards, and sideways, without getting anywhere near functional regex code (not even venturing into utf8 lands), so I gave up.

Is it worth copying in the 5.6+Ilya reg*.[ch] and seeing what happens? (Or is that one of the things you tried?)

> I'm not saying there's no going back to the old pre-polymorphic (patch #8143) regex code, I'm just saying that getting there is more pain than I am prepared to face, especially since what we have now seems to work, for large enough values of "work".

True.

p5pRT commented 23 years ago

From @jhi

On Sun, Mar 11, 2001 at 05:49:42PM +0000, nick@ing-simmons.net wrote:

> Jarkko Hietaniemi jhi@iki.fi writes:
>
> > Incidentally, I also took a look at using Ilya's recent simple regex utf8 patches instead of my humongous polymorphism patches. I would at least like to give the new patches a try -- but it looks damned hard to go back after 850 or so patches (about 30 directly touching reg*.[hc]), especially because of the other ongoing Unicode work: many of the patches touching reg*.[hc] also touch Unicode in general, and vice versa (extra fun supplied by the curious trick that was part of this very bug 20010306.008: that using things like \w under Unicode requires runtime loading of files and parsing their contents, using regexen...)
>
> Do Ilya's patches work well in the 5.6 branch?

I tried them and they work for almost all the new related tests (Ilya made the patches for the 5.6 branch, so no surprise that they work there). There were still some rough edges (and I think Ilya mentioned a couple of areas he didn't think would work quite 100% yet).

> I ask because regexp/regcomp are a remaining area that needs attention in the EBCDIC work. I _think_ the regexp stuff is intended to work in Unicode space rather than warped space in the EBCDIC case. But I don't want to waste my efforts figuring out how it works if how it works is going to change.

Whether Ilya's patches end up in the 5.6 branch is Sarathy's call.

> > I just now tried for several hours to unpatch and patch backwards, forwards, and sideways, without getting anywhere near functional regex code (not even venturing into utf8 lands), so I gave up.
>
> Is it worth copying in the 5.6+Ilya reg*.[ch] and seeing what happens? (Or is that one of the things you tried?)

It was one of the many things I tried.

> > I'm not saying there's no going back to the old pre-polymorphic (patch #8143) regex code, I'm just saying that getting there is more pain than I am prepared to face, especially since what we have now seems to work, for large enough values of "work".
>
> True.

p5pRT commented 23 years ago

From @gsar

On Sun, 11 Mar 2001 23:10:25 CST, Jarkko Hietaniemi wrote:

> On Sun, Mar 11, 2001 at 05:49:42PM +0000, nick@ing-simmons.net wrote:
>
> > I ask because regexp/regcomp are a remaining area that needs attention in the EBCDIC work. I _think_ the regexp stuff is intended to work in Unicode space rather than warped space in the EBCDIC case. But I don't want to waste my efforts figuring out how it works if how it works is going to change.
>
> Whether Ilya's patches end up in the 5.6 branch is Sarathy's call.

Given the cited uncertainties and the patch coming as it did very late in the trial phase of 5.6.1, I'm inclined to put it off until 5.6.2. It's not like the Unicode stuff is ever going to be production-worthy in 5.6.x without I/O layers/disciplines.

I might change my mind if someone in the know about the RE engine can convince me that the patch is rock solid, and won't unduly delay 5.6.1 by needing incremental embellishments of the sort that keep needing more testing.

Sarathy
gsar@ActiveState.com