Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

regexp matching against utf-8 strings #5029

Closed p5pRT closed 21 years ago

p5pRT commented 22 years ago

Migrated from rt.perl.org#8516 (status was 'resolved')

Searchable as RT8516$

p5pRT commented 22 years ago

From grossjoh@lothlorien.cs.uni-dortmund.de

Created by Kai.Grossjohann@cs.uni-dortmund.de

Depending on the contents of a string and a regexp, matching does not always succeed. Please see the following:

#!/usr/sw/perl/default/bin/perl

my ($s, $re);

$s = chr(4711) . chr(200) . chr(4711) . chr(200);
$re = chr(200) . '.' . chr(200);

if ($s =~ m/$re/) {
  print "ok\n";
} else {
  print "fail\n";
}

$re .= chr(4711);
chop($re);

if ($s =~ m/$re/) {
  print "ok\n";
} else {
  print "fail\n";
}

fail
ok

I think that the regular expression matching code should look at the string comprising the regexp and at the string being matched against and make sure that they are both in the same encoding.

In the meantime, maybe there is a way for me to (efficiently) frob the regular expression and/or string to ensure both are in UTF-8 encoding?

Thanks,
Kai
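A minimal sketch of the workaround the report itself hints at: appending a character above 255 and chopping it off again leaves the string in Perl's internal UTF-8 representation. The helper name `force_utf8` is hypothetical and not part of Perl; the behaviour on 5.6.1 is taken from the output reported above rather than verified independently.

```perl
use strict;
use warnings;

# Hypothetical helper (not a core function): return a copy of the string
# that is stored in Perl's internal UTF-8 representation.
sub force_utf8 {
    my ($str) = @_;
    $str .= chr(4711);   # appending a character above 255 upgrades the string
    chop $str;           # drop it again; the upgraded representation remains
    return $str;
}

my $s  = chr(4711) . chr(200) . chr(4711) . chr(200);
my $re = force_utf8(chr(200) . '.' . chr(200));

# Both operands are now UTF-8 encoded internally, so the match no longer
# depends on how the pattern string happened to be built.
print $s =~ m/$re/ ? "ok\n" : "fail\n";
```

The string contents are unchanged by the helper; only the internal storage differs, which is the property the second test in the report relies on.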

Perl Info

```
Flags:
    category=core
    severity=medium

Site configuration information for perl v5.6.1:

Configured by goevert at Thu Jul 5 17:09:15 CEST 2001.

Summary of my perl5 (revision 5.0 version 6 subversion 1) configuration:
  Platform:
    osname=linux, osvers=2.2.19, archname=i686-linux
    uname='linux schulz 2.2.19 #1 fri may 11 10:50:23 mest 2001 i686 unknown '
    config_args='-der -Dcc=gcc -Doptimize=-O3 -Dloclibpth=/usr/sw/wais/default/lib /usr/sw/xml/default/lib /usr/sw/db/default/lib -Dlocincpth=/usr/sw/wais/default/lib /usr/sw/xml/default/include /usr/sw/db/default/include -Dprefix=/usr/sw/perl/5.6.1 -Darchlib=/usr/sw/perl/5.6.1/lib -Dprivlib=/usr/sw/perl/5.6.1/lib -Dsitelib=/usr/sw/perl/5.6.1/lib/site_perl -Dsitearch=/usr/sw/perl/5.6.1/lib/site_perl -Dmydomain=.cs.uni-dortmund.de -Dcf_email=goevert@ls6.cs.uni-dortmund.de -Dperladmin=goevert@ls6.cs.uni-dortmund.de -Ubincompat5005'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=undef d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
  Compiler:
    cc='gcc', ccflags ='-fno-strict-aliasing -I/usr/sw/wais/default/lib -I/usr/sw/xml/default/include -I/usr/sw/db/default/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O3',
    cppflags='-fno-strict-aliasing -I/usr/sw/wais/default/lib -I/usr/sw/xml/default/include -I/usr/sw/db/default/include'
    ccversion='', gccversion='2.95.2 20000220 (Debian GNU/Linux)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/sw/wais/default/lib -L/usr/sw/xml/default/lib -L/usr/sw/db/default/lib -L/usr/lib -L/usr/X11R6/lib'
    libpth=/usr/sw/wais/default/lib /usr/sw/xml/default/lib /usr/sw/db/default/lib /lib /usr/lib /usr/local/lib
    libs=-lnsl -lgdbm -ldbm -ldb -ldl -lm -lc -lposix -lcrypt -lutil
    perllibs=-lnsl -ldl -lm -lc -lposix -lcrypt -lutil
    libc=/lib/libc-2.1.3.so, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/sw/wais/default/lib -L/usr/sw/xml/default/lib -L/usr/sw/db/default/lib -L/usr/lib -L/usr/X11R6/lib'

Locally applied patches:

@INC for perl v5.6.1:
    /usr/sw/pilot/null.0/lib/site_perl
    /usr/sw/perl/5.6.1/lib
    /usr/sw/perl/5.6.1/lib/site_perl
    /usr/sw/perl/5.6.1/lib/site_perl
    /usr/sw/perl/5.6.1/lib/site_perl
    .

Environment for perl v5.6.1:
    HOME=/home-local/grossjoh
    LANG=C
    LANGUAGE (unset)
    LD_LIBRARY_PATH=/usr/sw/pilot/null.0/lib:
    LOGDIR (unset)
    PATH=/home-local/grossjoh/bin:/home-local/grossjoh/sw/emacs-21.0/bin:/usr/sw/netscape/6.1:/usr/sw/perl/5.6.1/bin:/usr/sw/xml/null.0/bin:/usr/sw/pilot/null.0/bin:/usr/sw/xtea/1.3/bin:/usr/sw/xcb/2.2/bin:/usr/sw/xlogout/1.1/bin:/usr/sw/wais/2.2.14/bin:/usr/sw/ctwm/3.5.2c/bin:/usr/sw/plan/1.8.4/bin:/usr/sw/xdu/3.0/bin:/usr/sw/finger/1.37/bin:/usr/sw/java/jdk1.3.1/bin:/usr/sw/glimpse/4.13/bin:/usr/sw/xalarm/default/bin:/usr/sw/xpostit/3.3.2/bin:/usr/local/bin:/usr/bin:/usr/sbin:/bin:/sbin:/usr/X11R6/bin:/usr/games:/usr/sbin
    PERLLIB=/usr/sw/pilot/null.0/lib/site_perl:
    PERL_BADLANG (unset)
    SHELL=/opt/local/i06/bin/bash
```

p5pRT commented 22 years ago

From @andk

On Wed, 13 Feb 2002 15:37:11 +0100 (CET), grossjoh@lothlorien.cs.uni-dortmund.de (Kai Grossjohann) said:

  > This is a bug report for perl from Kai.Grossjohann@cs.uni-dortmund.de,
  > generated with the help of perlbug 1.33 running under perl v5.6.1.

  > -----------------------------------------------------------------
  > [Please enter your report here]

  > Depending on the contents of a string and a regexp, matching does not
  > always succeed. Please see the following:

  > #!/usr/sw/perl/default/bin/perl

  > my ($s, $re);

  > $s = chr(4711) . chr(200) . chr(4711) . chr(200);

I suppose you meant

  $s = chr(200) . chr(4711) . chr(200) . chr(4711);

Otherwise I cannot confirm your bug report for any perl. For the case I presume, though, I can confirm it for 5.6.1.

This particular bug has been fixed in the current development branch of perl, somewhere between patch 8130 and 8375. I have the impression it was not a single patch that fixed the bug. Anyway, this was long before 5.7.1 and 5.7.2 came out. If you can switch to bleadperl (see man perlhack for download locations), please do; it has many UTF-8 bugs fixed and only few bugs open. Otherwise I fear you should avoid everything related to Unicode in 5.6.1.

-- andreas

p5pRT commented 22 years ago

From [Unknown Contact. See original ticket]

andreas.koenig@anima.de (Andreas J. Koenig) writes:

  > This particular bug has been fixed in the current development branch of perl, somewhere between patch 8130 and 8375. I have the impression it was not a single patch that fixed the bug. Anyway, this was long before 5.7.1 and 5.7.2 came out. If you can switch to bleadperl (see man perlhack for download locations), please do, it has many UTF-8 bugs fixed and only few bugs open. Otherwise I fear you should avoid everything related to Unicode in 5.6.1.

Thanks a lot for that hint. I wonder whether I can get the local Perl guru to install that...

But it's not absolutely necessary -- I found a way to encode my data which does not rely on Unicode.

It's always great to hear a bug has been fixed :-)

kai
--
~/.signature is: umop 3p!sdn (Frank Nobis)

p5pRT commented 21 years ago

From @jhi

This bug has been resolved by Perl 5.8.0, marking the problem ticket as resolved.
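For readers who want to check the fixed behaviour themselves, here is a rough sketch, assuming Perl 5.8.1 or later (where `utf8::upgrade` and `utf8::is_utf8` are both available). It builds the pattern from the original report in both internal representations and matches each against the same string. The variable names `$re_native` and `$re_utf8` are illustrative only.

```perl
use strict;
use warnings;

my $s = chr(4711) . chr(200) . chr(4711) . chr(200);

# Pattern from the report: contains no character above 255, so Perl keeps it
# in the native single-byte representation.
my $re_native = chr(200) . '.' . chr(200);

# Same characters, explicitly converted to the internal UTF-8 representation.
my $re_utf8 = $re_native;
utf8::upgrade($re_utf8);

# On a perl with the fix, the match result should not depend on the
# internal representation of the pattern string.
for my $re ($re_native, $re_utf8) {
    printf "%-6s %s\n",
        utf8::is_utf8($re) ? 'utf8' : 'native',
        $s =~ m/$re/       ? 'ok'   : 'fail';
}
```

If both lines report `ok`, the pattern's internal encoding no longer affects matching, which is the behaviour the original report asked for.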

p5pRT commented 21 years ago

@jhi - Status changed from 'open' to 'resolved'