Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

tr/...//CU core dumps #1945

Closed p5pRT closed 20 years ago

p5pRT commented 24 years ago

Migrated from rt.perl.org#3215 (status was 'resolved')

Searchable as RT3215$

p5pRT commented 24 years ago

From mschilli@perlmeister.com

Created by mschilli1@aol.com

This is a bug report for perl from mschilli1@aol.com, generated with the help of perlbug 1.28 running under perl v5.6.0.

-----------------------------------------------------------------

UTF8 support for the tr// operator doesn't seem to work properly. The following snippet should, as advertised in 'perldoc perlunicode', convert $string from latin1 to utf8:

    while (<>) {
        tr/\0-\xff//CU;   # latin1 char to utf8
    }

It throws two (compile-time) warnings:

    Malformed UTF-8 character at ./t line 4.
    Malformed UTF-8 character at ./t line 4.

And the snippet below, when presented with latin1 chars, throws a "Segmentation fault (core dumped)":

    $latin1 = "Abc ääää";
    ($utf8 = $latin1) =~ tr/\0-\0177//CU;

Would be great if you guys could take a look.

Thanks,

-- Mike Schilli

Perl Info

```
Flags:
    category=core
    severity=high

Site configuration information for perl v5.6.0:

Configured by mschilli at Sun Mar 26 23:14:38 PST 2000.

Summary of my perl5 (revision 5.0 version 6 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.2.12-20, archname=i686-linux
    uname='linux www.noevalley.com 2.2.12-20 #1 mon sep 27 10:40:35 edt 1999 i686 unknown '
    config_args='-d -D prefix=/home/mschilli/PERL-5.6.0 -e'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=undef d_sfio=undef uselargefiles=define
    use64bitint=undef use64bitall=undef uselongdouble=undef usesocks=undef
  Compiler:
    cc='cc', optimize='-O2', gccversion=egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)
    cppflags='-fno-strict-aliasing -I/usr/local/include'
    ccflags ='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    stdchar='char', d_stdstdio=define, usevfork=false
    intsize=4, longsize=4, ptrsize=4, doublesize=8
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lndbm -lgdbm -ldb -ldl -lm -lc -lposix -lcrypt
    libc=/lib/libc-2.1.2.so, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:

@INC for perl v5.6.0:
    /home/mschilli/PERL-5.6.0/lib/perl5/5.6.0/i686-linux
    /home/mschilli/PERL-5.6.0/lib/perl5/5.6.0
    /home/mschilli/PERL-5.6.0/lib/perl5/site_perl/5.6.0/i686-linux
    /home/mschilli/PERL-5.6.0/lib/perl5/site_perl/5.6.0
    /home/mschilli/PERL-5.6.0/lib/perl5/site_perl
    .

Environment for perl v5.6.0:
    HOME=/home/mschilli
    LANG=en_US
    LANGUAGE (unset)
    LC_ALL=en_US
    LD_LIBRARY_PATH=/usr/local/lib:/home/mschilli/download/xerces-c_1_0_0-linux/lib
    LOGDIR (unset)
    PATH=/usr/local/prod/bin:/home/mschilli/PERL/bin:/home/mschilli/teTeX/bin:/home/cm/bin/Linux:/bin:/usr/bin:/home/cm/bin/ksh:/home/cm/bin/ksh/prm:/home/cm/bin/linux:/usr/local/bin:/bin:/usr/bin:/usr/X11/bin:/usr/andrew/bin:/usr/openwin/bin:/usr/games:.:~/bin:/sbin:/services/bsi/bin:./bin:../bin:/home/mschilli/download/xerces-c_1_0_0-linux/bin:/usr/X11R6/bin:/opt/local/bin:/home/mschilli/INSTALL/framemaker/FM556_linux/bin:
    PERL_BADLANG (unset)
    SHELL=/bin/bash
```

p5pRT commented 24 years ago

From @simoncozens

> The following snippet should, as advertised in 'perldoc perlunicode', convert $string from latin1 to utf8:
>
>     while (<>) {
>         tr/\0-\xff//CU;   # latin1 char to utf8
>     }

Bleh. Yes, it should, but toke.c is incorrectly marking the left-hand side of that expression as being a Unicode string; if you say tr/\0-\xff//UC, it marks it as being non-Unicode. pmtrans actually expects a range of the form "Unicode char255 Unicode" even if it's converting C->U. Currently, it only Unicodifies if you're doing UC, so the right fix is to get toke.c to treat CU the same as UC and not expand the range but convert the LHS to Unicode.

This does that:

Inline Patch

```diff
--- toke.c~ Mon May 08 14:38:48 2000
+++ toke.c Mon May 08 14:38:29 2000
@@ -1448,7 +1448,7 @@
 		    }
 		}
-		if (thisutf || uv > 255) {
+		if (utf || uv > 255) {
 		    d = (char*)uv_to_utf8((U8*)d, uv);
 		    has_utf = TRUE;
 		}
```

I then tried this:

    #!/usr/bin/perl -w
    use Devel::Peek;

    $unistr = v300.202.203;
    Dump($unistr);
    ($bytestr=$unistr) =~ tr/\0-\x{ff}//UC;
    Dump($bytestr);
    ($unistr2=$bytestr) =~ tr/\0-\xff//CU;
    Dump($unistr2);

And got:

    SV = PV(0xa04142c) at 0xa053c98
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0xa048578 "\304\254\303\212\303\213"\0
      CUR = 6
      LEN = 7
    SV = PV(0xa041480) at 0xa058fe0
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0xa048550 ",\312\313"\0
      CUR = 3
      LEN = 7
    SV = PV(0xa04151c) at 0xa06c3b8
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0xa0487f0 ",\303\212\303\213"\0
      CUR = 5
      LEN = 6

Which is fine apart from the fact that, amusingly, tr///CU fails to set the SvUTF8 flag on the result (note the missing UTF8 in the last FLAGS line). This patch fixes that:

Inline Patch

```diff
--- doop.c~ Mon May 08 15:23:34 2000
+++ doop.c Mon May 08 15:24:46 2000
@@ -321,6 +321,7 @@
     }
     *d = '\0';
     sv_usepvn_mg(sv, (char*)dst, d - dst);
+    SvUTF8_on(sv);
     return matches;
 }
@@ -389,6 +390,7 @@
     }
     *d = '\0';
     sv_usepvn_mg(sv, (char*)dst, d - dst);
+    SvUTF8_on(sv);
     return matches;
 }
```

And it now all plays nicely.
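For readers on a later perl, the effect of the UTF8 flag can be observed without Devel::Peek. The following is only a minimal sketch, assuming perl 5.8.1 or newer, where `utf8::upgrade` and `utf8::is_utf8` are built in; neither existed in 5.6.0, so this is not what the patches above do, and the variable names are illustrative. It shows the outcome tr/\0-\xff//CU was aiming for: the same characters, stored internally as UTF-8, with the flag set.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Latin-1 bytes: "Abc " followed by four a-umlaut characters (0xE4).
# A literal built only from \x escapes below 0x100 starts out without the UTF8 flag.
my $latin1 = "Abc \xe4\xe4\xe4\xe4";

# Copy the string, then switch the copy's internal storage to UTF-8.
# utf8::upgrade also turns the UTF8 flag on.
my $upgraded = $latin1;
utf8::upgrade($upgraded);

printf "flag on original: %s\n", utf8::is_utf8($latin1)   ? "on" : "off";
printf "flag on upgraded: %s\n", utf8::is_utf8($upgraded) ? "on" : "off";

# Character semantics are unchanged: both strings hold the same 8 characters.
printf "same characters:  %s\n", $latin1 eq $upgraded ? "yes" : "no";
```

Without the flag, perl would treat the two-byte sequences as separate Latin-1 characters, which is the symptom visible in the last Dump above.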

I am working on making UTF8 treatment the default and deprecating utf8.pm; demand-loading the tables at the right place is the tricky bit.

> And the snippet below, when presented with latin1 chars, throws a "Segmentation fault (core dumped)":

Yep, I reported that before. Looks like it's fixed in perl-current.

> UTF8 support for the tr// operator doesn't seem to work properly.

Does now. :)

Simon


p5pRT commented 24 years ago

From @gsar

On Mon, 08 May 2000 15:21:10 +0900, simon.p.cozens@jp.pwcglobal.com wrote:

> > UTF8 support for the tr// operator doesn't seem to work properly.
>
> Does now. :)

Please note: Larry wants tr///CU/UC removed entirely rather than fixed, since it is a rather limiting interface. The intent is to replace it with Unicode::Map. If you have tuits to help integrate that into the distribution, let me know.

Sarathy
gsar@ActiveState.com

p5pRT commented 20 years ago

From The RT System itself

The tr///CU feature has been *removed* in 5.7.0, and will also be removed in 5.6.1, because the interface was a mistake. For similar functionality there is the new pack('U0', ...).
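As a hedged aside for anyone landing here from a search: on perl 5.8 and later, the Encode module is the commonly used way to do the latin1-to-utf8 conversion from the original report. This is an assumption beyond what the thread itself states (the thread only mentions pack('U0', ...)), and the variable names are illustrative. A minimal sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Input: raw Latin-1 bytes, "Abc " followed by four a-umlauts (0xE4).
my $latin1_bytes = "Abc \xe4\xe4\xe4\xe4";

# Step 1: decode the bytes into a string of characters.
my $chars = decode('ISO-8859-1', $latin1_bytes);

# Step 2: encode those characters as UTF-8 bytes for output or storage.
my $utf8_bytes = encode('UTF-8', $chars);

printf "%d Latin-1 bytes -> %d UTF-8 bytes\n",
    length($latin1_bytes), length($utf8_bytes);

# Show the resulting bytes in hex; each 0xE4 becomes the pair c3 a4.
print join(' ', map { sprintf '%02x', ord } split //, $utf8_bytes), "\n";
```

Separating the decode and encode steps keeps the distinction between characters and bytes explicit, which is the distinction the single tr///CU switch blurred.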