This is a bug report for perl from mschilli1@aol.com, generated with the help of perlbug 1.28 running under perl v5.6.0.
-----------------------------------------------------------------
UTF8 support for the tr// operator doesn't seem to work properly. The following snippet should, as advertised in 'perldoc perlunicode', convert $string from latin1 to utf8:
    while (<>) {
        tr/\0-\xff//CU;    # latin1 char to utf8
    }
It throws two (compile time) warnings:
    Malformed UTF-8 character at ./t line 4.
    Malformed UTF-8 character at ./t line 4.

And the snippet below, when presented with latin1 chars, throws a "Segmentation fault (core dumped)":

    $latin1 = "Abc ääää";
    ($utf8 = $latin1) =~ tr/\0-\0177//CU;
Would be great if you guys could take a look.
Thanks,
-- Mike Schilli
> The following snippet should, as advertised in 'perldoc perlunicode', convert $string from latin1 to utf8:
>
>     while (<>) {
>         tr/\0-\xff//CU;    # latin1 char to utf8
>     }
Bleh. Yes, it should, but toke.c is incorrectly marking the left-hand side of that expression as being a Unicode string; if you say tr/\0-\xff//UC, it marks it as being non-Unicode. pmtrans actually expects a range of the form "Unicode char255 Unicode" even if it's converting C->U. Currently, it only Unicodifies if you're doing UC, so the right fix is to get toke.c to treat CU the same as UC: don't expand the range, but do convert the LHS to Unicode.
This does that:
I then tried this:

    #!/usr/bin/perl -w
    use Devel::Peek;

    $unistr = v300.202.203;
    Dump($unistr);
    ($bytestr=$unistr) =~ tr/\0-\x{ff}//UC;
    Dump($bytestr);
    ($unistr2=$bytestr) =~ tr/\0-\xff//CU;
    Dump($unistr2);
And got:

    SV = PV(0xa04142c) at 0xa053c98
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0xa048578 "\304\254\303\212\303\213"\0
      CUR = 6
      LEN = 7
    SV = PV(0xa041480) at 0xa058fe0
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0xa048550 ",\312\313"\0
      CUR = 3
      LEN = 7
    SV = PV(0xa04151c) at 0xa06c3b8
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0xa0487f0 ",\303\212\303\213"\0
      CUR = 5
      LEN = 6
Which is fine apart from the fact that, amusingly, tr///CU fails to set Sv_UTF8. This patch fixes that:
And it now all plays nicely.
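For readers on a current Perl, where tr///CU no longer exists, the first Dump above can be cross-checked without Devel::Peek. The following is only a sketch, not part of the patch: it assumes Perl 5.8.1 or later, uses the core Encode module and the built-in utf8::is_utf8, and the variable names are illustrative. It builds the same three code points, reports whether the UTF8 flag is set, and prints the UTF-8 bytes in the octal notation Dump uses.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The same three code points as v300.202.203 in the test script above.
my $unistr = join '', map { chr($_) } (300, 202, 203);

# chr(300) is above 255, so Perl has to store this string with the UTF8 flag on.
print "UTF8 flag: ", (utf8::is_utf8($unistr) ? "on" : "off"), "\n";

# Render the UTF-8 encoding as octal escapes, the notation Devel::Peek prints.
my $octal = join '', map { sprintf '\\%03o', ord($_) } split //, encode('UTF-8', $unistr);
print "$octal\n";
```

The six bytes printed should match the PV of the first SV in the Dump output, which is the flagged string a correct C->U conversion is meant to produce.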
I am working on making UTF8 treatment the default and deprecating utf8.pm; demand-loading the tables at the right place is the tricky bit.
> And the snippet below, when presented with latin1 chars, throws a "Segmentation fault (core dumped)":

Yep, I reported that before. Looks like it's fixed in perl-current.

> UTF8 support for the tr// operator doesn't seem to work properly.

Does now. :)
Simon
On Mon, 08 May 2000 15:21:10 +0900, simon.p.cozens@jp.pwcglobal.com wrote:
> > UTF8 support for the tr// operator doesn't seem to work properly.
>
> Does now. :)
Please note: Larry wants tr///CU/UC removed entirely rather than fixed, since it is a rather limiting interface. The intent is to replace it with Unicode::Map. If you have tuits to help integrate that into the distribution, let me know.
Sarathy gsar@ActiveState.com
The tr///CU feature has been *removed* in 5.7.0, and will also be removed in 5.6.1, because the interface was a mistake. For similar functionality, there is the new pack('U0', ...).
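As a rough illustration of the latin1-to-utf8 conversion the original report was after, here is a minimal sketch for later Perls. Note that it swaps in the Encode module (core since Perl 5.8) rather than the pack('U0', ...) approach named above; the sample string and variable names are assumptions made for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Latin-1 input: "Abc " followed by four a-umlaut bytes (0xE4).
my $latin1_bytes = "Abc \xe4\xe4\xe4\xe4";

# Decode the Latin-1 bytes into Perl characters, then encode those as UTF-8 bytes.
my $chars      = decode('ISO-8859-1', $latin1_bytes);
my $utf8_bytes = encode('UTF-8', $chars);

printf "%d Latin-1 bytes -> %d UTF-8 bytes\n",
    length($latin1_bytes), length($utf8_bytes);
```

Each 0xE4 byte becomes the two-byte sequence 0xC3 0xA4, so the encoded string is four bytes longer than the input.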
Migrated from rt.perl.org#3215 (status was 'resolved')
Searchable as RT3215$