Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

ord returns zero for valid characters #4632

Closed p5pRT closed 21 years ago

p5pRT commented 22 years ago

Migrated from rt.perl.org#7961 (status was 'resolved')

Searchable as RT7961$

p5pRT commented 22 years ago

From root@schmorp.de

Same original program as my previous bug report, different bug (I hope it's a bug ;), and I don't even use utf8_on:

  use Encode;

  $x = "\x{c4}nderung";
  $x = encode "utf-8", $x;
  # $x is now utf-8 encoded internally. not that it should matter

  $x =~ s/[\x00-\x1f\x80-\x9f]/sprintf "\\x%02x", ord $1/ge;

  print "$x\n";

This prints:

Ã\x00nderung

But I think it should simply print the utf-8 version of the string. "use utf8" doesn't make a difference, nor should it make a difference. I am still trying to reproduce the original problem I wanted to report, though ;)

Perl Info

```
Flags:
    category=core
    severity=high

Site configuration information for perl v5.7.2:

Configured by root at Sat Nov 24 03:47:08 CET 2001.

Summary of my perl5 (revision 5.0 version 7 subversion 2 patch 13229) configuration:
  Platform:
    osname=linux, osvers=2.4, archname=i686-linux-stdio
    uname='linux cerebro 2.4.8-ac9 #7 smp thu aug 30 00:15:46 cest 2001 i686 unknown '
    config_args=''
    hint=previous, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=undef d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=y, bincompat5005=undef
  Compiler:
    cc='gcc-2.95.4', ccflags ='-fno-strict-aliasing -I/usr/local/include -I/opt/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-g -Os -march=pentium -mcpu=pentium -funroll-loops',
    cppflags='-fno-strict-aliasing -I/usr/local/include -I/opt/include -fno-strict-aliasing -I/usr/local/include -I/opt/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    ccversion='', gccversion='2.95.4 20010319 (prerelease)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc-2.95.4', ldflags ='-L/usr/local/lib -L/opt/lib'
    libpth=/usr/local/lib /lib /usr/lib /opt/lib
    libs=-ldl -lm -lc -lcrypt
    perllibs=-ldl -lm -lc -lcrypt
    libc=/lib/libc-2.2.4.so, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib -L/opt/lib'

Locally applied patches:
    DEVEL13323

@INC for perl v5.7.2:
    /usr/app/lib/perl5
    /usr/app/lib/perl5
    /usr/app/lib/perl5
    /usr/app/lib/perl5
    .

Environment for perl v5.7.2:
    HOME=/root
    LANG (unset)
    LANGUAGE (unset)
    LC_CTYPE=de_DE
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/root/s2:/root/s:/opt/qt/bin:/bin:/usr/bin:/usr/app/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/usr/app/bin:/usr/app/sbin:/usr/X11/bin:/opt/jdk118/bin:/opt/bin:/opt/sbin:.:/root/cc/dejagnu/bin
    PERLDB_OPTS=ornaments=0
    PERL_BADLANG (unset)
    SHELL=/bin/bash
```

p5pRT commented 22 years ago

From @jhi

Ummm. No.

  $bytes = encode(ENCODING, $string[, CHECK])

  Encodes string from Perl's internal form into ENCODING and returns a
  sequence of octets. For CHECK see the Handling Malformed Data entry
  elsewhere in this document.

Note the "returns a sequence of octets". encode() does correctly convert the Latin-1 octet \xc4 into UTF-8 octets 0xc3 0x84, but it is a sequence of octets, not marked as UTF-8. Devel::Peek::Dump():

  SV = PV(0x140002878) at 0x140013a90
    REFCNT = 1
    FLAGS = (POK,pPOK)
    PV = 0x1400f8d60 "\303\204nderung"\0
    CUR = 9
    LEN = 10

Hmmm. I can see what you think should happen but that's unfortunately not quite what encode() does. Maybe some new interface is required.

--
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
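
The point about encode() returning unflagged octets can be checked with a short script. This is a minimal sketch, assuming a current Encode and the built-in utf8::is_utf8() function; the variable names are illustrative and not taken from the ticket.

```perl
use strict;
use warnings;
use Encode qw(encode);

# An 8-character string whose first character is U+00C4.
my $string     = "\x{c4}nderung";
my $char_count = length $string;

# encode() returns octets: U+00C4 becomes the two bytes 0xC3 0x84.
my $octets      = encode("UTF-8", $string);
my $octet_count = length $octets;

# The result is a plain byte string, so the internal UTF8 flag is off.
my $flagged = utf8::is_utf8($octets) ? 1 : 0;

printf "chars=%d octets=%d utf8_flag=%d\n",
    $char_count, $octet_count, $flagged;
# prints: chars=8 octets=9 utf8_flag=0
```

The octet count of 9 and the missing flag correspond to CUR = 9 and FLAGS = (POK,pPOK) in the dump above.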

p5pRT commented 22 years ago

From @jhi

Duh. I think the interface you need is decode():

  $x = "\x{c4}nderung";
  $x = decode("latin1", $x);

  SV = PV(0x140120c18) at 0x1400b6b80
    REFCNT = 1
    FLAGS = (POK,pPOK,UTF8)
    PV = 0x140131960 "\303\204nderung"\0 [UTF8 "\x{c4}\x{6e}\x{64}\x{65}\x{72}\x{75}\x{6e}\x{67}"]
    CUR = 9
    LEN = 10
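
For comparison, decode() goes the other way, from octets to a character string. The sketch below is an illustration only, assuming a current Encode; the variable names are made up for the example. It decodes the same word from Latin-1 octets and from UTF-8 octets and checks that both give the same 8-character string.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $latin1_octets = "\xc4nderung";        # 8 octets, Latin-1 encoding
my $utf8_octets   = "\xc3\x84nderung";    # 9 octets, UTF-8 encoding

# decode() turns octets in the named encoding into Perl characters.
my $from_latin1 = decode("latin1", $latin1_octets);
my $from_utf8   = decode("UTF-8",  $utf8_octets);

printf "lengths=%d,%d same=%d first=U+%04X\n",
    length $from_latin1,
    length $from_utf8,
    ($from_latin1 eq $from_utf8 ? 1 : 0),
    ord $from_latin1;
# prints: lengths=8,8 same=1 first=U+00C4
```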

p5pRT commented 22 years ago

From [Unknown Contact. See original ticket]

Yes, I realized that immediately after hitting send ;) I also realized that the subject is buggy (there was too much time between the subject and the mail).

I wanted to use decode, but that wouldn't have done it, either. This happens because I am converting from Convert::Scalar to Encode, but the Encode API is cumbersome to use, as it doesn't (by design ;) specify the internal encoding ;)

More interesting is that the bug shows even on non-utf8 strings. One thing, though: this bug report was flawed, too:

  $x =~ s/[\x00-\x1f\x80-\x9f]/sprintf "\\x%02x", ord $1/ge;

should of course be:

  $x =~ s/([\x00-\x1f\x80-\x9f])/sprintf "\\x%02x", ord $1/ge;

Now it only shows the behaviour when $x is indeed utf-8 encoded.
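
The difference between the two substitutions can be reproduced on the encoded octets directly. This is a hedged sketch: the helper names escape_without_capture and escape_controls are hypothetical, and the input is the byte string that encode "utf-8" produces for the example word.

```perl
use strict;
use warnings;

# Hypothetical helper: the substitution as first reported.
# The pattern has no capturing group, so $1 is undef and ord gives 0.
sub escape_without_capture {
    my ($octets) = @_;
    no warnings 'uninitialized';    # silence the "uninitialized $1" warning
    $octets =~ s/[\x00-\x1f\x80-\x9f]/sprintf "\\x%02x", ord $1/ge;
    return $octets;
}

# Hypothetical helper: the corrected substitution.
# The parentheses capture the matched octet into $1.
sub escape_controls {
    my ($octets) = @_;
    $octets =~ s/([\x00-\x1f\x80-\x9f])/sprintf "\\x%02x", ord $1/ge;
    return $octets;
}

my $octets = "\xc3\x84nderung";    # UTF-8 octets of the example word

my $broken = escape_without_capture($octets);    # 0xC3, then literal \x00
my $fixed  = escape_controls($octets);           # 0xC3, then literal \x84

print "$broken\n$fixed\n";
```

Only the octet 0x84 falls in the \x80-\x9f range, so it is the one that gets replaced; 0xC3 passes through unchanged, which is where the "Ã" in the reported output comes from.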

--
Marc Lehmann, pcg@goof.com, XX11-RIPE
The choice of a GNU generation

p5pRT commented 22 years ago

From [Unknown Contact. See original ticket]

Jarkko Hietaniemi <jhi@iki.fi> writes:

On Sat, Dec 01, 2001 at 11:37:05PM +0100, Marc Lehmann wrote:

This is a bug report for perl from root@cerebro.laendle, generated with the help of perlbug 1.33 running under perl v5.7.2.

Same original program as my previous bug report, different bug (I hope it's a bug ;), and I don't even use utf8_on:

  use Encode;

  $x = "\x{c4}nderung";
  $x = encode "utf-8", $x;
  # $x is now utf-8 encoded internally. not that it should matter

Ummm. No.

  $bytes = encode(ENCODING, $string[, CHECK])

  Encodes string from Perl's internal form into ENCODING and returns a
  sequence of octets. For CHECK see the Handling Malformed Data entry
  elsewhere in this document.

Note the "returns a sequence of octets". encode() does correctly convert the Latin-1 octet \xc4 into UTF-8 octets 0xc3 0x84, but it is a sequence of octets, not marked as UTF-8. Devel::Peek::Dump():

  SV = PV(0x140002878) at 0x140013a90
    REFCNT = 1
    FLAGS = (POK,pPOK)
    PV = 0x1400f8d60 "\303\204nderung"\0
    CUR = 9
    LEN = 10

Hmmm. I can see what you think should happen but that's unfortunately not quite what encode() does. Maybe some new interface is required.

But a sequence of octets is what he wants?

  $x =~ s/[\x00-\x1f\x80-\x9f]/sprintf "\\x%02x", ord($1)/ge;

Given that $x is now octets, why does s///e not yield what he wants? Because there is no $1, that's why: the pattern has no capturing parentheses.

What was meant was:

  $x =~ s/([\x00-\x1f\x80-\x9f])/sprintf "\\x%02x", ord($1)/ge;

print "$x\n";

This prints​:

Ã\x00nderung

But I think it should simply print the utf-8 version of the string. "use utf8" doesn't make a difference, nor should it make a difference. I still try to reproduce the original problem I wanted to report, though ;)

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/

p5pRT commented 21 years ago

From @jhi

I think this issue got resolved, so I'm marking the problem ticket as such.

p5pRT commented 21 years ago

@jhi - Status changed from 'open' to 'resolved'