ord returns zero for valid characters

p5pRT commented 22 years ago

Migrated from rt.perl.org#7961 (status was 'resolved')

Searchable as RT7961$

p5pRT commented 22 years ago

From root@schmorp.de

Same original program as my previous bug report, different bug (I hope it's a bug ;), and I don't even use utf8_on:

use Encode;

$x = "\x{c4}nderung"; $x = encode "utf-8", $x; # $x is now utf-8 encoded internally. not that it should matter

$x =~ s/[\x00-\x1f\x80-\x9f]/sprintf "\\x%02x", ord $1/ge;

print "$x\n";

This prints:

Ã\x00nderung

But I think it should simply print the utf-8 version of the string. "use utf8" doesn't make a difference, nor should it make a difference. I'm still trying to reproduce the original problem I wanted to report, though ;)

Perl Info

```
Flags:
    category=core
    severity=high

Site configuration information for perl v5.7.2:

Configured by root at Sat Nov 24 03:47:08 CET 2001.

Summary of my perl5 (revision 5.0 version 7 subversion 2 patch 13229) configuration:
  Platform:
    osname=linux, osvers=2.4, archname=i686-linux-stdio
    uname='linux cerebro 2.4.8-ac9 #7 smp thu aug 30 00:15:46 cest 2001 i686 unknown '
    config_args=''
    hint=previous, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=undef d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=y, bincompat5005=undef
  Compiler:
    cc='gcc-2.95.4', ccflags ='-fno-strict-aliasing -I/usr/local/include -I/opt/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-g -Os -march=pentium -mcpu=pentium -funroll-loops',
    cppflags='-fno-strict-aliasing -I/usr/local/include -I/opt/include -fno-strict-aliasing -I/usr/local/include -I/opt/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    ccversion='', gccversion='2.95.4 20010319 (prerelease)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc-2.95.4', ldflags ='-L/usr/local/lib -L/opt/lib'
    libpth=/usr/local/lib /lib /usr/lib /opt/lib
    libs=-ldl -lm -lc -lcrypt
    perllibs=-ldl -lm -lc -lcrypt
    libc=/lib/libc-2.2.4.so, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib -L/opt/lib'

Locally applied patches:
    DEVEL13323

@INC for perl v5.7.2:
    /usr/app/lib/perl5
    /usr/app/lib/perl5
    /usr/app/lib/perl5
    /usr/app/lib/perl5
    .

Environment for perl v5.7.2:
    HOME=/root
    LANG (unset)
    LANGUAGE (unset)
    LC_CTYPE=de_DE
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/root/s2:/root/s:/opt/qt/bin:/bin:/usr/bin:/usr/app/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/usr/app/bin:/usr/app/sbin:/usr/X11/bin:/opt/jdk118/bin:/opt/bin:/opt/sbin:.:/root/cc/dejagnu/bin
    PERLDB_OPTS=ornaments=0
    PERL_BADLANG (unset)
    SHELL=/bin/bash
```

p5pRT commented 22 years ago

From @jhi

Ummm. No.

$bytes = encode(ENCODING, $string[, CHECK])

Encodes string from Perl's internal form into ENCODING and returns a sequence of octets. For CHECK see the Handling Malformed Data entry elsewhere in this document.

Note the "returns a sequence of octets". encode() does correctly convert the Latin-1 octet \xc4 into UTF-8 octets 0xc3 0x84, but it is a sequence of octets, not marked as UTF-8. Devel::Peek::Dump():

SV = PV(0x140002878) at 0x140013a90 REFCNT = 1 FLAGS = (POK,pPOK) PV = 0x1400f8d60 "\303\204nderung"\0 CUR = 9 LEN = 10

Hmmm. I can see what you think should happen, but unfortunately that's not quite what encode() does. Maybe some new interface is required.
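A minimal sketch of the point above, assuming a modern Perl with the core Encode module (the variable names are illustrative and not from the original report): encode() hands back plain octets, so the result is one element longer than the character string and does not carry the internal UTF8 flag.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "\x{c4}nderung" is an 8-character string ("Änderung").
my $chars = "\x{c4}nderung";

# encode() returns octets: U+00C4 becomes the two bytes 0xC3 0x84.
my $octets = encode("UTF-8", $chars);

my $char_len   = length $chars;
my $octet_len  = length $octets;
my $is_flagged = utf8::is_utf8($octets) ? 1 : 0;   # internal UTF8 flag, expected off
my $hex        = join " ", map { sprintf "%02x", ord($_) } split //, $octets;

print "characters: $char_len\n";
print "octets:     $octet_len\n";
print "UTF8 flag:  $is_flagged\n";
print "bytes:      $hex\n";
```

The byte 0x84 falls inside the \x80-\x9f range of the substitution in the report, which is why the pattern matches at that position.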

-- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen

p5pRT commented 22 years ago

From @jhi

Duh. I think the interface you need is decode():

$x = "\x{c4}nderung"; $x = decode("latin1", $x);

SV = PV(0x140120c18) at 0x1400b6b80 REFCNT = 1 FLAGS = (POK,pPOK,UTF8) PV = 0x140131960 "\303\204nderung"\0 [UTF8 "\x{c4}\x{6e}\x{64}\x{65}\x{72}\x{75}\x{6e}\x{67}"] CUR = 9 LEN = 10
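As a rough illustration of the decode() route, assuming a current Encode (variable names are made up for this sketch): decoding Latin-1 octets yields a character string whose first character is still U+00C4, and only an explicit encode() turns it into the nine UTF-8 octets.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Eight Latin-1 octets; the first one is 0xC4.
my $latin1_octets = "\xc4nderung";

# decode() turns octets into a Perl character string.
my $string     = decode("latin1", $latin1_octets);
my $string_len = length $string;
my $first_char = sprintf "%02x", ord substr($string, 0, 1);

# Going back out to UTF-8 octets is a separate, explicit step.
my $utf8_octets = encode("UTF-8", $string);
my $utf8_len    = length $utf8_octets;

print "characters: $string_len, first char: U+00$first_char\n";
print "UTF-8 octets: $utf8_len\n";
```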


p5pRT commented 22 years ago

From [Unknown Contact. See original ticket]

Yes, I realized that immediately after hitting send ;) I also realized that the subject is buggy (there was too much time between the subject and the mail).

I wanted to use decode, but that wouldn't have done it, either. This happens because I am converting from Convert::Scalar to Encode, but the Encode API is cumbersome to use, as it doesn't (by design ;) specify the internal encoding ;)

More interesting is that the bug shows even on non-utf8 strings. One thing, though: this bug report was flawed, too:

$x =~ s/[\x00-\x1f\x80-\x9f]/sprintf "\\x%02x", ord $1/ge;

should of course be:

$x =~ s/([\x00-\x1f\x80-\x9f])/sprintf "\\x%02x", ord $1/ge;

Now it only shows the behaviour when $x is indeed utf-8 encoded.
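The difference between the two substitutions can be sketched as follows. This is only an illustration: the `show` helper and variable names are hypothetical, and the input is written out directly as the UTF-8 octets of the string from the report. Without a capture group, $1 is undefined in the replacement, and ord of an undefined value is 0, which is where \x00 comes from.

```perl
use strict;
use warnings;

# UTF-8 octets of "\x{c4}nderung": 0xC3 0x84 followed by "nderung".
my $octets = "\xc3\x84nderung";

# No capture group: $1 is undef, so ord($1) is 0.
my $without = $octets;
{
    no warnings 'uninitialized';   # ord(undef) would otherwise warn
    $without =~ s/[\x00-\x1f\x80-\x9f]/sprintf "\\x%02x", ord $1/ge;
}

# With a capture group: $1 holds the matched octet 0x84.
my $with = $octets;
$with =~ s/([\x00-\x1f\x80-\x9f])/sprintf "\\x%02x", ord $1/ge;

# Hypothetical helper: render non-ASCII octets visibly as <hh>.
sub show {
    my ($s) = @_;
    return join "", map { ord($_) > 0x7e ? sprintf("<%02x>", ord $_) : $_ } split //, $s;
}

print "without capture: ", show($without), "\n";
print "with capture:    ", show($with), "\n";
```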

-- Marc Lehmann, pcg@goof.com, XX11-RIPE. The choice of a GNU generation

p5pRT commented 22 years ago

From [Unknown Contact. See original ticket]

Jarkko Hietaniemi <jhi@iki.fi> writes:

On Sat, Dec 01, 2001 at 11:37:05PM +0100, Marc Lehmann wrote:

This is a bug report for perl from root@cerebro.laendle, generated with the help of perlbug 1.33 running under perl v5.7.2.

----------------------------------------------------------------- [Please enter your report here]

Same original program as my previous bug report, different bug (I hope it's a bug ;), and I don't even use utf8_on:

use Encode;

$x = "\x{c4}nderung"; $x = encode "utf-8", $x; # $x is now utf-8 encoded internally. not that it should matter

Ummm. No.
         $bytes  = encode(ENCODING, $string[, CHECK])

 Encodes string from Perl's internal form into ENCODING and returns a
 sequence of octets.  For CHECK see the Handling Malformed Data entry
 elsewhere in this document.

Note the "returns a sequence of octets". encode() does correctly convert the Latin-1 octet \xc4 into UTF-8 octets 0xc3 0x84, but it is a sequence of octets, not marked as UTF-8. Devel::Peek::Dump():

SV = PV(0x140002878) at 0x140013a90 REFCNT = 1 FLAGS = (POK,pPOK) PV = 0x1400f8d60 "\303\204nderung"\0 CUR = 9 LEN = 10

Hmmm. I can see what you think should happen, but unfortunately that's not quite what encode() does. Maybe some new interface is required.

But a sequence of octets is what he wants?

$x =~ s/[\x00-\x1f\x80-\x9f]/sprintf "\\x%02x", ord($1)/ge;

Given that $x is now octets, why does s///e not yield what he wants? Because there is no $1, that's why.

What was meant was:

$x =~ s/([\x00-\x1f\x80-\x9f])/sprintf "\\x%02x", ord($1)/ge;

-- Nick Ing-Simmons http://www.ni-s.u-net.com/

p5pRT commented 21 years ago

From @jhi

I think this issue got resolved, so I'm marking the problem ticket as such.

p5pRT commented 21 years ago

@jhi - Status changed from 'open' to 'resolved'
