Same original program as my previous bug report, different bug (I hope it's a bug ;), and I don't even use utf8_on:
use Encode;
$x = "\x{c4}nderung"; $x = encode "utf-8", $x; # $x is now utf-8 encoded internally. not that it should matter
$x =~ s/[\x00-\x1f\x80-\x9f]/sprintf "\\x%02x", ord $1/ge;
print "$x\n";
This prints:
Ã\x00nderung
But I think it should simply print the utf-8 version of the string. "use utf8" doesn't make a difference, nor should it make a difference. I am still trying to reproduce the original problem I wanted to report, though ;)
Ummm. No.
$bytes = encode(ENCODING, $string[, CHECK])
Encodes string from Perl's internal form into ENCODING and returns a sequence of octets. For CHECK see the Handling Malformed Data entry elsewhere in this document.
Note the "returns a sequence of octets". encode() does correctly convert the Latin-1 octet \xc4 into UTF-8 octets 0xc3 0x84, but it is a sequence of octets, not marked as UTF-8. Devel::Peek::Dump():
SV = PV(0x140002878) at 0x140013a90
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x1400f8d60 "\303\204nderung"\0
  CUR = 9
  LEN = 10
Hmmm. I can see what you think should happen but that's unfortunately not quite what encode() does. Maybe some new interface is required.
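To make the "sequence of octets" point concrete, here is a minimal sketch, assuming a reasonably modern perl with the bundled Encode module. The variable names are illustrative and not taken from the report.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars  = "\x{c4}nderung";           # 8 characters, the first is U+00C4
my $octets = encode("UTF-8", $chars);   # byte string holding the UTF-8 encoding

# The result is one element longer, because U+00C4 becomes 0xC3 0x84.
printf "length before encode: %d\n", length $chars;
printf "length after encode:  %d\n", length $octets;

# encode() hands back plain octets, so the internal UTF8 flag is off.
printf "UTF8 flag on result:  %s\n", utf8::is_utf8($octets) ? "on" : "off";

# Show the individual octets in hex.
printf "octets: %s\n", join " ", map { sprintf "%02x", ord } split //, $octets;
```

The lengths 8 and 9 line up with the CUR = 9 shown in the dump above.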
-- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen
Duh. I think the interface you need is decode():
$x = "\x{c4}nderung"; $x = decode("latin1", $x);
SV = PV(0x140120c18) at 0x1400b6b80
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x140131960 "\303\204nderung"\0 [UTF8 "\x{c4}\x{6e}\x{64}\x{65}\x{72}\x{75}\x{6e}\x{67}"]
  CUR = 9
  LEN = 10
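A small sketch of the decode() direction, again assuming a modern perl with the bundled Encode; the variable names are made up for illustration. It treats the input as Latin-1 bytes and yields a character string, which can then be round-tripped through UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xc4nderung";               # Latin-1 bytes
my $chars = decode("latin1", $bytes);    # character string

# Still 8 characters, and the first one is still code point 0xC4;
# only the internal representation differs.
printf "length: %d, first code point: 0x%02x\n",
    length $chars, ord substr($chars, 0, 1);

# Report the internal flag (the dump above shows it set after decode()).
printf "UTF8 flag: %s\n", utf8::is_utf8($chars) ? "on" : "off";

# Round trip: encoding to UTF-8 octets and decoding again gives the
# same character string back.
my $again = decode("UTF-8", encode("UTF-8", $chars));
print "round trip ok\n" if $again eq $chars;
```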
Yes, I realized that immediately after hitting send ;) I also realized that the subject is buggy (there was too much time between the subject and the mail).
I wanted to use decode, but that wouldn't have done it, either. This happens because I am converting from Convert::Scalar to Encode, but the Encode API is cumbersome to use, as it doesn't (by design ;) specify the internal encoding ;)
More interesting is that the bug shows even on non-utf8 strings. One thing, though: this bug report was flawed, too:
$x =~ s/[\x00-\x1f\x80-\x9f]/sprintf "\\x%02x", ord $1/ge;
should of course be:
$x =~ s/([\x00-\x1f\x80-\x9f])/sprintf "\\x%02x", ord $1/ge;
Now it only shows the behaviour when $x is indeed utf-8 encoded.
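The effect of the missing parentheses can be sketched on a plain byte string. This is only an illustration, assuming a modern perl; it says nothing about how 5.7.2 behaved on utf-8 flagged strings. The variable names are made up.

```perl
use strict;
use warnings;

my $octets = "\xc3\x84nderung";   # UTF-8 octets for "Änderung" as a byte string

# Without a capture group $1 is undefined inside the replacement, and
# ord() of an undefined value is 0, so every match becomes "\x00".
my $without = $octets;
{
    no warnings 'uninitialized';  # silence the warning about the undefined $1
    $without =~ s/[\x00-\x1f\x80-\x9f]/sprintf "\\x%02x", ord $1/ge;
}

# With the capture group $1 holds the matched octet, so 0x84 is escaped.
# 0xC3 lies outside both ranges and is left alone.
my $with = $octets;
$with =~ s/([\x00-\x1f\x80-\x9f])/sprintf "\\x%02x", ord $1/ge;

binmode STDOUT;                   # print the raw octets untouched
print "$without\n";
print "$with\n";
```

Viewed on a Latin-1 terminal, the first line reads as Ã\x00nderung, matching the output in the original report.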
-- Marc Lehmann, pcg@goof.com, XX11-RIPE. The choice of a GNU generation
Jarkko Hietaniemi <jhi@iki.fi> writes:
On Sat, Dec 01, 2001 at 11:37:05PM +0100, Marc Lehmann wrote:
This is a bug report for perl from root@cerebro.laendle, generated with the help of perlbug 1.33 running under perl v5.7.2.
----------------------------------------------------------------- [Please enter your report here]
Same original program as my previous bug report, different bug (I hope it's a bug ;), and I don't even use utf8_on:
use Encode;
$x = "\x{c4}nderung"; $x = encode "utf-8", $x; # $x is now utf-8 encoded internally. not that it should matter
Ummm. No.
$bytes = encode(ENCODING, $string[, CHECK])
Encodes string from Perl's internal form into ENCODING and returns a sequence of octets. For CHECK see the Handling Malformed Data entry elsewhere in this document.
Note the "returns a sequence of octets". encode() does correctly convert the Latin-1 octet \xc4 into UTF-8 octets 0xc3 0x84, but it is a sequence of octets, not marked as UTF-8. Devel::Peek::Dump():
SV = PV(0x140002878) at 0x140013a90
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x1400f8d60 "\303\204nderung"\0
  CUR = 9
  LEN = 10
Hmmm. I can see what you think should happen but that's unfortunately not quite what encode() does. Maybe some new interface is required.
But a sequence of octets is what he wants?
$x =~ s/[\x00-\x1f\x80-\x9f]/sprintf "\\x%02x", ord($1)/ge;
Given that $x is now octets, why does s///e not yield what he wants? Because there is no $1, that's why.
What was meant was
$x =~ s/([\x00-\x1f\x80-\x9f])/sprintf "\\x%02x", ord($1)/ge;
-- Nick Ing-Simmons http://www.ni-s.u-net.com/
I think this issue got resolved, so I'm marking the problem ticket as such.
@jhi - Status changed from 'open' to 'resolved'
Migrated from rt.perl.org#7961 (status was 'resolved')
Searchable as RT7961$