decode_utf8 sets utf8 flag on plain ascii strings

Perl / perl5

decode_utf8 sets utf8 flag on plain ascii strings #8779

Closed p5pRT closed 12 years ago

p5pRT commented 17 years ago

Migrated from rt.perl.org#41527 (status was 'rejected')

Searchable as RT41527$

p5pRT commented 17 years ago

From schmorp@schmorp.de

On Sat, Mar 31, 2007 at 01:27:14AM +0200, Juerd Waalboer <juerd@convolution.nl> wrote:

If a downgrade is "needed", it means that your byte string was accidentally upgraded. This should only happen if you mix it with a text string. If it happens without mixing it with a text string, that is a bug. Please report.

That's extremely far from reality. Lots of things can cause a text string to be upgraded. Forcing people to learn all that is just stupid when you could just make it work logically without telling people about internals (note that the internals come into play through your peculiar definition of "text strings" as having the UTF-X bit set, which isn't reality and in my opinion is an extremely stupid limitation that 96% of perl does not follow).

Instead: "your code is broken, don't mix text strings with byte strings" or "it is a bug in perl that your string got upgraded in the first place."

See my json example. Nothing gets mixed.

Exactly. But "C" somehow works on UTF-8\, while it shouldn't.

Agreed!

Things that specifically handle bytes, and bytes only, should DIE (or at least warn) when used with a string that has the UTF-8 flag on.

So you force people to know about the internal flag, or else they cannot avoid the die.

This completely contradicts your claim that you want to abstract the UTF-X flag away from the Perl level.

still lets users get away with naively assuming that byte == character for latin1 strings, as designed, but at least catches the cases when you know that the user does something stupid.

But the user does not do anything stupid when feeding binary strings (my definition, indices 0..255) into Compress::Zlib. It is only your request for a die that makes problems. Zlib would work just fine if perl gave downgraded data to perl and XS code that wants it.

It should work on characters, as documented (just like in C, char array[]; array[i] is one character, regardless of how many bits a character in C has, or how it is encoded).

A C "char" is a byte, not a multibyte character, ever.

Exactly. The same as in Perl, I would assume: Perl uses characters to store bytes; it doesn't use multibyte characters on the Perl level.

Hope you get it this time :)

Besides that, the "C" in Perl's pack() is documented as a single byte.

"A C "char" is a byte".

Your words.

But here you say a byte is not a character. That's a contradiction.

You are deeply confusing the internal encoding Perl uses (which might be single octets for characters, or UTF-X encoded octets for characters) with the language proper.

In C, a single byte is a character, even if it happens to have a value higher than 255 (although very few compilers allow that; usually a byte is an octet, although it is common on DSPs to have 32-bit bytes).

Even if Perl encoded a single character into multiple C bytes/octets, that does not mean it's more than a single character.

The documentation is completely contradictory when it comes to "C" and can easily be interpreted to mean a single character in the C sense.

Fact is "even under Unicode" it doesn't work as advertised, because Unicode can be internally represented in multiple ways in Perl.

I think that "char value" should be either removed from perlfunc, or explained in more detail. It's NOT OBVIOUS to those who don't know C.

To those who do know C it has a perfectly clear meaning, namely a single character.

The earlier Perl versions didn't support character values greater than 255, and if you never have those characters, C still works perfectly.

Nothing in C limits you to 256 characters. A byte in C is exactly a character. It can store at least 256 different values, but nothing in C limits you to that; many compilers use larger bytes. And the same is true in Perl: Perl only supported bytes 0..255 in earlier versions, and now the perl byte can be up to 64 bits (or maybe a bit less, I forgot).

But yes, if you're dealing with characters and want your program to be able to handle those fancy new >255 characters, you should change that C to a U.

I do not want to handle those fancy >255 characters. I only want to handle a single octet. But unpack doesn't do that.

In fact, that's the problem: all old code that uses unpack "C" would need to be changed to use "U". That's the compatibility breakage I was talking about. Code that uses "C" expects the single-octet meaning from perl 5.005; it does not expect the "sometimes returns half of a utf-x encoded character, sometimes not" meaning it has in current perls.

It is especially weird as it has suddenly become incompatible with the other template characters such as "n", which correctly decode bytes regardless of internal encoding.
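A way to check this complaint on any given perl is to unpack the same data twice, once stored as plain octets and once after utf8::upgrade. The sketch below is an illustration added here, not code from the thread; the variable names are made up. The thread reports that 5.8.x returns different values for "C" in the two cases. Perl 5.10 and later reworked pack/unpack to process strings character by character, so there I would expect both lines to print the same numbers, but it is worth running rather than assuming.

```perl
use strict;
use warnings;

# One octet (0xFC) followed by a 32-bit big-endian integer (256).
my $plain = "\xfc\x00\x00\x01\x00";

# The same five characters, but stored internally as UTF-8.
my $upgraded = $plain;
utf8::upgrade($upgraded);

my @from_plain    = unpack "CN", $plain;
my @from_upgraded = unpack "CN", $upgraded;

print "plain:    @from_plain\n";
print "upgraded: @from_upgraded\n";

# On the Perl level the two strings are indistinguishable.
print "strings compare equal\n" if $plain eq $upgraded;
```

If the two printed lines differ, the perl in use exposes the internal representation through unpack, which is exactly the behaviour being objected to.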

Besides, perl 5.8 does not follow that description:

    perl -e '$x = "\xc3\xbc"; die unpack "U*", $x'

This gives me 195188, two characters, although it is a single UTF-8 character, so why does it wrongly give me two? $x certainly is utf-8-encoded (try Encode::encode_utf8 chr 252, it results in the above string).

You asked for the codepoints U+00C3 and U+00BC, and got them.
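The two readings of that string can be separated without touching unpack at all. The following sketch (added for illustration, assuming only the core Encode module) looks at the string from the one-liner above with ord, then decodes it explicitly.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# The two-octet string from the one-liner.
my $x = "\xc3\xbc";

# As a Perl string it holds two characters: 0xC3 (195) and 0xBC (188).
my @ords = map { ord } split //, $x;
print "characters in \$x: @ords\n";

# It is also what UTF-8 encoding the single character 252 produces ...
my $encoded = encode_utf8(chr 252);
print "encode_utf8(chr 252) matches \$x: ",
    ($encoded eq $x ? "yes" : "no"), "\n";

# ... and decoding it explicitly gives that single character back.
my $decoded = decode_utf8($x);
printf "decoded: length %d, ord %d\n", length $decoded, ord $decoded;
```

Perl does not guess that $x is UTF-8; the string only becomes one character once something decodes it.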

No, I asked for UTF-8 encoded characters. Again, read the documentation:

* If the pattern begins with a "U", the resulting string will be treated as UTF-8-encoded Unicode.

That's for pack, unfortunately.

U A Unicode character number. Encodes to UTF-8 internally

Uh, that internal thing again. So how many characters will pack "U", 200 give me? According to the documentation, 2, as UTF-8 requires that. That is not what happens, though.
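The question about pack "U", 200 is easy to answer empirically. This is a small sketch added for illustration (it assumes the core Encode module and nothing else): it counts the characters pack returns and compares that with an explicit UTF-8 encoding of the same string.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Build a string from one Unicode character number.
my $packed = pack "U", 200;
printf "pack 'U', 200: %d character(s), ord %d\n",
    length $packed, ord $packed;

# Explicitly encoding to UTF-8 is what yields two octets.
my $octets = encode_utf8($packed);
printf "after encode_utf8: %d octet(s)\n", length $octets;
```

The "UTF-8" in the documentation describes how the one character is stored, not how many characters the program sees.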

That's the problem. Perfectly working code using unpack "CN" suddenly stops working because "N" works on bytes, while "C" works on the internal encoding, regardless of what that might be.

It's a UTF-8 encoded byte string, alright, but "U" is for Unicode, not UTF-8.

You can store unicode in UTF-8. If you say "UTF-8 encoded unicode" then you very well have UTF-8, even though it still is unicode.

Ok, so I will tell people to replace "C" by "U" in their code then.

If they do Unicode text strings, that's indeed very good advice.

Unfortunately, that's what they have to do when dealing with binary strings, as C doesn't work on them.

But you still want C for byte strings, simply because some protocols or formats expect a byte value. :)

Exactly. And then I have to use "U" to get it. Because a byte in perl is a character. Is and always has been, just as in C.

And to get those bytes for use in such protocols you have to use "U" now, instead of "C" as in earlier versions.

Right, while the documentation on unpack "U" disagrees with it, as it talks about UTF-8.

That would be a bug, but I can't find it in my copy (5.8.8). It only says "Encodes to UTF-8 internally" for pack(), which as far as I can tell, is true.

So it talks about using UTF-8, so, according to you, it is a bug. Fine with me.

--
The choice of a GNU generation
Marc Lehmann, pcg@goof.com, http://schmorp.de/, XX11-RIPE

p5pRT commented 17 years ago

From @Juerd

Marvin Humphrey wrote 2007-03-30 16:06 (-0700):

I strongly disagree with this assessment. In particular, I think insisting that the user be responsible for manually segregating character and byte-oriented data without any help from Perl is totally unreasonable.

That is okay. You are not alone.

In fact, I would also like to have real types.

But Perl has had its current model for quite a while. If you don't agree, there are a few things that you can do:

1. Find a way to do it better, in a backwards compatible way, and then either
   1a. implement it yourself
   1b. document it and hope that someone else has the tuits

2. Find a way to do it better, in a non-backwards compatible way, and then either
   2a. fork perl and implement it yourself
   2b. document it and hope that someone else has the tuits

3. Just use the tools that Perl currently provides.

Any of these are totally valid options. Gerard Goossen, for example, has picked 2a. I picked 3. Which one is your favourite?

I hope that Perl 6 does not opt to replicate Perl 5's behavior in this area (my understanding is that it will not, but I'm not following development closely).

You are right. Perl 6 will have distinct byte string types ("buf" for "buffer"), and character string types ("str" for "string").

I guess Perl 6 follows the 2nd path, with both a and b simultaneously :)

How about encouraging the use of encoding::warnings in perlunitut?

See my other post about why this module is not what you want.

--
Kind regards,

juerd waalboer: perl hacker <juerd@juerd.nl> <http://juerd.nl/sig>
convolution: ict solutions and consultancy <sales@convolution.nl>

I do not trust voting computers. See <http://www.wijvertrouwenstemcomputersniet.nl/>.

p5pRT commented 17 years ago

From nospam-abuse@bloodgate.com


Moin,

On Friday 30 March 2007 23:06:47 Marvin Humphrey wrote:

On Mar 30, 2007, at 2:25 PM, Juerd Waalboer wrote:

That so many users, including those as expert as Marc, possess a "broken" understanding of Perl's Unicode model suggests a flawed design.

I think the design is solid, but the implementation (see regex) slightly broken and documentation wildly misleading.

I strongly disagree with this assessment. In particular, I think insisting that the user be responsible for manually segregating character and byte-oriented data without any help from Perl is totally unreasonable.

Look at how easily Marc made the "mistake" of commingling the two types of data. It's debatable whether the fact that Perl allowed him to do that without complaint is a flaw with the design or the implementation, but it's one or the other and it's serious.

Additionally, as Marc points out, there are lots of broken XS modules out there -- including one of mine. (KinoSearch 0.15 -- Unicode support is fixed as of 0.20_01, which breaks backwards compatibility.) Few or none of them would be broken if Perl made it more difficult to move between character data and byte-oriented data -- errors would be flying right and left and the broken modules would get fixed right away.

Of course I understand why that cannot be the case, but it's astonishing to me that you see this as a problem which can be solved via documentation.

I think just documenting isn't enough. We do have things like "strict", so if the current Perl model doesn't allow you to even detect when you mix the wrong kind of data, then we need a module/pragma that catches these errors.

Of course warnings::encode exists, but it seems to not be able to distinguish between "untagged" data and real ISO-8859-1 strings as Perl itself doesn't make this distinction.

How about encouraging the use of encoding::warnings in perlunitut?

How about adding it to core and having 'use 5.10;' turn it on?

If I understand correctly, that would not be enough due to the "is this binary or really iso-8859-1 encoded data" problem mentioned above.

all the best,

tels


p5pRT commented 17 years ago

From @Juerd

Marc Lehmann wrote 2007-03-31 1:33 (+0200):

The difference between us, and that's what it boils down to, is that you give the internal UTF-X bit meaning. You equate UTF-X flag set == Unicode string.

No, that's a unidirectional thing.

I've said it on p5p at least a dozen times, but I'll say it again:

If the UTF8 flag is set, you can be sure that you have a text string. If the UTF8 flag is not set, it can be either a byte string or a text string.

If you have a text string, the UTF8 flag may or may not be set. If you have a byte string, the UTF8 flag is not set (or it was set because you treated the byte string as a text string).
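The one-way nature of that rule can be seen by inspecting a few strings. The sketch below is an illustration added here, with made-up variable names; it uses utf8::is_utf8 purely as a debugging aid, which is the only use the posters in this thread would endorse.

```perl
use strict;
use warnings;

my $bytes = "\x89PNG";                 # intended as binary data
my $latin = "caf\xe9";                 # intended as text, all characters < 256
my $wide  = "caf\xe9 \x{263a}";        # text with a character > 255

# Same characters as $latin, different internal storage.
my $upgraded = $latin;
utf8::upgrade($upgraded);

for my $pair ([bytes => $bytes], [latin => $latin],
              [upgraded => $upgraded], [wide => $wide]) {
    my ($name, $str) = @$pair;
    printf "%-8s flag %s\n", $name, utf8::is_utf8($str) ? "on" : "off";
}

# The flag differs, the string value does not.
print "latin and upgraded compare equal\n" if $latin eq $upgraded;
```

A cleared flag says nothing about whether the programmer meant bytes or text: $bytes and $latin look the same to Perl.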

The problem with your approach is that you have to expose the UTF-X flag to users. Which comes with a lot of problems.

Again: you're kidding, right?

I'm constantly very explicitly and verbosely telling people to NOT look at the flag, NOT set it manually, etcetera.

Heck, I've even explained that I think you should try to (pretend to) be ignorant about the internals, in response to your message even!

I do not understand how you are able to misinterpret this message even after this many posts in this thread alone. Have you ever read perlunitut, even?

Initially I thought you, too, wanted a unicode model where the UTF-X bit is not exposed to the perl level. But in fact the opposite is true: you force knowledge of the UTF-X bit on users, even though it should be transparent. ... the problem is you want them to track the UTF-X flag in addition to that. ... Then why do you want to force people to know about how 128..255 is encoded internally then?

That's not what I said, nor what I meant. In fact, quite the opposite.

If you're just spending this evening just to get on my nerves, then congratulations!

Oh, but they do. Please read perlunitut, which tries to redefine the universe into four important definitions (and succeeds).

I do not have that manpage.

http://www.google.com/search?q=perlunitut&btnI=I'm+Feeling+Lucky

Because "internal format" strings can store binary data just as well, and often do.

Yes, and when you use such a byte string as a text string, its bytes are considered to be codepoints, just like in latin1.

I am talking purely about the perl level strings. If perlunitut confused the issue by talking about internal encoding it completely failed its mission, imho.

I strongly suggest that you READ the document before whining about its supposed failure.

The problem is that some parts of perl make a difference between the very same string, depending on how it is encoded internally, _even if the encoding is the same on the Perl level_.

Those are bugs. Report them, and they might get fixed.

utf8::encode is a text operation. It will assume that whatever you give it, is a text string. Its characters are considered Unicode codepoints.

Where does it say so?

Well, you have already denied that "encoding is going from characters to bytes" is a real world fact, so I guess there's little point in pointing out the places where exactly the same thing is explained.

you need to know some internals.

Wrong. I need to know no internals

A certain Marc Lehmann once said:

"I would love if that were the case, but the powers that be decided that every perl programmer has to know those internals, and needs to be able to deal with them."

That makes no sense, because UTF-8 is a means of representing characters. Byte strings consist of bytes, not characters.

Not in C, which is what the documentation constantly refers to, mind you.

And that is bad, I agree. Perl programmers should not be expected to speak C in order to understand Perl documentation. This is a big problem in Perl's documentation, but who's going to fix it?


p5pRT commented 17 years ago

From schmorp@schmorp.de

On Sat, Mar 31, 2007 at 12:38:19AM +0200, Juerd Waalboer <juerd@convolution.nl> wrote:

codepoints map to the same byte values.

Except they are different byte values :)

I said "unicode encoding", but should have said "unicode codepoints".

Codepoints 0..255 in latin1 map to byte values 0..255. That makes it special.

Yes, and the exact same is true for unicode (both have a 1-1 mapping between 0..255 and octets), trivially, of course, as unicode explicitly is a superset of latin1.


p5pRT commented 17 years ago

From @Juerd

Marc Lehmann wrote 2007-03-31 1:53 (+0200):

So you force people to know about the internal flag, lest they cannot avoid the die.

No, you don't have to know about the UTF8 flag, just that Perl can't always know if your string is a text string, but is there to help you when it does.

Besides that, the "C" in Perl's pack() is documented as a single byte.

"A C "char" is a byte".

Your words.

But here you say a byte is not a character. That's a contradiction.

"C char" ne "Perl character".

No, I asked for UTF-8 encoded characters. Again, read the documentation:

* If the pattern begins with a "U", the resulting string will be treated as UTF-8-encoded Unicode.

Resulting string, not input string.

The word "internally" is missing here. I will do my best to correct that.

That's for pack, unfortunately.

U A Unicode character number. Encodes to UTF-8 internally

Uh, that internal thing again. So how many characters will pack "U", 200 give me? According to the documentation, 2, as UTF-8 requires that.

One character. Note again that "character" isn't the same as a "C char". We in Perl land, and the people over in Unicode land, use different words, sometimes.

Most of the time, a Perl "character" means codepoint.

Right, while the documentation on unpack "U" disagrees with it, as it talks about UTF-8.

That would be a bug, but I can't find it in my copy (5.8.8). It only says "Encodes to UTF-8 internally" for pack(), which as far as I can tell, is true.

So it talks about using UTF-8, so, according to you, it is a bug. Fine with me.

This was for pack, you were talking about unpack. Also, the word "internally" was probably not added without reason.


p5pRT commented 17 years ago

From @Juerd

Tels wrote 2007-03-31 1:46 (+0000):

How about encouraging the use of encoding::warnings in perlunitut?

How about adding it to core and having 'use 5.10;' turn it on?

If I understand correctly, that would not be enough due to the "is this binary or really iso-8859-1 encoded data" problem mentioned above.

You understand correctly :)

Thanks.


p5pRT commented 17 years ago

From nospam-abuse@bloodgate.com


Moin,

On Saturday 31 March 2007 00:04:53 Juerd Waalboer wrote:

Marc Lehmann wrote 2007-03-31 1:33 (+0200):

The difference between us, and that's what it boils down to, is that you give the internal UTF-X bit meaning. You equate UTF-X flag set == Unicode string.

No, that's a unidirectional thing.

I've said it on p5p at least a dozen times, but I'll say it again:

If the UTF8 flag is set, you can be sure that you have a text string. If the UTF8 flag is not set, it can be either a byte string or a text string.

If you have a text string, the UTF8 flag may or may not be set.

So you are basically saying that you can have any string (text or byte) with either the flag set, or not. Er, and how do we find out which combination is which?

I think we all should go to bed and have a nice rest. What you wrote above makes no sense at all to me now anymore.

So, good night for now,

Te"pls don't tell mom it's 2 already"ls


p5pRT commented 17 years ago

From @Juerd

Marc Lehmann wrote 2007-03-31 2:12 (+0200):

Yes, and the exact same is true for unicode (both have a 1-1 mapping between 0..255 and octets), trivially, of course, as unicode explicitly is a superset of latin1.

Unicode is a character set, not a character encoding.

While for 8 bit character sets, the encoding is the same thing, once you get past the 8 bit boundary, the difference begins to matter.

A unicode string is a sequence of codepoints, not octets. They don't map 1:1 to octets either. To express a unicode string in octets, you need to encode it. For this, there are several possibilities, including UTF-8, UTF-16, ...
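To illustrate the distinction between a codepoint and its encodings, here is a short sketch (added here, assuming the core Encode module; the choice of U+20AC is arbitrary) that encodes one character three ways and prints the resulting octets.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One codepoint above the 8-bit range: U+20AC EURO SIGN.
my $text = "\x{20ac}";

my %octets;
for my $enc (qw(UTF-8 UTF-16BE UTF-32BE)) {
    # encode() turns the character string into an octet string.
    $octets{$enc} = encode($enc, $text);
    printf "%-8s %d octets: %s\n", $enc, length $octets{$enc},
        join " ", map { sprintf "%02X", ord } split //, $octets{$enc};
}
```

The character string has length 1 throughout; only the encoded forms differ in size.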

Unicode is a superset of the latin1 character set, not the latin1 character encoding. We'd need bigger bytes for the latter :)

p5pRT commented 17 years ago

From schmorp@schmorp.de

On Sat, Mar 31, 2007 at 12:19:16AM +0000, Tels <nospam-abuse@bloodgate.com> wrote:

Anyway, I wasn't aware that any non-utf8 data in Perl is *always* ISO-8859-1, I thought that, when not specified, this depended on some other stuff. Guess I need to reread the tutorials. :)

Heh, because it's not true :)

However, this also poses the question: How does Perl know that your data is in KOI8-R?

It doesn't. Perl ideally only interprets character indices as unicode codepoints (I am ignoring use locale and similar issues here). So when you want to match your koi8-r data against a regex, you need to decode it first. Perl doesn't know that and will *then* treat your character data as KOI8-R (and afterwards as unicode).
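The "decode it first" step can be sketched as follows. This is an illustration added here, assuming the core Encode module and its "koi8-r" encoding; the sample word and variable names are made up. The KOI8-R octets are produced by encoding a known Cyrillic string, so no byte values have to be taken on trust.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "privet" in Cyrillic, written as codepoints.
my $text = "\x{43f}\x{440}\x{438}\x{432}\x{435}\x{442}";

# Six KOI8-R octets, as they might arrive from a file or socket.
my $koi8 = encode("koi8-r", $text);

# Undecoded, the octets are just characters 0..255 and are not Cyrillic.
my $raw_matches = $koi8 =~ /^\p{Cyrillic}+$/ ? 1 : 0;

# After decoding, the regex sees the intended codepoints.
my $decoded         = decode("koi8-r", $koi8);
my $decoded_matches = $decoded =~ /^\p{Cyrillic}+$/ ? 1 : 0;

print "raw octets match:     $raw_matches\n";
print "decoded string match: $decoded_matches\n";
```

Until the decode call, nothing in the string tells Perl that the data is KOI8-R.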

Unless you force perl to apply unicode interpretations to your characters, they are completely encoding-free.

One of the limitations of the "there can be only two encodings" of Perl seems to be that strings are permanently upgraded:

That's the root of the problem. There aren't two encodings. There is only one: characters concatenated to form strings.

Internally, Perl currently has two forms for that, just as perl can store real integers and doubles in a scalar.

But on the Perl level, "5", "5.0", 5 and utf8-encoded 5 are all the same scalar.

    if ($iso_8859_1 eq $utf8) { ... }

Please correct me if I am wrong, but I do think it is not possible to keep both variables in their current encoding and only temporarily upgrade them to utf8 (for the common encoding that contains both of them)?

It is, but likely not very efficient as in most such cases you actually want utf-x internally. Except for optimisation purposes (where I see downgrade and upgrade as well-warranted), you do not have to care, as perl handles that automatically.

After reading this discussion here, a lot of problems also seem to stem from the fact that the upgrade to utf8 is permanent, silent and done "behind-the-scenes". Just like 1 + 2.0 will result in 3.0 and not 3 and we all know how much confusion this creates :) (heh, I fell for it today, even though I should have known better :)

No, there is no problem in most cases, as the upgrade does not change the scalar in any way (except, again, for speed). Or at least should.

Perl achieves that goal by transparently re-encoding its internal format as required. Re-coding in that way does not change the semantics of the string, except:

- when you hit a bug in perl
- when you use unpack "C".

So in a bug-free perl without unpack, everything just works and you never need to care about whether perl stores the data as UCS-4, UTF-X or octets in memory.

That's the "sane" model introduced with 5.6 and mostly achieved with 5.8.8.

The problems are the remaining bugs AND unpack, the latter of which breaks existing programs that assume unpack "C" has byte semantics, when, in fact, it returns the internal encoding that perl normally hides from you and tells you to ignore.

If those remaining problems were fixed (that includes SvPV), the only difference between utf-x encoding and octet-encoding within perl would be speed, but not semantics.

That's the beauty.

Juerd's goal of having the UTF-X flag exposed and having you think about when perl upgrades and downgrades (and making you avoid the upgrades) is horrible, as it forces a lot of administration on the programmer, a lot of which perl already claims to do, as only in a few cases you have to know your UTF-X flag at the moment.

The same type of string can be used for binary data, because in the unicode encoding "latin1", all 256 codepoints map to the same byte values.

latin1 is not a unicode encoding in the first place.

Also, I find it much more natural to represent bytes as characters 0..255 in perl, as opposed to Juerd's definition of characters 0..255 with the internal UTF-X flag cleared.

I just don't see why the programmer has to learn about that internal flag at all. If he has to, then perl could become much much faster by forcing her to do that all the time, instead of only in unpack or XS cases.

great minds sink alike or so) And since unlike in Perl, upgradings are never done permanently, you can keep your BINARY string and compare it to UTF-8 whatever, and it never gets "corrupted".

In the 5.5 model, nothing ever gets "corrupted", too. That's the beauty of it. Because scalars with the UTF-X flag set behave the same way as scalars not having it set, everything is compatible with each other.

It's only the cases _where_ it makes a difference where this is a problem and in fact stuff gets corrupted.

I am not sure how one could achieve that in Perl. Making the SV read-only?

By fixing the remaining bugs and making the UTF-X flag truly internal, so you do not have to worry about modules corrupting your stuff.

That's what perl does for you in the vast majority of cases already, and it should simply do that all the time, so programmers have their typeless perl that they love again.

In short, it becomes a mess.

Yes, with strong typing, especially with string subtypes for arbitrary encodings, it would be cleaner. But it would also not look like Perl 5.

I beg to differ. Strong typing makes programming hard. Until Perl6 came and destroyed it, the typeless nature of Perl was a feature, not a problem.

Why should perl suddenly introduce types for strings when a single abstract string type works just as wonderfully as the single abstract scalar type works in perl already?

Having strongly typed integers/doubles/utf-8-strings etc. is a step backwards from perl towards Java.

Programmers using Perl do not want to worry about strict typing. They can use C++ or Java anytime for that.

Over the years, I came to the insight that I want to build reliable and fast programs. (easy to maintain, reliable, fast, pick two :-)

So maybe we really need "use strict 'encodings';" :-)

What for, so that your program crashes at runtime instead of degrading to a slower but correct case in case it happens to hit binary data? You surely do not want this, or do you?


p5pRT commented 17 years ago

From schmorp@schmorp.de

On Sat, Mar 31, 2007 at 01:39:06AM +0000, Tels <nospam-abuse@bloodgate.com> wrote:

My question was posed because I wanted to know how to *keep* a KOI8 (or any other random binary) string in Perl without converting it to Unicode. It seems to me this is not easily possible because there are literally dozens of places where your KOI8 string might get suddenly upgraded to UTF-8 (and thus get corrupted because Perl treats it as ISO-8859-1). Or did I get this wrong?

Yes, you did get that wrong, likely because Juerd wants users to care about that. But in fact, if you try it, nothing will get corrupted unless you use unpack "C" to get the first byte of your KOI8-string. Then you might get surprised (current perl) or an exception (Juerd's idea).

In an ideal world, you could either just keep everything in utf-8 (that's too slow for some things and not fool-proof either), or rely on no other code to corrupt your data - especially this random third party module you pulled from CPAN last night. :)

In an ideal world, you would just want to manipulate bytes == characters in Perl, and not care about how it treats them internally. It should treat them as fast as possible, of course.

The same is true for other things in perl: you do not want to care whether your scalar contains an integer, floating point number, or string. Use decides that in perl: if you print an integer scalar, it (also) turns into a string. If you add a floating point number to an integer-only scalar, you get the expected floating point result.

Perl converts between all those "encodings" transparently in a way that makes most sense. And the same thing is true for character data.

There is a small difference, as Perl can have scalars that have both a string and a double value, for example, and can then choose the fastest representation. Perl could just as well keep both a UTF-X encoded as well as an octet-encoded version of a string around to optimise for speed.

Of course, that optimisation would need a lot of memory, so the trade-off chosen in the current implementation is to upgrade/downgrade when needed, transparently, so your KOI8-bytes stay KOI8-bytes all the time.
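The number/string analogy drawn here can be shown in a few lines. This is a minimal sketch added for illustration; the variable names are made up and it demonstrates only the numeric side of the comparison.

```perl
use strict;
use warnings;

my $int = 5;        # created from an integer literal
my $str = "5";      # created from a string literal
my $flt = 5.0;      # created from a floating-point literal

# Use decides the interpretation: numeric comparison treats all three alike.
print "all numerically equal\n" if $int == $str && $str == $flt;

my $sum   = $int + 2.5;           # integer used together with a float
my $label = "value: " . $int;     # integer used as a string

print "$sum\n";
print "$label\n";
```

The program never states which internal representation a scalar holds, which is the property the post argues strings should share.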

It is the few cases where perl doesn't do that I am concerned about.

OMHO the problem arises from the fact that Perl makes no distinction between a byte string like "a" and a text string like "a"\, and furthermore\, manipulating byte string (for instance appending a byte) is done with typical string operators. So:

Yeah. It also makes no difference between numbers and strings. Thats Perl.

\# works if $y is 7bit and no utf8 flag
\# but fails if $y is 7bit with utf8 flag
$byte\_string \.= $y;
As you said\, all is well as long as you can keep these two beasts seperate\, but the slightest problem might mangle your data. Such as a decode_utf8 setting the UTF8 bit on a 7bit ASCII string\, therefore changing the 7bit byte string to a text string.

No, only in Juerd's model, where binary data encoded in UTF-X is a bug. In real-world Perl, that just works fine, and that's what I expect, and that's, I think, what users expect, too: not having to deal with the internal types.
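A small sketch of the concatenation case, assuming `utf8::upgrade` stands in for whatever set the flag on the 7-bit string (decode_utf8 in the original report):

```perl
use strict;
use warnings;

my $byte_string = "\xff\xfe";   # two octets, no UTF8 flag
my $y = "abc";                  # 7-bit ASCII
utf8::upgrade($y);              # same characters, flag now set

$byte_string .= $y;             # internal representation may change here

# At the Perl level the string still holds the same five values.
my @values = map { ord } split //, $byte_string;
print join(",", @values), "\n";

# A true return value means every character still fits into one octet.
my $ok = utf8::downgrade($byte_string, 1);
```

The values seen through `ord`, `length` and `eq` are the same whichever way the string is stored; only code that looks at the internal buffer can tell the difference.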

In the same way\, you do not have a module that converts numbers to strings\, you just print them:

my $x = 5; print $x;

Again, Perl transparently handles the details (which includes(!) character encoding for the outside world!).

As you said\, the current warnings::encode can't decide between the case of "BINARY + UTF_8" and "ISO-8859-1 + UTF_8" as Perl makes no distinction between binary data and ISO-8859-1. And this missing distinction is certainly a bother :)

Only when you hit bugs\, or unpack.

Greetings\,

-- The choice of a -----==- _GNU_ ----==-- _ generation Marc Lehmann ---==---(_)__ __ ____ __ pcg@goof.com --==---/ / _ \/ // /\ \/ / http://schmorp.de/ -=====/_/_//_/\_\,_/ /_/\_\ XX11-RIPE

p5pRT commented 17 years ago

From schmorp@schmorp.de

On Sat, Mar 31, 2007 at 02:16:49AM +0200, Juerd Waalboer <juerd@convolution.nl> wrote:

Marc Lehmann skribis 2007-03-31 2:12 (+0200):

Yes, and the exact same is true for unicode (both have a 1-1 mapping between 0..255 and octets), trivially, of course, as unicode explicitly is a superset of latin1.

Unicode is a character set\, not a character encoding.

As is latin1.

A unicode string is a sequence of codepoints, not octets.

Nope. You can encode unicode codepoints into UTF-8 and still end up with a unicode string. Encoding doesn't change the fact that it is unicode that you are storing.

Since it seems hard to grasp, here is an example:

    my $s = "Hello, World!"; $s = Encode::encode_utf8 $s;

$s contains the famous greeting before and after the encoding. It is still an ASCII string, an iso-8859-15 string, a unicode string, and a text string. Regardless of whether it is encoded or not, the string still contains the message "Hello, World!".

If you drop ASCII, the same is true for "Hallöchen!", which looks different in UTF-8 than in an unencoded string, but it is still the same message. And it is still using unicode to represent the characters.
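Both cases can be checked with the core Encode module. This is a sketch; the escape `\x{f6}` is used for the o-umlaut so the example does not depend on the encoding of the source file:

```perl
use strict;
use warnings;
use Encode ();

my $ascii      = "Hello, World!";
my $encoded    = Encode::encode_utf8($ascii);
my $ascii_same = $encoded eq $ascii;      # ASCII text is unchanged

my $text   = "Hall\x{f6}chen!";           # \x{f6} is o-umlaut
my $octets = Encode::encode_utf8($text);
printf "%d characters, %d octets\n", length($text), length($octets);

# Decoding gives back the same message.
my $round_trip = Encode::decode_utf8($octets) eq $text;
```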

The fact that you encode something does not change the something that you encode. Making an arbitrary difference only confuses the issue.

They don't map 1:1 to octets either. To express a unicode string in octets, you need to encode it. For this, there are several possibilities, including UTF-8, UTF-16, ...

Sure. Octets are just things that store numbers between 0 and 255. The most compact way to do that in Perl is using a string. That's also the most natural way to represent bytes in Perl, closely followed by integers for single bytes.

You do not store octets in latin1, or unicode, or whatever else in that string. You are just using the most natural way to represent octets. And that just happens to work, because Perl was designed to work that way.

The mapping between Perl bytes and octets is 1:1. ord and chr do it for you, for example, and unpack "n" does it for you in case you encode/decode two-byte entities. unpack "C", however, does not map to octets in Perl. That's the bug.

Unicode is a superset of the latin1 character set\, not the latin1 character encoding. We'd need bigger bytes for the latter :)

Right. And Perl has those bigger bytes.

p5pRT commented 17 years ago

From @Juerd

Tels skribis 2007-03-31 1:39 (+0000):

My question was posed because I wanted to know how to *keep* a KOI8 (or any other random binary) string in Perl without converting it to Unicode. It seems to me this is not easily possible because there are literally dozens of places where your KOI8 string might suddenly get upgraded to UTF-8 (and thus get corrupted because Perl treats it as ISO-8859-1). Or did I get this wrong?

A koi8r string is a byte string. If you keep it separated from text strings properly\, it should not be upgraded and thus treated as latin1. I'm very curious as to "sudden upgrades" that aren't related to mixing with text strings. Should you encounter them\, please let me know.

Indeed\, some functions and operations will not work properly on koi8r\, with regards to character properties. For example\, the regex engine has no idea which characters are word characters\, and which are cyrillic. It can only assume it's either ascii or latin1. For full functionality\, you must decode the string.
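A sketch of the decode step, using the core Encode module. The six octets are an arbitrary sample of KOI8-R letters chosen for illustration:

```perl
use strict;
use warnings;
use Encode ();

# Six KOI8-R octets; as a byte string Perl knows nothing about their meaning.
my $koi8r_bytes = "\xd0\xd2\xc9\xd7\xc5\xd4";

# Decoding turns them into a text string of Unicode code points.
my $text = Encode::decode('koi8-r', $koi8r_bytes);

# Character properties now work as expected.
my $cyrillic = () = $text =~ /\p{Cyrillic}/g;
printf "%d of %d characters are Cyrillic\n", $cyrillic, length $text;

# Encode again before handing the data back to the outside world.
my $out = Encode::encode('koi8-r', $text);
```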

If your program is just a gateway in between other things\, and doesn't do any text processing\, just keep the thing a byte string.

Just like $jpeg_image is a byte string that contains JPEG data\, and this can be safely used\, $koi8r_string can be a byte string that contains koi8r text data.

especially this random third party module you pulled from CPAN last night. :)

Well\, yes\, modules sometimes have bugs. That's something we have to learn to live with.

As you said, all is well as long as you can keep these two beasts separate, but the slightest problem might mangle your data.

That is true. Programming can be a delicate job. Has always been like that :)

Hm\, maybe one could write a module that always tackles the encoding to an SV via magic. (...) so that if you ever try to fuse two strings together where one of them is tagged binary\, you get an exception (but only then!).

That would be neat. You'd effectively have strong typing. I don't think you can do this in a module\, though. It requires checks all over the place. Maybe Scott Walters' typesafety module can be of help or inspiration: http://search.cpan.org/~swalters/typesafety-0.05/

Yeah\, I am not a genius :/ (Sometimes I wish I could upgrade my brain :)

But then\, it would be much slower! ;)

Codepoints 0..256 in latin1 map to byte values 0..256. That makes it special. Erm, I don't buy this because: Codepoints 0..256 in KOI8-R (to pick one) map to byte values 0..256. That would make it special, too.

I should have said "unicode codepoints 0..255 in latin1 map ...".

The interesting thing about latin1 is that 0..255 overlap with unicode. The 0..255 (not 256 btw\, silly mistake) in koi8-r can all be found in unicode somewhere\, but they're not all in exactly the same places. -- korajn salutojn\,

juerd waalboer: perl hacker \juerd@juerd\.nl \<http://juerd.nl/sig> convolution: ict solutions and consultancy \sales@convolution\.nl

Ik vertrouw stemcomputers niet. Zie \<http://www.wijvertrouwenstemcomputersniet.nl/>.

p5pRT commented 17 years ago

From @Juerd

Marc Lehmann skribis 2007-03-31 2:20 (+0200):

But in fact\, if you try it\, nothing will get corrupted unless you use unpack "C" to get the first byte of your KOI8-string. Then you might get surprised (current perl) or an exception (Juerd's idea).

Sigh...

It'll work just fine if the string is still a byte string\, which it will be if you cared to keep it separated. -- korajn salutojn\,

p5pRT commented 17 years ago

From schmorp@schmorp.de

On Sat, Mar 31, 2007 at 02:04:53AM +0200, Juerd Waalboer <juerd@convolution.nl> wrote:

I've said it on p5p at least a dozen times\, but I'll say it again:

If the UTF8 flag is set\, you can be sure that you have a text string.

Repeating wrong statements does not make them true.

If you have a text string, the UTF8 flag may or may not be set. If you have a byte string, the UTF8 flag is not set (or it was set because you treated the byte string as a text string).

No\, please look at my example of JSON.

The problem with your approach is that you have to expose the UTF-X flag to users. Which comes with a lot of problems.

Again: you're kidding\, right?

I'm constantly very explicitly and verbosely telling people to NOT look at the flag\, NOT set it manually\, etcetera.

So why do you propose that people have to make sure that they never put a binary string with the UTF-X flag set into unpack?

How are users supposed to do that, unless they know about the flag in the first place?

No\, I am not kidding. You are part of the crowd who wants to expose the UTF-X flag to the perl level\, despite your claims that you do not want to.

Heck\, I've even explained that I think you should try to (pretend to) be ignorant about the internals\, in response to your message even!

Right, and then you want Perl functions to die depending on the setting of that flag, even though you also claim Perl users should not need to know about it.

So you tell users when they get that error message that they did something wrong that they should not care about?

No\, I am certainly not kidding.

I do not understand how you are able to misinterpret this message even after this many posts in this thread alone. Have you ever read perlunitut\, even?

As I said\, I have no such manpage\, and even if I had\, it has nothing to do with this. I am not misinterpreting your message at all.

You want Perl functions to behave differently depending on whether that flag is set or not. I want Perl functions to behave the same, regardless.

You expose the UTF-X flag that way. I don't.

You *are* contradicting yourself, but that has nothing to do with whether I read that document or not. That alone is your problem.

Either you do expose the UTF-X flag by making Perl functions behave differently, or you don't.

No amount of claiming you do not want to expose it can fix that: you do, whether you want to or not, if you change Perl semantics to make a difference.

That's not what I said\, nor what I meant. In fact\, quite the opposite.

So then unpack should not croak when it sees the UTF-X flag?

If you're just spending this evening just to get on my nerves\, then congratulations!

No\, I am trying to make you understand the typeless nature of Perl\, and that your proposals expose the UTF-X flag\, no matter what you *want*.

You could just understand that for a change\, then maybe you wouldn't need to accuse me of just trying to get on your nerves.

I do understand that you said you do not want to expose that flag. But as long as the changes you propose do that\, it is being exposed.

I am sorry that I can't say it any clearer.

Because "internal format" strings can store binary data just as well, and often do.

Yes\, and when you use such a byte string as a text string\, its bytes are considered to be codepoints\, just like in latin1.

Yeah\, sure. Mind you: no mention of UTF-X.

I am talking purely about the perl level strings. If perlunitut confused the issue by talking about internal encoding it completely failed its mission\, imho.

I strongly suggest that you READ the document before whining about its supposed failure.

Well\, I trust that you don't misquote its contents. Did you?

The problem is that some parts of Perl make a difference between the very same string, depending on how it is encoded internally, _even if the encoding is the same on the Perl level_.

Those are bugs. Report them, and they might get fixed.

I did. That's the whole point of this thread. I reported them a number of times. How could you miss that?

utf8::encode is a text operation. It will assume that whatever you give it\, is a text string. Its characters are considered Unicode codepoints. Where does it say so?

Well\, you have already denied that "encoding is going from characters to bytes" is a real world fact\, so I guess there's little point in pointing out the places where exactly the same thing is explained.

If it is wrong, it's wrong. No matter how often you try to explain it. People do store octets in UTF-8. Even Perl extends UTF-8 to UTF-X to make interesting usages possible. So yes, if that's broken, then Perl is already broken, fundamentally, by allowing non-unicode-codepoints in strings.

Choose one of two: your claims are wrong, or Perl is wrong. Either way suits me, although I personally think the current model makes much more sense than your user-has-to-care-for-UTF-X-flag-explicitly model.

you need to know some internals. Wrong. I need know no internals

A certain Marc Lehmann once said:

"I would love if that were the case, but the powers that be decided that every Perl programmer has to know those internals, and needs to be able to deal with them."

Yes. Any problems with that?

As you like to quote with misleading context\, let me add that the context was unpack and perl modules using it or XS\, not utf8::encode.

You make a classical logical fallacy: just because some parts of Perl do not force you to know internals this does not mean that all of Perl does not force you.

That makes no sense\, because UTF-8 is a means of representing characters. Byte strings consist of bytes\, not characters. Not in C\, which is what the documentation constantly refers to\, mind you.

And that is bad\, I agree. Perl programmers should not be expected to speak C in order to understand Perl documentation. This is a big problem in Perl's documentation\, but who's going to fix it?

I do not suffer from it. I just want sane behaviour in Perl, which doesn't force me to think about whether my UTF-X flag could be set and my program could die because of that, but where I get the correct and expected results.

p5pRT commented 17 years ago

From schmorp@schmorp.de

On Sat, Mar 31, 2007 at 02:33:55AM +0200, Juerd Waalboer <juerd@convolution.nl> wrote:

places where your KOI8 string might get suddenly upgraded to UTF-8 (and thus get corrupted because Perl treats it as ISO-8859-1). Or did I get this wrong?

A koi8r string is a byte string. If you keep it separated from text

Your definition is completely useless in the real world. Obviously, a KOI8-R string is a text string. It contains text characters. End of story.

Just like $jpeg_image is a byte string that contains JPEG data\, and this

And it is actually an octet string (it makes no difference to C, but it does make a difference in current Perls, or on the wire).

I will not reply to your mails anymore, as you made your point quite clear to me: you want behaviour to change depending on the UTF-X flag, but you do not want the programmer to know about that. You also have very weird ideas of what programmers should and should not do that defy reality. I find all that contradictory, but as you ignore the evidence I presented and the question I asked you (JSON::XS example), I see no point in continuing talking to you.

(Note: this is not a frustrated *plonk*. I don't hate you, I just think it is pointless to argue about contradictory statements, and I think you are mildly abusive, too, in assuming you know everything and therefore ignoring inconvenient questions. Feels too much like a waste of time).

I also might stay out of this discussion, as I think I made my points clear. If Perl wants to stay broken w.r.t. Unicode abstraction, it is not my fault. I tried very hard over the last years to report bugs, and so far, all of my bug reports w.r.t. unicode were right, so I just assume I am not misinformed about how things should work.

Be good, be well!

p5pRT commented 17 years ago

From schmorp@schmorp.de

Ok, last mail, because this is a different topic :)

On Sat, Mar 31, 2007 at 01:08:21AM +0200, Juerd Waalboer <juerd@convolution.nl> wrote:

Marc Lehmann skribis 2007-03-31 0:25 (+0200):

If you send a compressed string over the network using JSON and decompress it\, you need to know that.

Does JSON compress arbitrary data?

no.

If so, then the user must do the decoding and encoding,

No, compression is something completely orthogonal to encoding. Neither forces me to do the other.

because arbitrary data only exists in byte form

That seems completely wrong to me.

Once you dictate any specific encoding\, it's no longer arbitrary.

JSON dictates unicode for the JSON text\, and strongly hints at the use of UTF-8 for interchange purposes.

On the other hand, if JSON does text data only,

No, it does support binary data just as well. It is used a lot, too.

It works just like perl without the bugs: You have a string type that can store bytes. It is up to the user to interpret them as she wants.

it can just use any UTF encoding on both sides\, and document it like that.

It is a bit complicated\, but you can safely assume that 99% of all JSON is UTF-8 encoded. In fact\, you can recode all JSON documents into ASCII\, too. JSON::XS offers that\, and JSON::XS by default encodes to/decodes from UTF-8\, but allows the user to decode/encode himself. JSON text is composed of unicode characters\, and in Perl some JSON modules store them as a simple Perl string.

All that is not well-supported by most JSON modules\, though\, for example JSON::XS is the only module for perl that correctly decodes escaped surrogate pairs.

Unless both sides are exactly the same platform (e.g. both Perl)\, you need to establish a protocol for sending data anyway. And that protocol should also describe encoding. If sender and receiver don't agree\, you have a problem.

No, it doesn't have anything to do with the platform. Even when both sides use Perl I need to decide on a common encoding. That's strictly outside the JSON definition, though.

I am really frustrated at that. It makes perl as a whole rather questionable for unicode use\, as you constantly have to think about the internals. And yes\, that simply shouldn't be the case.

I maintain that it isn't the case\, for almost any programming job\, unless you're indeed doing things with internals.

Well\, the JSON::XS module certainly does things with the internals\, it has to flag some strings as UTF-X\, and in fact flags all strings that way unless you enable the shrink option\, which is documented to try to shrink the memory used in various ways (one way is to try to downgrade the scalar).

Certainly, the user who reported the bug also didn't look at the internals. Compress::Zlib called unpack "CCCV" or some such, though, which unfortunately treats V very differently from C: "C" looks at the internals, while "V" does not and treats the string as an octet string.

The user suggested that JSON::XS corrupts binary data because it happens to be returned upgraded unless you set the shrink option.

However, Perl does not expose the internals elsewhere: the upgraded version is semantically equivalent to the downgraded one unless you use an XS module using SvPV directly or indirectly (considered a bug in Perl, if I understood Nick correctly), or unpack "C", which has a different meaning in perl 5.6 than in perl 5.005 and has confusing documentation.

The right thing for Compress::Zlib is not to use unpack "CCCV" but unpack "UUUV", which seems completely weird to me, as no unicode was ever involved *on the perl level*.

p5pRT commented 17 years ago

From @Juerd

Marc Lehmann skribis 2007-03-31 2:42 (+0200):

Repeating wrong statements does not make them true.

I'll refrain from the obvious response.

No\, please look at my example of JSON.

JSON is pretty big to just quickly examine. I have nothing set up for testing it.

I'm constantly very explicitly and verbosely telling people to NOT look at the flag\, NOT set it manually\, etcetera. So why do you propose that people have to make sure that they never put a binary string with the UTF-X flag set into unpack?

Not unpack in general\, but unpack "C".

Because "C" is explicitly catered for byte data\, which strings with the UTF8 flag aren't. It won't always catch mistakes\, because indeed lack of the flag says nothing\, but it can help catch some of them.

Perl already has a similar warning in many places\, for example when you print such a "wide character" on a filehandle that has no encoding or utf8 layer. Some modules\, like MIME::Base64\, provide the same functionality.
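The existing warning can be observed without touching STDOUT by printing to in-memory filehandles. A sketch, with a `__WARN__` handler collecting the messages so they can be counted:

```perl
use strict;
use warnings;

my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };

# A filehandle with no encoding layer, writing into a scalar.
my $raw = '';
open my $fh, '>', \$raw or die "open: $!";

my $narrow = "foo\x{ff}";
utf8::upgrade($narrow);        # flag set, but every character fits an octet
print {$fh} $narrow;           # no warning
print {$fh} "foo\x{20ac}";     # U+20AC does not fit: warns
close $fh;

# The same wide character through an explicit encoding layer.
my $encoded = '';
open my $out, '>:encoding(UTF-8)', \$encoded or die "open: $!";
print {$out} "foo\x{20ac}";    # no warning
close $out;

print scalar(@warnings), " warning(s)\n";
```

The upgraded but narrow string is written as four octets without complaint; only the character above 255 triggers the message.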

How are users supposed to do that, unless they know about the flag in the first place?

By keeping byte strings and text string separate. Please either accept this\, or stop asking me questions that will lead to this answer.

Right, and then you want Perl functions to die depending on the setting of that flag, even though you also claim Perl users should not need to know about it.

The warning would not be a new feature, but an existing feature applied in more places. "die" is probably too harsh indeed.

So you tell users when they get that error message that they did something wrong that they should not care about?

When they get the error message\, they can read the following in perldiag:

Wide character in %s (W utf8) Perl met a wide character (>255) when it wasn't expecting one. This warning is by default on for I/O (like print). The easiest way to quiet this warning is simply to add the ":utf8" layer to the output, e.g. "binmode STDOUT, ':utf8'". Another way to turn off the warning is to add "no warnings 'utf8';" but that is often closer to cheating. In general, you are supposed to explicitly mark the filehandle with an encoding, see open and "binmode" in perlfunc.

Changing the order of these sentences is on my to-do list.

Note how this clear explanation doesn't mention the UTF8 flag!

As I said\, I have no such manpage

See bleadperl or Google.

You want Perl functions to behave differently depending on whether that flag is set or not. I want Perl functions to behave the same, regardless.

I want Perl to warn about certain mistakes when it can.

That's not what I said\, nor what I meant. In fact\, quite the opposite. So then unpack should not croak when it sees the UTF-X flag?

No\, it should warn instead. From now on\, I no longer think it should die. It should warn\, and people who want it to die can do so with "use warnings FATAL".
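For the existing print warning, the opt-in looks roughly like this. A sketch, assuming only the `utf8` warning category is made fatal:

```perl
use strict;
use warnings FATAL => 'utf8';

my $buffer = '';
open my $fh, '>', \$buffer or die "open: $!";

# With the utf8 category made fatal, the wide-character warning is raised
# as an exception that eval can catch.
my $died    = !eval { print {$fh} "\x{20ac}"; 1 };
my $message = $died ? $@ : '';
close $fh;

print $died ? "died: $message" : "no exception\n";
```

Code that does not ask for `FATAL` keeps getting a plain warning, so the choice stays with the caller.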

The problem is that some parts of Perl make a difference between the very same string, depending on how it is encoded internally, _even if the encoding is the same on the Perl level_. Those are bugs. Report them, and they might get fixed. I did. That's the whole point of this thread. I reported them a number of times. How could you miss that?

I don't usually read bug reports\, and never claimed to have done so.

But in this special case\, I will make an exception\, and read the Unicode related bug reports that you have submitted. -- korajn salutojn\,

p5pRT commented 17 years ago

From schmorp@schmorp.de

Oh\, maybe I know the reason for the confusion.

I do talk about the *Perl* level, while you often talk about the *implementation*. When I say byte or octet string below, I mean on the Perl level. For example, on the Perl level, upgrading a string does not change its semantics anywhere except w.r.t. bugs and unpack: It still stays an octet string if it was an octet string before.

(That's of course all in line with me not wanting to expose the UTF-X flag).

p5pRT commented 17 years ago

From @Juerd

Marc Lehmann skribis 2007-03-31 2:29 (+0200):

Unicode is a character set\, not a character encoding. As is latin1.

For all intents and purposes\, latin1 is a character encoding as well as a character set. If not officially\, then certainly for Perl. It can be used with the :encoding layer\, with Encode'decode\, etcetera. "Unicode" cannot.
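A sketch of that usage with the core Encode module, contrasting latin1 with KOI8-R. The single KOI8-R octet 0xC1 is just a sample:

```perl
use strict;
use warnings;
use Encode ();

# All 256 possible octets.
my $octets = join '', map { chr } 0 .. 255;

# Decoding as latin1 yields code points equal to the octet values...
my $latin1   = Encode::decode('iso-8859-1', $octets);
my $identity = $latin1 eq $octets;

# ...while KOI8-R places its upper half elsewhere in Unicode.
my $koi8r = Encode::decode('koi8-r', "\xc1");
printf "latin1 identity: %s, KOI8-R 0xC1 -> U+%04X\n",
    ($identity ? "yes" : "no"), ord $koi8r;
```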

I don't know where your terminology comes from\, but I try to stick to whatever is common in Perl land. Sorry if that differs from other communities.

Unicode is a superset of the latin1 character set\, not the latin1 character encoding. We'd need bigger bytes for the latter :) Right. And Perl has those bigger bytes.

A byte, in Perl jargon at least, is an octet. An octet can hold any single value in the range 0..255, and is exactly 8 bits in size. Every byte is exactly as large as any other byte. -- korajn salutojn,

p5pRT commented 17 years ago

From @Juerd

Marc Lehmann skribis 2007-03-31 2:48 (+0200):

A koi8r string is a byte string. If you keep it separated from text Your definition is completely useless in the real world. Obviously, a KOI8-R string is a text string. It contains text characters. End of story.

This is a logical thing to say\, but unfortunately not very useful.

The distinction between a text string\, and a byte string representing text\, is actually useful.

You also have very weird ideas of what programmers should and should not do that defy reality.

Weird ideas\, maybe\, but at least weird ideas that help dozens of people write working and maintainable code.

You don't believe in my weird ideas\, fine. But I find it very interesting that you run into all these problems with Perl's unicode support\, while the people who stick to my weird ideas write lots of code without that.

I find all that contradictory\, but as you ignore the evidence I presented and the question I asked you (JSON::XS example)\, I see no point in continuing talking to you.

Unfortunately\, I understand very little of the JSON example. I don't know JSON and would have to learn about it first. -- korajn salutojn\,

p5pRT commented 17 years ago

From @Juerd

Marc Lehmann skribis 2007-03-31 3:05 (+0200):

Oh\, maybe I know the reason for the confusion. I do talk about the *Perl* level\, while you often talk about the *implementation*. When I say byte or octet string below\, I mean on the Perl level.

This is not the reason for confusion\, because I also discuss the Perl level. For my terminology\, I use whatever is common in the Perl reference documentation.

For example, on the Perl level, upgrading a string does not change its semantics anywhere except w.r.t. bugs and unpack: It still stays an octet string if it was an octet string before.

s/octet string/character string/ and you're entirely right. "Octets" are a bit harder\, because of the definition of an octet:

octet

\<jargon\, networking> Eight bits. This term is used in networking\, in preference to byte\, because some systems use the term "byte" for things that are not 8 bits long.

There's no easy way to fit numbers greater than 255 into 8 bits without sacrificing support for 0 thru 255 inclusive. It may even be impossible. Who knows. The person who invents a way of storing more than 255 distinct numbers in unique single octets\, will probably get famous very quickly :) -- korajn salutojn\,

p5pRT commented 17 years ago

From @Juerd

Juerd Waalboer skribis 2007-03-30 21:53 (+0200):

Personally\, I think that unpack with a byte-specific signature should die\, or at least warn\, when its operand has the UTF8 flag set.

I've since this post changed my mind\, and think it should only warn if there are wide characters after attempting to downgrade first. Just like the existing "wide character in %s" warning.

    juerd@lanova:~$ perl -wle'$a = "foo\x{ff}"; utf8::upgrade($a); print $a' | hexdump -C
    00000000  66 6f 6f ff 0a                                    |foo..|
    00000005
    juerd@lanova:~$ perl -wle'$a = "foo\x{20ac}"; utf8::upgrade($a); print $a' | hexdump -C
    Wide character in print at -e line 1.
    00000000  66 6f 6f e2 82 ac 0a                              |foo....|
    00000007

-- korajn salutojn,

p5pRT commented 17 years ago

From schmorp@schmorp.de

On Sat, Mar 31, 2007 at 03:53:25AM +0200, Juerd Waalboer <juerd@convolution.nl> wrote:

Juerd Waalboer skribis 2007-03-30 21:53 (+0200):

Personally\, I think that unpack with a byte-specific signature should die\, or at least warn\, when its operand has the UTF8 flag set.

I've since this post changed my mind\, and think it should only warn if

We are making progress, and I would actually be content with that solution, but it does break "U". The solution, really, is to treat "C" like an octet in the same way "n" is treated like two octets. That does not break existing code and is what many Perl programmers find natural.

Since so many people are confused about why the unpack change breaks code, I will explain it differently:

    my $k = "\x10\x00"; die unpack "n", $k;

this gives me 4096. "n" is documented to take exactly 16 bits\, two octets.

I get 4096 regardless of how perl chooses to represent it internally: If perl goes to using UCS-4 (something that won't happen for sure\, but has been stated before to remind people that internal encoding can change)\, it would still work.

Same thing for "L"\, which is documented to be exactly 32 bit.
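The claim can be checked directly. In this sketch `utf8::upgrade` is used to change only the internal representation, and "N" (32-bit big-endian) stands in for "L" so the result does not depend on the machine's byte order:

```perl
use strict;
use warnings;

my $k     = "\x10\x00";
my $plain = unpack "n", $k;          # 0x1000

my $upgraded = $k;
utf8::upgrade($upgraded);            # same string, different internal form
my $after = unpack "n", $upgraded;

my $long = unpack "N", "\x00\x01\x00\x00";   # 32 bits, big-endian

print "$plain $after $long\n";
```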

Now\, when people want an 8 bit value followed by a 16 bit big endian value\, they used "Cn" in the old times. In fact\, they still use that\, as "C" always has been the octet companion to the 16 bit and 32 bit sSlLnNvV etc.

However\, in a weird stroke\, somebody decided that "C" no longer gives you a single octet of your string\, but\, depending on internal encoding\, depending on an internal flag\, part of that octet or the octet.

Now\, what has been unpack "CCV" in perl 5.005 must be written as unpack "UUV" in perl 5.8\, as "U" has the right semantics for decoding a single octet out of a binary string.

That's weird, because now code that _doesn't_ want to deal with unicode at all, but in fact only deals with binary data, must use this unicode thingy "U", even though the documentation for "C" clearly says it's an octet, and even says it's an octet in C, which is exactly what those people decoding structures or network packets want.

That is the problem.

Now, I don't mind at all if I get a die when trying "C" on a byte=character that is >255 (i.e. not representable as an octet). Or a die when attempting that on a two byte=character string with "n".

I personally dislike the warning, because the warning only ever comes up when there is a bug. It doesn't matter much to me personally, though.

What matters to me is that binary-only code now needs to use "U" where formerly "C" was meant to get correct behaviour. This *needs* to be fixed.
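Until then, one way for module code to sidestep the question is to downgrade a copy of the data before unpacking it. This is only a sketch: `unpack_octets` is a hypothetical helper, not part of Perl or Compress::Zlib, and the header bytes are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: unpack binary data regardless of how $data is
# stored internally, by downgrading a private copy first.
sub unpack_octets {
    my ($template, $data) = @_;      # $data is a copy; the caller's string is untouched
    utf8::downgrade($data, 1)
        or die "unpack_octets: string contains characters above 255\n";
    return unpack $template, $data;
}

my $header   = "\x1f\x8b\x08" . pack("V", 0x12345678);
my $upgraded = $header;
utf8::upgrade($upgraded);            # e.g. as returned by some other module

my @from_plain    = unpack_octets("CCCV", $header);
my @from_upgraded = unpack_octets("CCCV", $upgraded);
print "@from_plain\n@from_upgraded\n";
```

Both calls return the same four values, and a string with a character above 255 is rejected with an exception, which matches the die described above.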

p5pRT commented 17 years ago

From schmorp@schmorp.de

On Sat\, Mar 31\, 2007 at 03:21:06AM +0200\, Juerd Waalboer \juerd@convolution\.nl wrote:

For example\, on the Perl level\, upgrading a string does not change its semantics anywhere except w.r.t. to bugs and unpack: It still stays an octet string if it was an octet string before.

s/octet string/character string/ and you're entirely right. "Octets" are a bit harder\, because of the definition of an octet:

Please stop correcting completely correct statements. I am entirely right when I talk about octet strings above. It is a trivial fact. It is sad when you think it isn't entirely correct\, but that doesn't give you the right to your current behaviour.

    \<jargon\, networking> Eight bits\. This term is used in

And stop lecturing me about basic stuff. I quite well know what an octet is\, and I am quite certain when I chose "octet" over "byte" or "character".

I meant "octet string" above\, and my statement is entirely correct with "octet string".

If you think you need to correct me\, please state why the above isn't entirely right in its original form. In fact\, I now assume you are very confused about that byte/octet/character stuff if you cannot even understand the correctness of simple facts like the sentence on top of this mail.

And I am fed up with your ridicule and belittlement. I am not impressed by people who make empty claims and miscorrect completely correct statements because they have difficulties understanding them. If you want to be taken seriously\, try it with logic (which you find not useful at times)\, not abusive behaviour (which certainly isn't useful at any time). Certainly I am not impressed by illogical arguments\, or even non-arguments such as personal coding style preferences\, which I happily tolerate as opinions\, but which should never be presented as the only true way without sound arguments.

-- The choice of a GNU generation. Marc Lehmann (pcg@goof.com\, XX11-RIPE) http://schmorp.de/

p5pRT commented 17 years ago

From schmorp@schmorp.de

On Sat\, Mar 31\, 2007 at 03:15:50AM +0200\, Juerd Waalboer \juerd@convolution\.nl wrote:

Marc Lehmann skribis 2007-03-31 2:48 (+0200):

A koi8r string is a byte string. If you keep it separated from text

Your definition is completely useless in the real world. Obviously\, a KOI8-R string is a text string. It contains text characters. End of story.

This is a logical thing to say\, but unfortunately not very useful.

Thanks\, I'll take logical over subjective opinions any day.

The distinction between a text string\, and a byte string representing text\, is actually useful.

It is useful\, but making it mandatory is stupid\, because you lose the ability to handle real-world situations\, for example JSON\, which simply does not make the distinction. The same is true for Perl\, which also does not make the distinction.

You also have very weird ideas of what programmers should and should not do that defy reality.

Weird ideas\, maybe\, but at least weird ideas that help dozens of people write working and maintainable code.

Likely\, but it's still your personal opinion\, your personal coding style. Forcing that on everybody else by calling everything that doesn't fit (such as JSON) "broken" does not convince _me_ that it is a good coding style.

You don't believe in my weird ideas\, fine. But I find it very interesting that you run into all these problems with Perl's unicode support\, while the people who stick to my weird ideas write lots of code without that.

Goddamnit\, I more than once told you that I am not running into those problems because I know most perl bugs regarding unicode inside and out. I am doing unicode programming for far longer than Perl easily supports it\, and I would be grateful if you would stop bullshitting me and spreading lies.

I *explicitly* said that it is other users who hit problems\, and that I can cope with them quite well.

I find all that contradictory\, but as you ignore the evidence I presented and the question I asked you (JSON::XS example)\, I see no point in continuing talking to you.

Unfortunately\, I understand very little of the JSON example. I don't know JSON and would have to learn about it first.

Well\, it's one of those reality things where your coding style blankly breaks down: JSON makes no difference between binary and text\, except that binary only uses character indices 0..255. You do not know whether a JSON string is binary or text. Usage decides.

One such usage is unpack\, and I find it weird that I have to use "U" to get binary semantics in unpack. Or you have to downgrade explicitly.
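
To illustrate with a sketch: the core module JSON::PP stands in here for JSON::XS (an assumption, made so the example runs without extra installs), and the two-character payload is invented. The decoder cannot tell whether the string is text or binary; the caller decides by how the value is used. The explicit `utf8::downgrade` is the "downgrade explicitly" option mentioned above.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# A JSON string whose two characters are U+0010 and U+0000.
my $data    = decode_json('["\u0010\u0000"]');
my $payload = $data->[0];

# The caller knows this is binary, so normalise it to octets first.
# utf8::downgrade dies if any character is above 255.
utf8::downgrade($payload);

my $value = unpack "n", $payload;
print length($payload), " $value\n";   # 2 4096
```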

Anyways\, that clashes with your notion that the programmer made a bug when binary data happens to be UTF-X encoded internally. Reality hits\, you lose\, simply because calling usage of JSON broken according to your coding standards will not have any effect on JSON.

And the way JSON handles binary is extremely common in the real world. And it is exactly how perl handles it\, modulo bugs and\, well\, unpack (and the unfortunate decision to give old XS code sometimes bytes encoded in UTF-X\, sometimes not).

Perl simply does _not_ work like you want it to. Instead\, it is much simpler because in the majority of cases it just works without having to track whether my binary string came in contact with something that upgraded it. I simply do not have to care in Perl\, except for the cases above.

And that's the good thing. Teaching people to avoid upgrading by your text vs. binary string technique is confusing. It is backwards. People should not have the need to be concerned about upgrading\, because it is an internal thing.

And yes\, I said I would not answer you\, but what prompted it was your continuous abusive behaviour of putting words into my mouth I have *explicitly* said to not have said\, and explained it in detail.

-- The choice of a GNU generation. Marc Lehmann (pcg@goof.com\, XX11-RIPE) http://schmorp.de/

p5pRT commented 17 years ago

From schmorp@schmorp.de

On Sat\, Mar 31\, 2007 at 03:03:21AM +0200\, Juerd Waalboer \juerd@convolution\.nl wrote:

JSON is pretty big to just quickly examine. I have nothing set up for testing it.

Not my problem. Your coding style cannot handle it\, though\, so in your own interest you should try to examine it some day.

I'm constantly very explicitly and verbosely telling people to NOT look at the flag\, NOT set it manually\, etcetera. So why do you propose that people have to make sure that they never put a binary string with the UTF-X flag set into unpack?

Not unpack in general\, but unpack "C".

Because "C" is explicitly catered for byte data\, which strings with the UTF8 flag aren't.

Well\, you are not talking of Perl here.

It won't always catch mistakes\, because indeed lack of the flag says nothing\, but it can help catch some of them.

Having the flag means nothing\, either.

Perl already has a similar warning in many places\, for example when you print such a "wide character" on a filehandle that has no encoding or utf8 layer. Some modules\, like MIME::Base64\, provide the same functionality.

It is similar\, but it works completely differently: It only warns if you pass something into a function/filehandle that knows that it is expecting binary data.

Unlike unpack\, the UTF-X flag has nothing to do with the warning: the warning tells you that the data you pass in is not binary data because it contains at least one character >255. That's completely fine. But when I do pass in a string only consisting of octets (at the Perl level)\, then it gets passed into the function as binary\, as one would expect.

And that\, again\, has nothing to do with the UTF-X flag. Data passed into such a function gets properly downgraded (that process is what actually generates the warning\, btw).
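
The distinction can be sketched as follows, assuming an in-memory filehandle with no encoding layer and a `__WARN__` handler that merely counts warnings (both are choices made for this illustration). An upgraded string that contains only characters up to 255 is written as plain octets without complaint; only a character above 255 triggers "Wide character".

```perl
use strict;
use warnings;

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $buffer = '';
open my $fh, '>', \$buffer or die "cannot open in-memory handle: $!";

my $octets = "\xe9";          # one character, value 233
utf8::upgrade($octets);       # flag on, still no character above 255
print {$fh} $octets;          # downgraded on output, no warning

my $after_octets = scalar @warnings;

print {$fh} chr(1000);        # a real wide character
close $fh;

my $after_wide = scalar @warnings;
print "$after_octets $after_wide\n";   # 0 1
```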

How are users supposed to do that\, unless they know about the flag in the first place?

By keeping byte strings and text string separate. Please either accept this\, or stop asking me questions that will lead to this answer.

I am asking about how users do that\, I am not asking what you think they should do. I am asking specifically _how_ your idea should be put into practice. I gave you an example where the only currently known way to do that is by knowing and manipulating the internal UTF-X flag.

And since you have not given an answer to that question\, it stays a valid question.

The problem is that your coding style cannot resolve this situation\, as the module in question (JSON::XS) does not know whether the given piece of data is binary or text. Only the user knows\, but by then it is already upgraded.

Right\, and then you want perl functions to die depending on the setting of that flag\, even though you also claim Perl users should not need to know about it.

The warning would not be a new feature\, but an existing feature applied in more places. "die" is probably too harsh indeed.

No part of perl acts like that\, see above: the parts that generate that warning are all downgrading properly\, ensuring the perl promises of string handling are kept.

When they get the error message\, they can read the following in perldiag:

    Wide character in %s
        (W utf8) Perl met a wide character (>255) when it wasn't expecting one.  This warning is by default on for I/O
        (like print).  The easiest way to quiet this warning is simply to add the ":utf8" layer to the output, e.g.
        "binmode STDOUT, ':utf8'".  Another way to turn off the warning is to add "no warnings 'utf8';" but that is
        often closer to cheating.  In general, you are supposed to explicitly mark the filehandle with an encoding,
        see open and "binmode" in perlfunc.

Changing the order of these sentences is on my to-do list.

You are completely confused. I am talking about octet strings (or byte strings in your parlance). That string _never_ triggers that warning\, regardless of how it is encoded internally\, because octet strings never contain wide characters.

That's how the abstraction should work.

Your change of warning when the UTF-X bit is set would break that abstraction\, because users suddenly would get that warning for strings that do not contain wide characters *at all*.

That I can only call very misleading to users.

Note how this clear explanation doesn't mention the UTF8 flag!

Exactly: because you didn't understand the mechanics of that warning because it doesn't do what you claim it does\, namely warn if the UTF-X flag is set but instead does the right thing and warns when there *is* a wide character in the string\, regardless of how it was encoded.

Do you finally understand? Please!

You want perl functions to behave differently depending on whether that flag is set or not. I want perl functions to behave the same\, regardless of the fact.

I want Perl to warn about certain mistakes when it can.

No\, you want Perl to warn even when no mistakes happened because you equate UTF-X flag with "contains no (binary) octets/bytes".

But that's not how Perl works. That's where you misunderstand how the UTF-X flag works. Perl warns on real problems (and probably should die)\, not because the UTF-X flag happens to be set\, which is misleading.

Do you finally understand how Perl works?

That's not what I said\, nor what I meant. In fact\, quite the opposite. So then unpack should not croak when it sees the UTF-X flag?

No\, it should warn instead. From now on\, I no longer think it should die. It should warn\, and people who want it to die can do so with "use warnings FATAL".

Of course it should not warn. That *exposes* the UTF-X flag to the user. And the warning you quote would simply be wrong\, because users would get that warning even when no wide character is in the string at all.

I don't usually read bug reports\, and never claimed to have done so.

But in this special case\, I will make an exception\, and read the Unicode related bug reports that you have submitted.

Maybe you learn what the UTF-X flag does\, and why it shouldn't be exposed in the way you think it should be or is currently exposed.

The UTF-X flag is *no* indication of a wide character whatsoever. In Perl.

I think it's obvious by now that you do not know very much about unicode handling vs. the UTF-X flag in Perl. At least your knowledge is mostly wrong\, it seems.

And that's sad\, because it could be very simple\, and for the most part already is very simple: Often used modules will simply be improved to use SvPVbyte explicitly\, even if there is no default typemap support for it. And modules requiring binary data will eventually be fixed to use "U" instead of "C" for decoding single octets. And the rest of perl works relatively fine\, and the remaining issues will be fixed\, too.

I just think it would be much better for Perl if those changes were not required and things would just continue to work by providing backwards compatibility.

-- The choice of a GNU generation. Marc Lehmann (pcg@goof.com\, XX11-RIPE) http://schmorp.de/

p5pRT commented 17 years ago

From @Juerd

Marc Lehmann skribis 2007-03-31 7:55 (+0200):

Personally\, I think that unpack with a byte-specific signature should die\, or at least warn\, when its operand has the UTF8 flag set. I've since this post changed my mind\, and think it should only warn if

We are making progress\, and I would actually be content with that solution\, but it does break "U".

No\, breaking U does not occur\, because it's not in my list of byte-specific (un)pack templates. U is for unicode characters.

The solution\, really\, is to treat C like an octet in the same way "n" is treated like two octets.

It does that\, but we're having a very different understanding of the word "octet"\, and my hands hurt\, so I'm not going through it all again.

Since so many people are confused about why the unpack change breaks code\, I will explain it differently:

    my $k = "\x10\x00";
    die unpack "n", $k;

this gives me 4096. "n" is documented to take exactly 16 bits\, two octets.

    juerd@lanova:~$ perl -le'print unpack "n", "\x{20ac}"'
    57986

"\x{20ac}" is one character\, but "n" works on octets\, not characters. This uses the internal buffer without warning\, and picks the first two octets of the three-octet sequence e2 82 ac. This octet sequence should be hidden from the programmer\, but it is too late for that. So instead\, let's warn the programmer that what's going on is very probably not what they intended.

    juerd@lanova:~$ perl -le'print unpack "n", "\xe2\x82"'
    57986

The annoying thing for people who don't know when Perl upgrades strings\, is when you started with a nice 2-octet byte string\, and it got upgraded somewhere. Here\, forced for illustration\, and using the same 2-octet sequence so the difference in results is obvious:

    juerd@lanova:~$ perl -le'$foo = "\xe2\x82"; utf8::upgrade($foo); print unpack "n", $foo'
    50082

A warning about the wide characters here would be in order and save people's butts.

I get 4096 regardless of how perl chooses to represent it internally

Because Perl always uses latin1 or utf8 internally\, in both of which \x10 and \x00 are octets 0x10 and 0x00 respectively.

If perl goes to using UCS-4 (something that won't happen for sure\, but has been stated before to remind people that internal encoding can change)\, it would still work.

Not as far as I can tell\, because Perl uses the raw octets of the internal encoding whenever you do byte-specific operations\, and the internal encoding for U+0010 and U+0000 changes when you go from UTF-8 to UCS-4.

That's why it's so darn useful to use latin1 when possible\, because you can then be pretty sure that "\x10\x00" will be the two octets you expect. (Note that breaking this is the main breakage caused by encoding.pm.)

However\, in a weird stroke\, somebody decided that "C" no longer gives you a single octet of your string\, but\, depending on internal encoding\, depending on an internal flag\, part of that octet or the octet.

What you call "octet"\, I call "character". And I'll never call that "octet" or "byte" because then none of the documentation about all this would still be right\, and Perl would suddenly indeed be broken.

If you insist on calling the value of "\x{20ac}" a single octet\, then indeed pack/unpack will not do what you want\, because what you want is just not how it works.

"\x{20ac}" is one character. Internally\, represented by three octets. The internal representation is used\, if you unpack with byte-specific templates like "C" or "n".

Byte strings\, i.e. strings with no character values >255 that have never been in contact with UTF-8 encoded strings\, may be interpreted as latin1 and internally converted to UTF-8 when you join them with text strings. This causes unpack to see very different values\, and that's one of the reasons one should avoid mixing byte strings and text strings.

Note that my definition of "text string" excludes byte encoded strings\, such as the results of encode() or utf8::encode().
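
As a sketch of the byte-string side of this definition: encoding explicitly with Encode (a core module) produces the e2 82 ac octet sequence discussed earlier as an ordinary byte string, so byte-oriented templates see the same values regardless of how the original text string was stored. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text  = "\x{20ac}";               # one character: EURO SIGN
my $bytes = encode('UTF-8', $text);   # explicit byte string: e2 82 ac

my @octets  = unpack "C*", $bytes;
my $first16 = unpack "n", $bytes;     # first two octets, big-endian

print length($text), " ", length($bytes), "\n";   # 1 3
print "@octets\n";                                # 226 130 172
print "$first16\n";                               # 57986
```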

Now\, what has been unpack "CCV" in perl 5.005 must be written as unpack "UUV" in perl 5.8\, as "U" has the right semantics for decoding a single octet out of a binary string. That's weird

Weird only because you choose to use a different meaning of the word "octet" than much of the rest of the world.

Now\, I don't mind at all if I get a die when trying "C" on a byte=character that is >255 (i.e. not representable as an octet).

Just so other people know: since Perl has had Unicode support\, there has been a consistent effort to teach people that character != byte\, and that a single character may consist of several bytes.

In fact\, this effort has been present in larger parts of computing than just Perl\, but for clarity's sake\, I'm sticking to Perl because sometimes Perl's definitions differ. (For example\, in Perl\, a character is a single code point\, while in Unicode\, a character can be composed out of several combining code points.)

Also\, values greater than 255 do not fit in a single byte\, according to computer science that decided that byte==octet==8 bits. 8 bits simply hold only 2**8==256 values. Hence the need for a distinction between bytes\, and things that *are* able to hold other values.

I personally dislike the warning\, because the warning only ever comes up when there is a bug.

I love warnings that only ever come up when I have a bug. In fact\, I generally dislike warnings that don't follow that pattern.

-- korajn salutojn\,

juerd waalboer: perl hacker \juerd@juerd\.nl \<http://juerd.nl/sig> convolution: ict solutions and consultancy \sales@convolution\.nl

Ik vertrouw stemcomputers niet. Zie \<http://www.wijvertrouwenstemcomputersniet.nl/>.

p5pRT commented 17 years ago

From q.eibcartereio.=~m-b.{6}-cgimosx@gumdrop.flyinganvil.org

On Sat\, Mar 31\, 2007 at 01:53:48AM +0200\, Marc Lehmann wrote:

In C\, a single byte is a character\, even if it happens to have a value higher than 255 (although very few compilers allow that\, usually\, a byte is an octet\, although it is common on DSPs to have 32 bit bytes).

Even if Perl encoded a single character into multiple C bytes/octets\, that does not mean its more than a single character.

The documentation is completely contradictory when it comes to "C" and can easily be interpreted to mean a single character in the C sense.

Fact is "even under Unicode" it doesn't work as advertised\, because Unicode can be internally represented in multiple ways in Perl.

I think that "char value" should be either removed from perlfunc\, or explained in more detail. It's NOT OBVIOUS to those who don't know C.

To those who do know C it has perfectly clear meaning\, namely a single character.

http://www.parashift.com/c++-faq-lite/intrinsic-types.html#faq-26.3

But that is not really relevant to the discussion.

Communication is difficult if you cannot express clearly what you are trying to say. Terminology is important to get correct\, and it is easy to confuse others or yourself if you are not precise when you need to be.

Unicode does not even HAVE characters\, it has codepoints. This did not happen by accident and is an important distinction to make.

    $x = "ABCD";
    $x = "\x41\x42\x43\x44";
    $x = chr(65) . chr(66) . chr(67) . chr(68);
    $x = pack("C*", 65, 66, 67, 68);

All of these put the same data into $x. [1] We can reasonably assume that $x contains a sequence of 4 bytes\, each 8 bits wide. We do not know anything about what $x is\, if it has an encoding\, if it is actually the output of pack "V"\, or maybe it came after "HTTP/1.1 GET ". The only reasonable thing to assume is that it is just a sequence of octets\, aka binary data.

Now consider the case of

$y = chr(1000);

Clearly whatever is in $y cannot be a single octet. The way Perl currently works (and this is my limited understanding here - someone with more knowledge can feel free to step in and correct my errors) is that now $y is considered to be a string of Unicode codepoints. So $y contains a single codepoint\, U+03E8. The internal flag is used to indicate that the internal data pointer points to something that is a "Unicode codepoint string".

What can we do with such a string? We can try to print it\, but if we have not converted it we get a message like

Wide character in print at - line 1.

and we get the bytes "cf a8" as output because that is the internal encoding.

print unpack("H*"\, $y);

produces "cfa8" as output\, again because we have been given access to the string as it exists upgraded.

On the other hand\,

print unpack("H*"\, pack("C"\, 1000));

produces "e8".

So consider again:

unpack("C*"\, $y);

This currently produces the list (207\, 168) which is again the internal encoding. What else should it do? If you expect values over 255\, then you should not use "C". If you don't have values over 255\, then why is your string not just a sequence of bytes? Something must have occurred to upgrade it to "sequence of unicode codepoints".
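
The byte values quoted above can be reproduced without depending on how unpack treats upgraded strings (as far as I know that changed in Perl 5.10, where pack and unpack began processing characters rather than internal bytes). This sketch encodes a copy explicitly with `utf8::encode` and then inspects the resulting octets; the `no warnings 'pack'` only silences the wrap-around warning.

```perl
use strict;
use warnings;

my $y = chr(1000);                 # one code point, U+03E8

my $encoded = $y;
utf8::encode($encoded);            # explicit UTF-8 octets of that code point
my $hex = unpack "H*", $encoded;
print "$hex\n";                    # cfa8

my $wrapped;
{
    no warnings 'pack';            # silence "Character in 'C' format wrapped"
    $wrapped = unpack "H*", pack("C", 1000);
}
print "$wrapped\n";                # e8  (1000 % 256 == 232 == 0xe8)
```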

Of course if you have values over 255 you have to use "U" in unpack\, that only makes sense! On the other hand\, if you are agnostic to your string and just treat it as "data" then it will never get upgraded. So where is the issue?

It sounds to me that what you are trying to suggest is something along the lines of another type of Sv for the case of "unicode codepoint sequence"\, so that SvPV implicitly means "This scalar is not upgraded and is just data" and SvP_UnicodeArrayValue_ would contain the upgraded value. Then for anything that wanted a SvPV (XS code\, unpack "C") the only sensible thing would be to try to downgrade the string at that point and then emit a warning in the case of "wide characters" being present.

This is the point at which someone more familiar with internals chimes in and says "This has problems [backwards compatibility\, tuits\, other]." And of course this would preclude being able to inspect Perl's internal Unicode representation using unpack "C". :)

-- -Ben Carter Human beings\, who are almost unique in having the ability to learn from the experience of others\, are also remarkable for their apparent disinclination to do so. - Douglas Adams\, "Last Chance to See"

[1] I am deliberately ignoring the box in the corner labeled "EBCDIC".

p5pRT commented 17 years ago

From @abigail

On Sat\, Mar 31\, 2007 at 04:08:30AM -0600\, Ben Carter wrote:

Now consider the case of

$y = chr(1000);

Clearly whatever is in $y cannot be a single octet. The way Perl currently works (and this is my limited understanding here - someone with more knowledge can feel free to step in and correct my errors) is that now $y is considered to be a string of Unicode codepoints. So $y contains a single codepoint\, U+03E8. The internal flag is used to indicate that the internal data pointer points to something that is a "Unicode codepoint string".

No.

"ABCD" also contains 4 Unicode code points.

Perl strings only contain Unicode code points. Always.

The issue is not whether or not a string is a "Unicode" string or not\, the point is the *encoding* of the Unicode code points. That can be in UTF-8 (variable number of bytes/code point)\, or Latin-1 (one byte/character).

Unicode does not imply UTF-8.
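
A small sketch of this point, with illustrative variable names: the same four code points stored in both internal encodings compare equal and have the same length; only the internal flag, read here with `utf8::is_utf8`, differs.

```perl
use strict;
use warnings;

my $latin1 = "caf\xe9";        # stored one byte per character
my $utf8   = $latin1;
utf8::upgrade($utf8);          # same characters, UTF-8 internal encoding

my $same   = $latin1 eq $utf8 ? "same" : "different";
my @flags  = map { utf8::is_utf8($_) ? 1 : 0 } $latin1, $utf8;
my @points = map { ord } split //, $utf8;

print "$same\n";                                  # same
print length($latin1), " ", length($utf8), "\n";  # 4 4
print "@flags\n";                                 # 0 1
print "@points\n";                                # 99 97 102 233
```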

Abigail

p5pRT commented 17 years ago

From nospam-abuse@bloodgate.com

Moin\,

On Saturday 31 March 2007 00:29:42 Marc Lehmann wrote:

On Sat\, Mar 31\, 2007 at 02:16:49AM +0200\, Juerd Waalboer \juerd@convolution\.nl wrote:

Marc Lehmann skribis 2007-03-31 2:12 (+0200):

Yes\, and the exact same is true for unicode (both have a 1-1 mapping between 0..255 and octets)\, trivially\, of course\, as unicode explicitly is a superset of latin1.

Unicode is a character set\, not a character encoding.

As is latin1.

A unicode string is a sequence of codepoints\, not octets.

Nope. You can encode unicode codepoints into UTF-8 and still end up with a unicode string. Encoding doesn't change the fact that it is unicode that you are storing.

Since it seems hard to grasp\, here is an example:

    use Encode;
    my $s = "Hello, World!";
    $s = Encode::encode_utf8($s);

$s contains the famous greeting before and after the encoding. It is still an ASCII string\, an iso-8859-15 string\, a unicode string\, and a text string. Whether it is encoded or not does not change the fact that that string contains the message "Hello\, World!".

If you drop ASCII\, the same is true for "Hallöchen!"\, which looks different in UTF-8 than in an unencoded string\, but it is still the same message. And it is still using unicode to represent the characters.

The fact that you encode something does not change the something that you encode. Making an arbitrary difference only confuses the issue.
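
The two greetings can be compared in a short sketch (Encode is a core module; the variable names are illustrative). The ASCII message is byte-for-byte unchanged by encoding, while the one with an o-umlaut grows by one octet yet decodes back to the same message.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $ascii  = "Hello, World!";
my $german = "Hall\xf6chen!";           # greeting with o-umlaut, 10 characters

my $ascii_bytes  = encode_utf8($ascii);
my $german_bytes = encode_utf8($german);
my $round_trip   = decode_utf8($german_bytes);

print $ascii_bytes eq $ascii ? "unchanged" : "changed", "\n";   # unchanged
print length($german), " ", length($german_bytes), "\n";        # 10 11
print $round_trip eq $german ? "round-trips" : "lost", "\n";    # round-trips
```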

Especially since Perl itself doesn't have any way to distinguish "a" (UNKNOWN ENCODING) from "a" (ASCII) from "a" (ISO-8859-1) from "a" (UTF-8) - except one bit :)

All the best\,

Tels

- -- Signed on Sat Mar 31 12:24:31 2007 with key 0x93B84C15. Get one of my photo posters: http://bloodgate.com/posters PGP key on http://bloodgate.com/tels.asc or per email.

"Most people\, I think\, don't even know what a rootkit is\, so why should they care about it?"

-- Thomas Hesse\, President of Sony BMG's global digital business division\, 2005.

p5pRT commented 17 years ago

From nospam-abuse@bloodgate.com

Moin\,

On Saturday 31 March 2007 00:33:55 Juerd Waalboer wrote:

Tels skribis 2007-03-31 1:39 (+0000):

My question was posed because I wanted to know how to *keep* a KOI8 (or any other random binary) string in Perl without converting it to Unicode. It seems to me this is not easily possible because there are literally dozens of places where your KOI8 string might get suddenly upgraded to UTF-8 (and thus get corrupted because Perl treats it as ISO-8859-1). Or did I get this wrong?

A koi8r string is a byte string. If you keep it separated from text strings properly\, it should not be upgraded and thus treated as latin1. I'm very curious as to "sudden upgrades" that aren't related to mixing with text strings. Should you encounter them\, please let me know.

"Keeping things separate" is not working in the Real World[tm]. As far as I can see:

    #!/usr/bin/perl -w
    use Encode qw/decode/;
    my $random = "\xc3\xc3";    # some random bytes
    my $ascii = "a";            # some 7bit data

    # Somebody "helpful" decodes the ascii string:
    # The encoding doesn't actually matter, since it is 7bit anyway.
    # This step happens out of my control (e.g. in third party code)
    $string = decode('ISO-8859-1', $ascii);

    # now take our random binary data and a 7bit ascii string and do:
    print join (" ", unpack("CCC", "$random$string")), "\n";
    print join (" ", unpack("CCC", "$random$ascii")), "\n";

Now explain to me why this prints different things even tho $random is the same string in both cases\, and $string and $ascii should be the same\, too. :) Bonus points if you manage to not mention the uhh -- ut - utf -- uhm -- er The Flag[tm].

So far\, I can see the ways to handle this are:

* replace C with U (lots of code review work\, plus it still means your 200Mbyte TIFF file might make a trip to UTF-8 land and back)
* always forcefully downgrade stuff in 7bit ASCII (wasteful) and just hope your 8bit data never gets in contact with anything with The Flag[tm]
* never mix fire and water er dogs and cats er I mean text and bytes\, and pray that every piece of code out there adheres to this\, too.

I think the Pray and Hope[tm] strategy doesn't really work\, tho.
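
For reference, the "forcefully downgrade" option from the list above might look like the following sketch. It reuses the variables from the script; calling `utf8::downgrade` on the concatenated string is an assumption about where the fix would go, and it dies if a character above 255 is present. After the downgrade both unpack calls should see the same three octets.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $random = "\xc3\xc3";                      # binary data
my $ascii  = "a";
my $string = decode('ISO-8859-1', $ascii);    # "helpfully" decoded elsewhere

my $mixed = "$random$string";                 # may be upgraded internally
utf8::downgrade($mixed);                      # dies if a character > 255 is present

my @with_decoded = unpack "CCC", $mixed;
my @with_ascii   = unpack "CCC", "$random$ascii";

print "@with_decoded\n";    # 195 195 97
print "@with_ascii\n";      # 195 195 97
```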

All the best\,

Tels

- -- Signed on Sat Mar 31 12:09:53 2007 with key 0x93B84C15. Get one of my photo posters: http://bloodgate.com/posters PGP key on http://bloodgate.com/tels.asc or per email.

"Sundials don't work\, the one I've had in my basement hasn't changed time since I installed it." grub (11606) on 2004-12-03 on /.

p5pRT commented 17 years ago

From @Juerd

Tels skribis 2007-03-31 12:23 (+0000):

    #!/usr/bin/perl -w
    use Encode qw/decode/;
    my $random = "\xc3\xc3";    # some random bytes
    my $ascii = "a";            # some 7bit data

    # Somebody "helpful" decodes the ascii string:
    # The encoding doesn't actually matter, since it is 7bit anyway.
    # This step happens out of my control (e.g. in third party code)
    $string = decode('ISO-8859-1', $ascii);

$string is a text string\, now. Remember\, decoding is going from byte string to text string.

Using unpack "C" on a text string makes no sense if you consider that this "C" doesn't stand for "character" in the sense that the documentation for chr\, ord\, length\, split\, etcetera use. It stands for "char"\, which is a C datatype that contains one byte.

As such\, unpack "C" is a byte operation and makes sense on byte strings only. $string is a text string\, and you can tell by looking at the decode() step.

    # now take our random binary data and a 7bit ascii string and do:
    print join (" ", unpack("CCC", "$random$string")), "\n";

Dangerous\, and that's why I suggested adding a "wide character in..." warning earlier in this thread.

Now explain to me why this prints different things even tho $random is the same string in both cases\, and $string and $ascii should be the same\, too. :) Bonus points if you manage to not mention the uhh -- ut - utf -- uhm -- er The Flag[tm].

I get the bonus points! Hurrah! :)

The only explanation that I used is the separation between text strings and binary strings. It's also the only thing you need to know. You'll benefit from knowing more\, certainly\, but I see red flags in your code.

So far\, I can see the ways to handle this are: (..) * never mix fire and water er dogs and cats er I mean text and bytes\, and pray that every piece of code out there to adheres to this\, too.

Exactly.

I think the Pray and Hope[tm] strategy doesn't really work\, tho.

It doesn't always work, because people can't be trusted to do the right thing, but it can always be fixed.

-- Kind regards,

juerd waalboer: perl hacker juerd@juerd.nl <http://juerd.nl/sig>
convolution: ict solutions and consultancy sales@convolution.nl

I don't trust voting computers. See <http://www.wijvertrouwenstemcomputersniet.nl/>.

p5pRT commented 17 years ago

From @Juerd

Ben Carter wrote 2007-03-31 4:08 (-0600):

Unicode does not even HAVE characters\, it has codepoints.

Very good point\, but Perl's documentation refers to codepoints as "characters"\, and does that rather consistently.

I'm considering sweeping through the docs and changing it all\, but it would be a lot of work and a huge patch. I wonder if it's worth that.

Now consider the case of $y = chr(1000); Clearly whatever is in $y cannot be a single octet. The way Perl currently works is that now $y is considered to be a string of Unicode codepoints.

Yes.

But to go into a bit more detail for the more interesting case of chr(233): this is either a byte string with only one byte\, or a text string with only one cha^Wcodepoint. Perl doesn't know\, or care\, so the programmer has to.
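As a small illustration of that ambiguity (a sketch, not part of the original exchange): the same one-element string has length 1 either way, and it only turns into a definite number of bytes once the programmer chooses an encoding for it.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $s = chr(233);    # one element with value 233: a byte, or the codepoint U+00E9

# Treating $s as text, the bytes depend on the encoding that is chosen.
my $as_latin1 = encode('ISO-8859-1', $s);   # one byte:  e9
my $as_utf8   = encode('UTF-8', $s);        # two bytes: c3 a9

printf "%d %d %d\n", length($s), length($as_latin1), length($as_utf8);
printf "%s %s\n", unpack("H*", $as_latin1), unpack("H*", $as_utf8);
```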

So $y contains a single codepoint\, U+03E8. The internal flag is used to indicate that the internal data pointer points to something that is a "Unicode codepoint string".

No\, see Abigail's response for clarification.

    print unpack("H*", pack("C", 1000));

Feeding 1000 to C has undefined behaviour: the C type can only handle values 0..255, and there's no documentation defining what happens if you feed it something <0 or >255. A similar thing occurs with floating point numbers, like 64.5. The current implementation truncates that to 64, without warning.
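One way to avoid relying on that undefined behaviour is to validate the value before it reaches pack. The following is only a sketch; `pack_byte` is a hypothetical helper name, not something provided by Perl or by any module mentioned in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: accept only non-negative integers in 0..255,
# so that pack "C" never sees a value it cannot represent.
sub pack_byte {
    my ($n) = @_;
    die "not an integer in 0..255: $n\n"
        unless $n =~ /\A[0-9]+\z/ && $n <= 255;
    return pack("C", $n);
}

my $ok  = pack_byte(64);                          # a single byte with value 64
my $err = eval { pack_byte(1000); 1 } ? "" : $@;  # rejected instead of wrapped

print unpack("H*", $ok), "\n";
print $err;
```

The same check rejects 64.5, so a fractional value is reported rather than silently truncated.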

If you expect values over 255\, then you should not use "C".

Indeed!

Of course if you have values over 255 you have to use "U" in unpack\, that only makes sense!

If these values are codepoints\, yes. But if they're just numbers\, other unpack templates\, like perhaps N or V are better.
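A short sketch of the distinction, using 1000 as the example value: as a plain number it can be packed into a fixed-width integer with N (32-bit big-endian) or V (32-bit little-endian), while as a codepoint it is simply the one-character text string chr(1000).

```perl
use strict;
use warnings;

my $n = 1000;

# As a number: four bytes, in either byte order.
my $big    = unpack("H*", pack("N", $n));   # big-endian
my $little = unpack("H*", pack("V", $n));   # little-endian

# As a codepoint: a text string holding the single character U+03E8.
my $char = chr($n);

print "$big $little ", length($char), " ", ord($char), "\n";
```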

[1] I am deliberately ignoring the box in the corner labeled "EBCDIC".

Oh, so am I. In fact, I've probably never even seen such a box in my short life so far.

-- Kind regards,


p5pRT commented 17 years ago

From nospam-abuse@bloodgate.com


Hi,

On Saturday 31 March 2007 16:09:18 Juerd Waalboer wrote:

Tels wrote 2007-03-31 12:23 (+0000):

    #!/usr/bin/perl -w
    use Encode qw/decode/;
    my $random = "\xc3\xc3";        # some random bytes
    my $ascii = "a";        # some 7bit data

    # Somebody "helpful" decodes the ascii string:
    # The encoding doesn't actually matter, since it is 7bit anyway.
    # This step happens out of my control (e.g. in third party code)
    $string = decode('ISO-8859-1', $ascii);

$string is a text string\, now. Remember\, decoding is going from byte string to text string.

Yes\, but my point was that I:

* might not be the one who "decoded" $string or produced it even.
* do not know if I am passed a "text" string as there is only the flag-you-should-not-know-about to distinguish these two.

Using unpack "C" on a text string makes no sense if you consider that this "C" doesn't stand for "character" in the sense that the documentation for chr\, ord\, length\, split\, etcetera use. It stands for "char"\, which is a C datatype that contains one byte.

As such\, unpack "C" is a byte operation and makes sense on byte strings only. $string is a text string\, and you can tell by looking at the decode() step.

    # now take our random binary data and a 7bit ascii string and do:
    print join (" ", unpack("CCC", "$random$string")), "\n";

Dangerous, and that's why I suggested adding a "wide character in..." warning earlier in this thread.

Now explain to me why this prints different things even tho $random is the same string in both cases\, and $string and $ascii should be the same\, too. :) Bonus points if you manage to not mention the uhh -- ut - utf -- uhm -- er The Flag[tm].

I get the bonus points! Hurrah! :)

Not really, as you didn't explain the difference, you merely told me "there is a difference" (where I personally don't expect there to be a difference).

The only explanation that I used is the separation between text strings and binary strings. It's also the only thing you need to know. You'll benefit from knowing more, certainly, but I see red flags in your code.

Ok, and how am I supposed to know that in:

sub dosomething { my $a = shift; }

$a is a text string or a binary string? :)

So far, I can see the ways to handle this are: (..) * never mix fire and water er dogs and cats er I mean text and bytes, and pray that every piece of code out there adheres to this, too.

Exactly.

This is not a working strategy.

I think the Pray and Hope[tm] strategy doesn't really work\, tho.

It doesn't always work\, because people can't be trusted to do the right thing\, but it can always be fixed.

Only if you consider your own code. But data is sometimes processed by other code (Perl itself\, some module etc.).

All the best\,

Tels

-- Signed on Sat Mar 31 18:33:51 2007 with key 0x93B84C15. Get one of my photo posters: http://bloodgate.com/posters PGP key on http://bloodgate.com/tels.asc or per email.

"We're looking at a future where only the very largest companies will be able to implement software\, and it will technically be illegal for other people to do so."

-- Bruce Perens, 2004-01-23

p5pRT commented 17 years ago

From @Juerd

Tels wrote 2007-03-31 18:38 (+0000):

* might not be the one who "decoded" $string or produced it even.
* do not know if I am passed a "text" string as there is only the flag-you-should-not-know-about to distinguish these two.

(...)

Ok, and how am I supposed to know that in:

sub dosomething { my $a = shift; }

$a is a text string or a binary string? :)

No, not even the flag-you-should-not-know-about distinguishes between the two.

When you're writing a library function to handle arbitrary data\, you'll have to pick sides\, either text or binary. Fortunately\, the choice is often very simple.

When you can't choose between these two\, you could write two functions: one for text data\, one for binary data. Often you can write the text function simply by using the binary thing underneath\, with a specified UTF encoding.
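A minimal sketch of that two-function pattern, using a checksum as a stand-in task. The names `checksum_bytes` and `checksum_text` are hypothetical, and the choice of UTF-8 as the fixed encoding is an assumption made for the example; Digest::MD5 is used only because it operates on bytes.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode);

# Binary side: the caller promises a byte string (every element in 0..255).
sub checksum_bytes {
    my ($bytes) = @_;
    return md5_hex($bytes);
}

# Text side: fix the encoding once (UTF-8 here), then reuse the binary function.
sub checksum_text {
    my ($text) = @_;
    return checksum_bytes(encode('UTF-8', $text));
}

print checksum_text("abc"), "\n";
print checksum_text(chr(1000)), "\n";   # fine: encoded to bytes first
```

The caller decides which function to call based on what the data means, not on how Perl happens to store it.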

If you're just serializing data\, you could opt for storing the literal internal buffer along with the state of the UTF8 flag\, or (exactly like the previous paragraph) pick any specific encoding and stick to that.
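The second option, sticking to one specific encoding, might look like the following sketch. `freeze_text` and `thaw_text` are hypothetical names, and UTF-8 is an assumed choice; any encoding that can represent the data would serve, as long as both directions agree on it.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Serialize: text string in, byte string out, always in the same encoding.
sub freeze_text {
    my ($text) = @_;
    return encode('UTF-8', $text);
}

# Deserialize: byte string in, text string out.
sub thaw_text {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes);
}

my $original = "caf" . chr(233) . " " . chr(1000);   # 6 characters
my $stored   = freeze_text($original);               # bytes, safe to write out
my $restored = thaw_text($stored);

printf "%d characters, %d bytes, round trip %s\n",
    length($original), length($stored),
    $restored eq $original ? "ok" : "failed";
```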

If you happen to have a function in a current API (i.e. not a contrived one) for which you find it hard to decide\, please let me know the details. I'll help you offlist.

Only if you consider your own code. But data is sometimes processed by other code (Perl itself\, some module etc.).

Yes, indeed. This can be troublesome. In particular, many, many modules still don't correctly support Unicode. I'm slowly but surely compiling a list at http://juerd.nl/perluniadvice. Wanna help?

-- Kind regards,


p5pRT commented 17 years ago

From @Juerd

Warnocked?

-- Kind regards,


p5pRT commented 12 years ago

@cpansprout - Status changed from 'open' to 'rejected'