
perl's unicode conversion fails when iconv succeeds [rt.cpan.org #73623] #11833

Closed: p5pRT closed this issue 13 years ago

p5pRT commented 13 years ago

Migrated from rt.perl.org#107326 (status was 'rejected')

Searchable as RT107326$

p5pRT commented 13 years ago

From perl-diddler@tlinx.org

This is a bug report for perl from perl-diddler@tlinx.org, generated with the help of perlbug 1.39 running under perl 5.12.3.


[Please describe your issue here]

Was looking at ways to do upper/lower case compare, and bumped into piconv being described as a 'drop-in replacement for "iconv"'. So I decided to try it, thinking it would be a 'hoot' if it was faster.

Rather than faster, it choked at the beginning of my 98M test file (i.e. I truncated the file to the first few lines, 672 bytes), which reproduces the problem just fine... Très sad...

p5pRT commented 13 years ago

From perl-diddler@tlinx.org

test.in

p5pRT commented 13 years ago

From @cpansprout

On Fri Dec 30 10:41:46 2011, LAWalsh wrote:

This is a bug report for perl from perl-diddler@tlinx.org, generated with the help of perlbug 1.39 running under perl 5.12.3.

-----------------------------------------------------------------
[Please describe your issue here]

Was looking at ways to do upper/lower case compare, and bumped into piconv being described as a 'drop-in replacement for "iconv"'. So I decided to try it, thinking it would be a 'hoot' if it was faster.

Rather than faster, it choked at the beginning of my 98M test file (i.e. I truncated the file to the first few lines, 672 bytes), which reproduces the problem just fine... Très sad...

You're right:

$ piconv5.15.6 -f utf16 -t utf-8 /Users/sprout/Downloads/test.in
UTF-16:Unrecognised BOM d at /usr/local/lib/perl5/5.15.6/darwin-thread-multi-2level/Encode.pm line 196, <$ifh> line 2.

The file begins with \\.

If I use utf-16le explicitly, it does the first line correctly, but quickly switches to Chinese, which means it's off by one byte. If I use utf-16be explicitly, the first line is in Chinese.
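
For anyone reproducing this outside piconv, here is a minimal Encode sketch (the file name test.in and its leading FF FE BOM come from this ticket; everything else is assumed) showing the difference between letting 'UTF-16' take the byte order from the BOM and forcing one explicitly:

  use strict;
  use warnings;
  use Encode qw(decode);

  # Slurp the whole sample as raw octets; reading line-by-line would
  # split UTF-16 code units (see the discussion further down).
  open my $fh, '<:raw', 'test.in' or die "test.in: $!";
  my $octets = do { local $/; <$fh> };

  # 'UTF-16' consults the BOM (FF FE here, i.e. little-endian) and strips it;
  # the explicit variants simply assume the byte order you name.
  my $auto = decode('UTF-16',   $octets);  # needs the leading BOM
  my $le   = decode('UTF-16LE', $octets);  # keeps the BOM as U+FEFF
  my $be   = decode('UTF-16BE', $octets);  # misreads an LE file byte-swapped

  printf "first character: U+%04X\n", ord $auto;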

This is part of the Encode distribution, for which CPAN is upstream, so I'm forwarding this to the CPAN ticket.

--

Father Chrysostomos

p5pRT commented 13 years ago

From @cpansprout

test.in

p5pRT commented 13 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 13 years ago

@cpansprout - Status changed from 'open' to 'rejected'

p5pRT commented 13 years ago

From bug-Encode@rt.cpan.org

<URL: https://rt.cpan.org/Ticket/Display.html?id=73623>

On Fri Dec 30 14:00:32 2011, perlbug-followup@perl.org wrote:

On Fri Dec 30 10:41:46 2011, LAWalsh wrote:

This is a bug report for perl from perl-diddler@tlinx.org, generated with the help of perlbug 1.39 running under perl 5.12.3.

-----------------------------------------------------------------
[Please describe your issue here]

Was looking at ways to do upper/lower case compare, and bumped into piconv being described as a 'drop-in replacement for "iconv"'. So I decided to try it, thinking it would be a 'hoot' if it was faster.

Rather than faster, it choked at the beginning of my 98M test file (i.e. I truncated the file to the first few lines, 672 bytes), which reproduces the problem just fine... Très sad...

You're right:

$ piconv5.15.6 -f utf16 -t utf-8 /Users/sprout/Downloads/test.in
UTF-16:Unrecognised BOM d at /usr/local/lib/perl5/5.15.6/darwin-thread-multi-2level/Encode.pm line 196, <$ifh> line 2.

The file begins with \\.

If I use utf-16le explicitly, it does the first line correctly, but quickly switches to Chinese, which means it's off by one byte.

It sounds like it's reading line-by-line, where a line is a sequence of bytes ended by 0A. Of course, that's wrong for UTF-16le (and UTF-16be, for that matter).
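
A sketch of the safe alternative, assuming the input really is UTF-16LE (file names are placeholders): decode at the I/O layer, so line reads see decoded characters rather than raw 0A bytes.

  use strict;
  use warnings;

  # Decode on input and encode on output via PerlIO layers; <$in> then
  # returns character strings, and "\n" matching happens after decoding.
  # Note: a leading BOM comes through as a U+FEFF character.
  open my $in,  '<:encoding(UTF-16LE)', 'test.in'  or die "test.in: $!";
  open my $out, '>:encoding(UTF-8)',    'test.out' or die "test.out: $!";
  print {$out} $_ while <$in>;
  close $out or die $!;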

p5pRT commented 13 years ago

From bug-Encode@rt.cpan.org

<URL: https://rt.cpan.org/Ticket/Display.html?id=73623>

Fix:

- my $need2slurp = $use_bom{ find_encoding($to)->name };
+ my $need2slurp = $use_bom{ find_encoding($from)->name };
+ if ($Opt{debug}){
+     printf "Read mode: %s\n", $need2slurp ? 'Slurp' : 'Line';
+ }
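
In other words, the decision to slurp must be keyed on the *source* encoding, since that is the stream whose byte order hangs on a one-off BOM. A standalone paraphrase of the idea (the contents of %use_bom here are my assumption; only the hash's existence is implied by the diff):

  use strict;
  use warnings;
  use Encode qw(find_encoding);

  # Encodings whose byte order is carried by a BOM must be decoded from the
  # whole stream, not line-by-line, or only the first "line" ever sees it.
  my %use_bom = map { $_ => 1 } qw(UTF-16 UTF-32);   # assumed set

  my $from = 'UTF-16';                               # example source encoding
  my $need2slurp = $use_bom{ find_encoding($from)->name };
  print $need2slurp ? "Slurp\n" : "Line\n";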

p5pRT commented 13 years ago

From bug-Encode@rt.cpan.org

<URL: https://rt.cpan.org/Ticket/Display.html?id=73623>

On Fri Dec 30 17:49:01 2011, ikegami wrote:

Fix:

- my $need2slurp = $use_bom{ find_encoding($to)->name };
+ my $need2slurp = $use_bom{ find_encoding($from)->name };
+ if ($Opt{debug}){
+     printf "Read mode: %s\n", $need2slurp ? 'Slurp' : 'Line';
+ }


Not to be pushy or anything, but where does one apply that fix? I couldn't find any need2slurp in my /usr/lib/perl5/{5.1{0.0,2.{1,3}}.0,{site,vendor}_perl} library dirs, so I don't know that the above lines were responsible for this particular breakage... but then I may not be searching in the right spots...

As for the lines in the file I submitted-- they looked like they all had CRLF as line separators...

p5pRT commented 13 years ago

From @ikegami

On Fri, Dec 30, 2011 at 6:15 PM, Linda A Walsh via RT <bug-Encode@rt.cpan.org> wrote:

<URL: https://rt.cpan.org/Ticket/Display.html?id=73623>

On Fri Dec 30 17:49:01 2011, ikegami wrote:

Fix:

- my $need2slurp = $use_bom{ find_encoding($to)->name };
+ my $need2slurp = $use_bom{ find_encoding($from)->name };
+ if ($Opt{debug}){
+     printf "Read mode: %s\n", $need2slurp ? 'Slurp' : 'Line';
+ }

---- Not to be pushy or anything, but where does one apply that fix? I couldn't find any need2slurp in my /usr/lib/perl5/{5.1{0.0,2.{1,3}}.0,{site,vendor}_perl} library dirs, so I don't know that the above lines were responsible for this particular breakage... but then I may not be searching in the right spots...

As for the lines in the file I submitted-- they looked like they all had CRLF as line separators...

piconv

p5pRT commented 13 years ago

From @ikegami

On Fri, Dec 30, 2011 at 6:15 PM, Linda A Walsh via RT <bug-Encode@rt.cpan.org> wrote:

As for the lines in the file I submitted-- they looked like they all had CRLF as line separators...

Probably. And not really relevant.

piconv was treating your file as a series of lines ending with 0A *before decoding*. LF is not 0A in UTF-16le, and an 0A is not necessarily part of an LF in UTF-16le.
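
A two-line illustration of that point (a hedged sketch; U+0A0A is just one convenient character whose UTF-16LE encoding happens to contain an 0A byte):

  use strict;
  use warnings;
  use Encode qw(encode);

  sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

  # A newline in UTF-16LE is the byte pair 0A 00, and a bare 0A byte can sit
  # inside an entirely different character, so splitting the undecoded byte
  # stream on 0A is never safe.
  printf "LF     => %s\n", hex_bytes(encode('UTF-16LE', "\n"));        # 0A 00
  printf "U+0A0A => %s\n", hex_bytes(encode('UTF-16LE', "\x{0A0A}"));  # 0A 0A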

p5pRT commented 13 years ago

From bug-Encode@rt.cpan.org

<URL: https://rt.cpan.org/Ticket/Display.html?id=73623>

On Fri Dec 30 17:49:01 2011, ikegami wrote:

Fix:

- my $need2slurp = $use_bom{ find_encoding($to)->name };
+ my $need2slurp = $use_bom{ find_encoding($from)->name };
+ if ($Opt{debug}){
+     printf "Read mode: %s\n", $need2slurp ? 'Slurp' : 'Line';
+ }

Partly works:
piconv -f UTF-16 -t UTF-8 <test.in >test.out
iconv -f UTF-16 -t UTF-8 <test.in >testi.out
cmp testi.out test.out && echo ok
ok
piconv -f UTF-8 -t UTF-16 <test.out >test2.out
cmp testi.in test2.out
test.in test2.out differ: byte 1, line 1

test.out was same size

p5pRT commented 13 years ago

From bug-Encode@rt.cpan.org

<URL: https://rt.cpan.org/Ticket/Display.html?id=73623>

On Fri Dec 30 18:44:35 2011, LAWALSH wrote:

On Fri Dec 30 17:49:01 2011, ikegami wrote:

Fix:

- my $need2slurp = $use_bom{ find_encoding($to)->name };
+ my $need2slurp = $use_bom{ find_encoding($from)->name };
+ if ($Opt{debug}){
+     printf "Read mode: %s\n", $need2slurp ? 'Slurp' : 'Line';
+ }

Partly works:
piconv -f UTF-16 -t UTF-8 <test.in >test.out
iconv -f UTF-16 -t UTF-8 <test.in >testi.out
cmp testi.out test.out && echo ok
ok
piconv -f UTF-8 -t UTF-16 <test.out >test2.out
cmp testi.in test2.out
    ^^^^ typo.. was 'test'...

Anyway, piconv doesn't do the round trip the way iconv does.

Sounds like it might be assuming UTF-16 means BE and not LE?

Just a WAG..

p5pRT commented 13 years ago

From bug-Encode@rt.cpan.org

<URL: https://rt.cpan.org/Ticket/Display.html?id=73623>

On Fri Dec 30 18:49:46 2011, LAWALSH wrote:

On Fri Dec 30 18:44:35 2011, LAWALSH wrote:

# piconv -f UTF-8 -t UTF-16 <test.out >test2.out
# cmp test.in test2.out
test.in test2.out differ: byte 1, line 1
test.out was same size

Sounds like it might be assuming UTF-16 means BE and not LE?


Yup:

cmp -l -b test.in test2.out
  1 377 M-^?  376 M-~
  2 376 M-~   377 M-^?
  3 127 W       0 ^@
  4   0 ^@    127 W
  5 151 i       0 ^@
...
671  12 ^J      0 ^@
672   0 ^@    134 \
cmp: EOF on test.in

p5pRT commented 13 years ago

From @ikegami

On Fri, Dec 30, 2011 at 6:44 PM, Linda A Walsh via RT <bug-Encode@rt.cpan.org> wrote:

<URL: https://rt.cpan.org/Ticket/Display.html?id=73623>

On Fri Dec 30 17:49:01 2011, ikegami wrote:

Fix:

- my $need2slurp = $use_bom{ find_encoding($to)->name };
+ my $need2slurp = $use_bom{ find_encoding($from)->name };
+ if ($Opt{debug}){
+     printf "Read mode: %s\n", $need2slurp ? 'Slurp' : 'Line';
+ }

Partly works:
piconv -f UTF-16 -t UTF-8 <test.in >test.out
iconv -f UTF-16 -t UTF-8 <test.in >testi.out
cmp testi.out test.out && echo ok
ok
piconv -f UTF-8 -t UTF-16 <test.out >test2.out
cmp testi.in test2.out
test.in test2.out differ: byte 1, line 1

C<< decode('UTF-16', ...) >> both requires a BOM and removes it (intentionally).

If you want to keep the BOM, use UTF-16le (the actual encoding) instead of UTF-16.

This is unrelated to this ticket.

- Eric

p5pRT commented 13 years ago

From @ikegami

On Fri, Dec 30, 2011 at 7:01 PM, Eric Brine <ikegami@adaelis.com> wrote:

On Fri, Dec 30, 2011 at 6:44 PM, Linda A Walsh via RT <bug-Encode@rt.cpan.org> wrote:

<URL: https://rt.cpan.org/Ticket/Display.html?id=73623>

On Fri Dec 30 17:49:01 2011, ikegami wrote:

Fix:

- my $need2slurp = $use_bom{ find_encoding($to)->name };
+ my $need2slurp = $use_bom{ find_encoding($from)->name };
+ if ($Opt{debug}){
+     printf "Read mode: %s\n", $need2slurp ? 'Slurp' : 'Line';
+ }

Partly works:
piconv -f UTF-16 -t UTF-8 <test.in >test.out
iconv -f UTF-16 -t UTF-8 <test.in >testi.out
cmp testi.out test.out && echo ok
ok
piconv -f UTF-8 -t UTF-16 <test.out >test2.out
cmp testi.in test2.out
test.in test2.out differ: byte 1, line 1

Correction/elaboration:

C<< decode('UTF-16', ...) >> both requires a BOM and removes it (intentionally).

...and C<< encode('UTF-16', ...) >> adds it back, but uses UTF-16be instead of UTF-16le.

You need C<< -to UTF-16le >> to use UTF-16le (instead of UTF-16be), but that won't add the BOM; you need to avoid removing it in the first place by using C<< -from UTF-16le >>.
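
For the record, the BOM asymmetry is easy to see directly from Encode; a small sketch (the sample text is an arbitrary choice) that reproduces the byte-1/byte-2 swap shown in the cmp output above:

  use strict;
  use warnings;
  use Encode qw(encode decode);

  sub first_two { join ' ', map { sprintf '%02X', ord } split //, substr($_[0], 0, 2) }

  # Build an LE-with-BOM byte string, then round-trip it through the
  # endian-agnostic 'UTF-16' name.
  my $le_with_bom = "\xFF\xFE" . encode('UTF-16LE', "Wide\n");
  my $chars       = decode('UTF-16', $le_with_bom);   # BOM consulted, then stripped
  my $roundtrip   = encode('UTF-16', $chars);         # BOM re-added, but big-endian

  printf "original  starts %s\n", first_two($le_with_bom);   # FF FE
  printf "roundtrip starts %s\n", first_two($roundtrip);     # FE FF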

- Eric

p5pRT commented 13 years ago

From bug-Encode@rt.cpan.org

<URL: https://rt.cpan.org/Ticket/Display.html?id=73623>

On Fri Dec 30 19:04:31 2011, ikegami@adaelis.com wrote:

On Fri, Dec 30, 2011 at 7:01 PM, Eric Brine <ikegami@adaelis.com> wrote:

On Fri, Dec 30, 2011 at 6:44 PM, Linda A Walsh via RT <bug-Encode@rt.cpan.org> wrote:

<URL: https://rt.cpan.org/Ticket/Display.html?id=73623>

On Fri Dec 30 17:49:01 2011, ikegami wrote:

Fix:

- my $need2slurp = $use_bom{ find_encoding($to)->name };
+ my $need2slurp = $use_bom{ find_encoding($from)->name };
+ if ($Opt{debug}){
+     printf "Read mode: %s\n", $need2slurp ? 'Slurp' : 'Line';
+ }

Partly works:
piconv -f UTF-16 -t UTF-8 <test.in >test.out
iconv -f UTF-16 -t UTF-8 <test.in >testi.out
cmp testi.out test.out && echo ok
ok
piconv -f UTF-8 -t UTF-16 <test.out >test2.out
cmp testi.in test2.out
test.in test2.out differ: byte 1, line 1

Sounds like it might be assuming UTF-16 means BE and not LE?

Yup:
cmp -l -b test.in test2.out
  1 377 M-^?  376 M-~
  2 376 M-~   377 M-^?

Correction/elaboration:

C<< decode('UTF-16', ...) >> both requires a BOM and removes it (intentionally).


How is that a correction??

...and C<< encode('UTF-16', ...) >> adds it back, but uses UTF-16be instead of UTF-16le.


Ah, then there are two rubs:

1)...why would encode convert to BE on an LE machine? Seems like exactly the wrong decision to make.

2) Since piconv states that it is "designed to be a drop in replacement for iconv" and "iconv seems to assume LE" (maybe it only does so on LE machines?)... then I would assert there is still a problem.

p5pRT commented 13 years ago

From @ikegami

On Fri, Dec 30, 2011 at 9:15 PM, Linda A Walsh via RT <bug-Encode@rt.cpan.org> wrote:

<URL: https://rt.cpan.org/Ticket/Display.html?id=73623>

How is that a correction??

I was correcting what *I* said.

1)...why would encode convert to BE on a LE machine?

What does Encode have to do with your machine?

2) Since piconv states that it is "designed to be a drop in replacement for iconv" and "iconv seems to assume LE" (maybe it only does so on LE machines?)... then I would assert there is still a problem.

Yes. Go ahead and file a bug if you want.

p5pRT commented 13 years ago

From bug-Encode@rt.cpan.org

<URL: https://rt.cpan.org/Ticket/Display.html?id=73623>

On Fri Dec 30 23:26:12 2011, ikegami@adaelis.com wrote:

On Fri, Dec 30, 2011 at 9:15 PM, Linda A Walsh via RT <bug-Encode@rt.cpan.org> wrote:

<URL: https://rt.cpan.org/Ticket/Display.html?id=73623>

How is that a correction??

I was correcting what *I* said.

1)...why would encode convert to BE on a LE machine?

What does Encode have to do with your machine?


That's where the test was run.

Data is usually in the machine's native format unless you are specifically trying to export it somewhere (like over the Net, where 'network byte order' is used).

2) Since piconv states that it is "designed to be a drop in replacement for iconv" and "iconv seems to assume LE" (maybe it only does so on LE machines?)... then I would assert there is still a problem.

Yes. Go ahead and file a bug if you want.


The original test case showed using iconv in 2 directions... for some reason the perlbug SW chopped that off... anything after the uuencoded file I included was chopped off... that had a whole explanation and demonstration of the bug using the above data file (above in the original bug report that seems to have been corrupted by perl's bug system).

The bug was that piconv didn't work as a drop-in for iconv: I took a simple doc, converted it to utf-8 and then back to utf-16 (with iconv), and the original and the twice-converted file compared identical.

I tried to do the same with piconv, but piconv failed at the first step.

Why the original bug report was truncated at the data point seems to be another bug in the perlbug reporting system.

Perhaps it would be better to report that one separately, as this one is still not fixed: the title, "perl's conversion fails when iconv succeeds", is still true. That's why I said 'closer', but not quite there.

p5pRT commented 13 years ago

From zefram@fysh.org

Linda A Walsh via RT wrote​:

Data is usually in the machines native format unless you are specifically trying to export it somewhere

That was the case in the 1980s. Times have changed; machines are more interconnected than they used to be.

-zefram

p5pRT commented 13 years ago

From perl-diddler@tlinx.org

Zefram via RT wrote​:

Linda A Walsh via RT wrote​:

Data is usually in the machines native format unless you are specifically trying to export it somewhere

That was the case in the 1980s. Times have changed; machines are more interconnected than they used to be. -zefram


  This has NOT changed. It was addressed in the 1980's.

If you are networking, you use network byte order. If you are doing processing on the same machine, you use native byte order.

To do otherwise is to incur horrible inefficiencies.

You can't do a string search on modern architectures USING their native instruction sets if you put data in an ALIEN format.

Intel has string compare assembly instructions that start at the beginning of a byte string and go from the start of the string, in low memory (even on BE machines -- which is one reason they fell out of favor: they were structurally flawed for string operations).

In the West, we read from left to right, so to list numbers we would put them in the order: 0 1 2 3 4 5 6 7 8 9 11 12. This is the same order that today's computers use: low (starting) memory on the left, and you place bytes into memory in human-readable order. If you look at memory, you would see 0 1 2 3 4 5 ... (or 30 20 31 20 33 20, where the 3x are the numbers and the 20s are the spaces).

On a BE machine you don't know what you will see, because the string is different depending on the word size used to store the string and the native word size of the machine. When the network standard was defined, only 32-bit BE machines were at all prevalent. So as numbers, if stored in network byte order (not necessarily BE order, as BE is always relative to the word size)...
I.e. if I packed them as an array at 16-bit intervals, I would see: 2 1 4 3 6 5 8 7 10 9 12 11. If I packed them into 32-bit words first, I'd see 4 3 2 1 8 7 6 5 12 11 10 9. If I packed them into a 64-bit word (we have 64-bit machines today), we'd see 8 7 6 5 4 3 2 1 0 0 0 0 12 11 10 9. If you packed them into a 1-byte array and looked at memory as bytes, you'd see the same order as you see in all 3 cases on an LE machine. That's the advantage, and likely the biggest reason why LE machines are dominant today. It doesn't matter if you pack them in as bytes, words, 32-bit DWORDs, or 64-bit QWORDs, the order is the same.

So a BE machine talking to another BE machine of the same word size may benefit by putting them in BE order, BUT the majority of computers used by consumers and in the IT world for processing are LE based. It makes no sense to default to a format that they can't use their native instruction set on without 'converting'.

You are choosing to deliberately create inefficiency for most of the world, to follow the example of the 1960s/1970s mainframes that are now extinct for a reason -- they didn't work well together, and each was specialized... Now we build things out of parts and build up, so a subroutine getting a parameter doesn't care if you passed it 1 byte on an 8080, or 2 bytes on an 8088, or 4 bytes on a 586, or 8 bytes on an x86_64 machine... ALL of those subroutines would work -- unmodified, since the low byte is always 1st in memory, and that's what they pay attention to.

With BE machines, no generation was compatible with the next, because the native byte order would be different at each word size.

Data over the internet for 4-byte or 16-byte addresses is in BE order because it makes sense for routing equipment that has to look at the high parts first for routing decisions, just like you look at (country)(province/state)(city)(street)(street addr)[subnumber]. It is most efficient for address parsing in network equipment to be able to look at the most significant parts first. But humans? And computers doing internal work? A human can never look at the 1st digit and make any sense of it. A computer can, only if it knows the length of the item coming in (which is usually the case in a language: a sub that takes a byte, a word, or whatever)...

Perl doesn't represent or store strings in memory on today's machines in BE order unless it is running on a BE machine. It is an error for conversion of characters to default to non-native order, as that's NEVER the _default_ internally -- only in the circumstance of some explicit specification would it use non-native format. It just doesn't make any sense.

On top of all of the above, piconv was supposed to be a drop-in replacement for iconv. It was meant to be a "demonstration of the unicode technology in Perl" -- it's a BIG FAIL if it doesn't generate the same output. It cannot be used as a drop-in replacement -- why? For the same reason why printf/sprintf aren't parallel in perl, or the same reason why there is duplicate code in the perl interpreter for "use" and "require", so that the case for "use" can "special case" (quirk) functionality to "disallow" the same logical functionality that "require" possesses.

It's not cute, and it's not just quirky; it's simply harmful to anyone who might want perl to evolve into something that didn't have so many odd and non-intuitive exceptions.

How is it that you would want a document in a word order that is alien to your machine (unless you are known to be exporting)? Why would you convert to a non-native format when the next most likely thing to do would be to process that document locally?

Was it a particular 'screw you to Microsoft'? Who was the first major vendor to define and use 16-bit values for Unicode, and who did so in 'LE'? Even Apple went to Intel (though who knows if they will stay with it)... but who uses a BE machine, and to whom is the current behavior/default useful?

p5pRT commented 13 years ago

From @ikegami

On Tue, Jan 3, 2012 at 6:09 PM, Linda Walsh <perl-diddler@tlinx.org> wrote:

If you are networking, you use network byte order. If you are doing processing on the same machine, you use native byte order.

To do otherwise is to incur horrible inefficiencies.

Reading UTF-16le:

  UV c;
  c = *(p++);
  c |= *(p++) << 8;

Reading UTF-16be:

  UV c;
  c = *(p++) << 8;
  c |= *(p++);

I don't see anything platform-dependent or any "horrible inefficiencies".
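
As an aside, a Perl-level version of the same observation (a sketch only; it assumes plain BMP text with no surrogate pairs, since UTF-16 proper is variable width, as noted later in this thread): either byte order is a single unpack template, and nothing about it depends on the host's own endianness.

  use strict;
  use warnings;

  my $le_bytes = "W\0i\0d\0e\0";   # "Wide" in UTF-16LE
  my $be_bytes = "\0W\0i\0d\0e";   # "Wide" in UTF-16BE

  # 'v*' reads 16-bit little-endian code units, 'n*' big-endian ones;
  # both are one pass over the buffer on any platform.
  my @from_le = unpack 'v*', $le_bytes;
  my @from_be = unpack 'n*', $be_bytes;

  print join('', map { chr } @from_le), "\n";   # Wide
  print join('', map { chr } @from_be), "\n";   # Wide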

- Eric

p5pRT commented 13 years ago

From perl-diddler@tlinx.org

Eric Brine wrote​:

On Tue, Jan 3, 2012 at 6:09 PM, Linda Walsh <perl-diddler@tlinx.org> wrote:

If you are networking, you use network byte order. If you are doing processing on the same machine, you use native byte order.

To do otherwise is to incur horrible inefficiencies.

Reading UTF-16le:

  UV c;
  c = *(p++);
  c |= *(p++) << 8;


Wouldn't your target be a buffer pointer? I.e. because you are converting from one buffer to another? (and I always get chided for not showing my work...)

So that, above, is really *c = *(p++) ... etc...

Except that if the count is large, or greater than 4 (normal case) on LE machines, you do 4 at a time and skip the shifts (<<) and ors (|):

If you are on a 64-bit machine, then if w=cc (1 word composed of 2 chars), d=ww, and q=dd, then to unpack your LE string you'd divide the length by 8, so on a 96 meg file like I was using, you'd do 12 million loops of 1 load and 1 store each; of course you'd make sure it was aligned on a 64-bit boundary... thus you incur no SIGBUSes that have to be handled in hardware and slow you down.

0 SIGBUS handles/loop (done in HW on Intel, but you can turn off the HW handling and have it take a SIGBUS for any non-aligned data, and you'll realize how much data is pushed around unaligned, taking at least twice as long just for the memory accesses, not counting the SIGBUS service time, even if it is in HW)...

1 load, 1 store, and 2 adds/loop * 12 million loops (96 meg data)

*(q++)=*((unsigned int)q++) for any count >=8; that's 1 assign,

vs.

*(c++) = *(p++) << 8;
*(c++) |= *(p++);
*(c++) = *(p++) << 8;
*(c++) |= *(p++);
*(c++) = *(p++) << 8;
*(c++) |= *(p++);
*(c++) = *(p++) << 8;
*(c++) |= *(p++);

at least 14 SIGBUS events/loop (1 will likely line up on each side, but 7/8 times they won't), vs. 8 loads and stores + 16 adds + 8 shifts, masks and ors. (The mask is implicit if you are using a character data type -- because it has to be loaded into a register from memory first and the top 24 or 56 bits (32/64 bit) have to be masked off to get you your byte. There might be more masks depending on the types, but let's just call it 1.) So 8 loads & stores, 16 adds/loop, 16 masks and 16 ORs. The ors, masks and adds are likely close to each other in speed (within an order of magnitude)... so 48 of those, plus the 8 loads & stores.

Well, so far we are at 8 times as many loads and stores (700% overhead), and 48 int-ops vs. 2, or 24 times as many (2300% overhead),

+ SIGBUS overhead... 14/loop... each penalizes a load/store at least by 2x (has to hit 2 memory positions),

so our 8 store/loads get penalized by a ***minimum*** (assuming 0 time to process the SIGBUS and just load memory) of an extra 14/loop, so that's really 22 vs 2 load-n-stores, or 11x; that's 1000% overhead...

so the 1000% + the int-ops' 2300% -> 3300% overhead/loop, or 35x slower.

I don't see anything platform-dependent or any "horrible inefficiencies".

You don't call a 35X slowdown, or 3300% overhead, 'horrible'?

Geez....

Might want to re-examine the bad code ...

Considering it has to be done for all the chars, just the 4x reduction in loop iterations would be a bonus, let alone removal of all those extraneous ops...

Being able to examine code like the above is a main reason why everyone should have a basic computer science education in this day and age, though a degree is helpful...

(though the market doesn't pay for it, 'cause they don't care about a 35x slowdown). Consumers can just wait...

p5pRT commented 13 years ago

From @ikegami

On Wed, Jan 4, 2012 at 4:34 PM, Linda Walsh <perl-diddler@tlinx.org> wrote:

Eric Brine wrote:

On Tue, Jan 3, 2012 at 6:09 PM, Linda Walsh <perl-diddler@tlinx.org> wrote:

If you are networking, you use network byte order. If you are doing processing on the same machine, you use native byte order.

To do otherwise is to incur horrible inefficiencies.

Reading UTF-16le:

  UV c;
  c = *(p++);
  c |= *(p++) << 8;

----

Wouldn't your target be a buffer pointer?

No. Perl doesn't use arrays of codepoints. Even if it did, it doesn't change anything anyway.

  // UTF-16le
  UV* c = ...;
  *c = *(p++);
  *(c++) |= *(p++) << 8;

is not any more efficient than

  // UTF-16be
  UV* c = ...;
  *c = *(p++) << 8;
  *(c++) |= *(p++);

Except that if the count is large, or greater than 4 (normal case) on LE machines, you do 4 at a time and skip the shifts (<<) and ors (|):

You can't do that for the first 0..3 characters because of alignment issues.

You can't do that for the last 0..3 characters because of boundary issues.

You can't do that since UTF-16 is a variable width format. (You are incorrectly creating two characters in the destination buffer where there is only one.)

*(q++)=*((unsigned int)q++) for any count >=8

Alignment error (not counting the missing "*").


This is the code. Note how UTF-16le ('v') is no faster than UTF-16be ('n').

static UV enc_unpack(pTHX_ U8 **sp, U8 *e, STRLEN size, U8 endian)
{
    U8 *s = *sp;
    UV v = 0;
    if (s+size > e) {
        croak("Partial character %c",(char) endian);
    }
    switch(endian) {
    case 'N':
        v = *s++;
        v = (v << 8) | *s++;
    case 'n':
        v = (v << 8) | *s++;
        v = (v << 8) | *s++;
        break;
    case 'V':
    case 'v':
        v |= *s++;
        v |= (*s++ << 8);
        if (endian == 'v')
            break;
        v |= (*s++ << 16);
        v |= (*s++ << 24);
        break;
    default:
        croak("Unknown endian %c",(char) endian);
        break;
    }
    *sp = s;
    return v;
}
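
For anyone who wants to time the two byte orders from Perl rather than reading the XS, here is a hedged Benchmark sketch (the test string and sizes are arbitrary); it simply decodes the same text in both byte orders and compares rates:

  use strict;
  use warnings;
  use Benchmark qw(cmpthese);
  use Encode qw(encode decode);

  # Roughly 1 MB of UTF-16 test data in each byte order.
  my $text = "The quick brown fox jumps over the lazy dog\n" x 20_000;
  my $le   = encode('UTF-16LE', $text);
  my $be   = encode('UTF-16BE', $text);

  # Run each decoder for at least 2 CPU seconds and compare.
  cmpthese(-2, {
      'UTF-16LE' => sub { my $s = decode('UTF-16LE', $le) },
      'UTF-16BE' => sub { my $s = decode('UTF-16BE', $be) },
  });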

p5pRT commented 13 years ago

From @paulg1973

At the risk of getting into a \ war, let me say that the posting by Ms. Walsh contains many statements that I believe are inaccurate. Anyone coming across that post in an archive in the future is advised to draw their own conclusions independent of her statements. I'd be grateful if she would cite sources. In my case, I lived through this history and can offer my personal knowledge.

1. "Big-Endian machines fell out of favor, in part, because they were structurally flawed for string operations."

This is just silly. There are plenty of examples of Big-Endian architectures that have efficient machine instructions for string operations. I personally worked on the Big-Endian Honeywell 6000 mainframes, which had a rich extended instruction set ("EIS") that handled decimal data (up to 59 digits of precision!) and character-string operations of almost indefinite length. But perhaps you are referring to the original IBM mainframe (System/360), unarguably the most successful mainframe of its day, if not of all time. While it is true that its support for character (aka "logical") data was fairly minimal, to say that it was structurally deficient is to use 2012 values to judge a 1965 design. It could move logical data, compare logical data, and edit decimal data into logical data; very clever stuff for its time. Big-Endian machines have in fact never fallen out of favor; there are still plenty of successful examples around. And those that did fall out of favor can't simply blame it on Big-Endian integers; there were plenty of good reasons from the business and management side of the house.

2. An implied assumption that little-endian machines have binary representations that are easier to read (say, in a dump of storage) than big-endian machines.

Again, this is silly. I spent years working on big-endian machines and had no trouble reading the dumps. I still find it lots easier to read a big-endian dump than a little-endian dump. In either case, you need a cheat-sheet that shows the storage layout. At least with Big-Endian formats you don't have to byte-swap the integers. I have yet to find a human that doesn't have to byte-swap all but the most trivial Little-Endian integer to figure out its decimal value.

3. It makes no sense to use a different endian than native, except in the network world.

I work on a very successful operating system (Stratus OpenVOS) in which all user data is big-endian, yet the underlying processor is little-endian. The byte swaps add a negligible overhead, thanks to clever optimizations by Intel. Our design makes perfect business and technical sense; it made the task of porting 25 years of source code from a big-endian processor (HP PA-RISC) to a little-endian processor (x86) a LOT easier for us and for our customers. I understand that Digital Equipment Corp played a similar trick when they ported FORTRAN from their Big-Endian 36-bit machines (PDP-10, PDP-20) to their Little-Endian 16-bit/32-bit machines (PDP-11).

4. The mainframes of the 60s and 70s are extinct because they were incompatible with each other.

I was there. They are extinct because they were hugely expensive, not terribly reliable, and over the years, we've invented much better products. They were incompatible with each other because (a) just making the darn things work was hard, (b) there was no economic incentive to make them compatible, (c) companies were reinventing the technology at a rapid clip and that required dropping the ideas that didn't work out (the IBM System/360 was not compatible with the IBM 7094). Also, (d) networking came along pretty late in the game. If you take the years after World War II as the start of modern-day computers, it took 25 years before computer-to-computer networking came into being (the ARPAnet). Networking before the ARPAnet consisted of sneakernet (carrying punch cards, tapes, or removable disks); worked great and was plenty fast enough for the day.

5. Data over the internet is in big-endian order because it is more efficient for routing.

I'd love to see your source for this comment. Again, I was there. My memory is that the established machines (of the late 60s and early 70s, when the ARPAnet was being invented) were Big-Endian. The upstarts were Little-Endian. I always thought that the designers just picked BE as the native format because that was the machine they were using at the time and were most familiar with. But I have only my memories for this, not a source.

In my opinion, standardization in the computer industry is the sign that innovation has ceased. Or to put it another way, the industry eventually decides that some piece of technology is good enough and there is no good reason to try to improve it. Eventually, some sort of paradigm shift happens and major changes blow away layers of technology, but until then, we have a big incentive to use the stuff that just works. Over my time in the business (late 1960s to today), we've gone from having no true standards, to a couple of fairly standard programming languages (COBOL and FORTRAN), to having a fairly standard operating system (Unix/Linux) with a fairly standard programming language (C), to having fairly standard scripting languages (Perl, PHP, et al). The GNU project has been a remarkable success at standardization and has driven out a lot of proprietary technology (anybody remember the tiny compiler companies that used to exist?). On the network side, we started with ftp, graduated to bulletin boards, upgraded to the web with HTML, and now have HTML5. We still have plenty of proprietary technologies (iOS, Windows, BlackBerry OS, to name but 3), but they must constantly battle to stay ahead of the march of fairly standard, commonly-available software (e.g., Android). We even have some fairly standard application packages now (think GIMP). This trend will continue.

\

PG

p5pRT commented 13 years ago

From perl-diddler@tlinx.org

Hi Paul, thank you for your well-written response.

It would take too much work to look for all the details to support every sentence I said, but I will address some of the specifics.

No need to flame, IMO, but some people like it hot.

Green, Paul wrote:

At the risk of getting into a \ war, let me say that the posting by Ms. Walsh contains many statements that I believe are inaccurate. Anyone coming across that post in an archive in the future is advised to draw their own conclusions independent of her statements. I'd be grateful if she would cite sources. In my case, I lived through this history and can offer my personal knowledge.

1. "Big-Endian machines fell out of favor, in part, because they were structurally flawed for string operations."


.... 59 digits of precision... um, at ~2.3 digits/char, that's about 25 whole chars of string length! WOW!.. and how many books will fit in that????

  How about a right shift in memory by a byte? memmove can handle it, and how well can a BE machine handle that? Each 32- or 16-bit word... or 60-bit word has to be unpacked, shifted internally, and propagated to the next word. It's a nightmare.

Thank you for making my point.

2. An implied assumption that little-endian machines have binary representations that are easier to read (say, in a dump of storage) than big-endian machines.

Now... that's highly dependent on the type of data. If it was *string* data, which is what we are talking about, organizing your dump to print from low on the left to high on the right, you see things in alphabetical order. On a BE machine, you'll not see a natural ordering because the strings will go DCBAHGFE instead of ABCDEFGH. If I'm looking for strings in non-reversed mode, I find abcdefgh ordering much easier to read than a byte-swapped order... of course with BE machines, you had multiple types.

You could have BADCFEHG, and today you'd more likely see HGFEDCBA. All of them played so nice together.

While a 16-bit 8086 or a 32-bit 386... no prob... the 8086 just uses 2 words, low+high aligned, and the 386 used 1 DWORD, no switching... they are inherently compatible with each other. If one side thinks it is passing an int, and the other side only returns 8 significant bits (like exit), no problem, -10 is -10 -- you pass -10 in a 32-bit register to a process expecting an 8-bit int, no problem! They are automatically compatible, but on a BE machine everything must match. You can't just look at the low bits and expect sanity. Sure, everything 'should be perfect anyway'; well, we know how well that expectation works.

On a BE machine, BUS errors were usually passed on to the program because you couldn't expect misaligned data to be read correctly, because reading an int from bytes 0-3 was very different than reading an int that was stored at a horrid offset 1-4 in a struct. It may run slower, but it has the flexibility to still run; on a BE machine, that's not an option.

Again, this is silly. I spent years working on big-endian machines and had no trouble reading the dumps.


  I'm sure; everyone learns their profession!

3. It makes no sense to use a different endian than native, except in the network world.

I work on a very successful operating system (Stratus OpenVOS) in which all user data is big-endian, yet the underlying processor is little-endian. The byte swaps add a negligible overhead, thanks to clever optimizations by Intel. Our design makes perfect business and technical sense; it made the task of porting 25 years of source code from a big-endian processor (HP PA-RISC) to a little-endian processor (x86) a LOT easier for us and for our customers. I understand that Digital Equipment Corp played a similar trick when they ported FORTRAN from their Big-Endian 36-bit machines (PDP-10, PDP-20) to their Little-Endian 16-bit/32-bit machines (PDP-11).

I know nothing about it. I can only say that it would be more efficient if it didn't have to byte-switch, ***BUT*** given the work required to convert it in software, you might never recoup the money spent changing the software to be native endian.

4. The mainframes of the 60s and 70s are extinct because they were incompatible with each other.

I was there. They are extinct because they were hugely expensive, not terribly reliable, and over the years, we've invented much better products.


  They were hugely expensive because they were all different -- i.e. not interchangeable, not commodity parts, couldn't substitute one for the other -- i.e. they were all incompatible. As PCs became dominant, it forced the incompatible makers into smaller and smaller niche markets. Even Apple finally folded and went with Intel. You are agreeing with me! But you put out processors from Intel, AMD, and a few others, and the binaries _can_ be portable between machines. You'd never expect that between BE machines... there were too many ways to be different. Too many ways to align that word, but with LE there's only 1 way, so already you've narrowed the compatibility tremendously.

They were incompatible with each other because (a) just making the darn things work was hard, (b) there was no economic incentive to make them compatible, (c) companies were reinventing the technology at a rapid clip and that required dropping the ideas that didn't work out (the IBM System/360 was not compatible with the IBM 7094). Also, (d) networking came along pretty late in the game.

All of those are also excellent reasons for expense and incompatibility. It's not just one reason, I would agree, and sorry if you feel I said that was the only reason.

If you take the years after World War II as the start of modern-day computers

Actually I wouldn't. The start of the first commercial computers was around the early 50s, but I would consider the integrated circuit to be the start of modern computing, since everything has 'sorta' been a shrink of that tech... and that was around '58-'59. It took 6 years from there for ARPA to fund its first network, and ARPANET was first used in '69...

5. Data over the internet is in big-endian order because it is more efficient for routing.

I'd love to see your source for this comment. Again, I was there. My memory is that the established machines (of the late 60s and early 70s, when the ARPAnet was being invented) were Big-Endian. The upstarts were Little-Endian. I always thought that the designers just picked BE as the native format because that was the machine they were using at the time and were most familiar with. But I have only my memories for this, not a source.


  I know... initially, I thought the same thing, and I was corrected by remembering my routing lessons -- especially when reading about routing for IPv6!
It may not have been much of a factor, but with longer addressing, except for those interested in auditing, you'll find fewer IPv6 routers needing to look at the full address in order to do their function. I can't imagine the same wasn't true on a smaller scale with smaller machines, but I didn't want to blame it all on the fact that larger machines were just more in vogue then! I was trying to give a little bit of credit to the design...??? But if what you say is true, that's only another reason to NOT use BE order -- and choosing to default to BE order is a sign of anachronistic programming from the 70s! How that could find its way into perl, which didn't exist back then, is beyond me!

In my opinion, standardization in the computer industry is the sign that innovation has ceased. Or to put it another way, the industry eventually decides that some piece of technology is good enough and there is no good reason to try to improve it.

Innovation ceases in an area when a local optimum has been reached. It becomes increasingly difficult to innovate when you are at or near a local optimum.

Eventually, some sort of paradigm shift happens and major changes blow away layers of technology,

That happens when someone finds a completely new way of doing something that creates (usually) a non-local optimum that is better than the current one. Thus the usually accompanying major upheaval. A whole chapter or three goes into this in "Artificial Intelligence: A Modern Approach" (Russell and Norvig); it's also a standard feature in game theory/design.

but until then, we have a big incentive to use the stuff that just works.


  Yeah... and in particular, MS, the dominant OS on the planet, and Intel, the dominant architecture, use LE, so why would someone put something in BE -- an ordering that has all but died out along with the architectures that used it? Why would anyone put something in a dead architecture's byte order by default, on an LE machine?

  Think of a string compare... -- you just increment a pointer on both... but on a BE machine, you have to unpack it to get the order right. Same with UTF16BE... guaranteed to have to unpack it to get the order right, but if it is UTF16LE, then as long as you don't run out of the first 64K, you can just use an incremental 16-bit compare. Code plane usage isn't that frequent with western chars (unfortunately, or MS would have them working better in Win7! a giant leap backward from XP for font support -- unbelievable!)...

\

That was supposed to be a flame? Naw... flames are when you set about to burn the other person... I didn't get that impression. Disagreeing? Common. But you weren't sufficiently rude, obnoxious or berating... you'll really have to work on that... ;-)

Oh, and in case I wasn't, your mamma wears army boots, so there. (gettin' into serious flamage here!) -linda