Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

utf8 problems (still, ) #12354

Closed p5pRT closed 12 years ago

p5pRT commented 12 years ago

Migrated from rt.perl.org#114602 (status was 'resolved')

Searchable as RT114602$

p5pRT commented 12 years ago

From @ribasushi

On Fri, Aug 31, 2012 at 12:55:29PM -0700, Linda W wrote:

Nicholas Clark via RT wrote​:

Having perl knowingly do the wrong thing (as it does now), or having it die altogether when it has a good idea of what the user wants -- cannot be called as something "serving backwards compatibility".

It's not *unambiguously* doing the wrong thing.

What sequence of octets should this output?

    perl -le 'print chr 233'

--- If the user has said, interpret this as a character -- which is what the code appears to do, it should print it out as code point \u00e9.

Hi Linda,

The user doesn't "say" anything when using chr(). Its behavior is hardcoded and, more importantly, it is not subject to change, because of the massive problems that would cause. To remove any ambiguity, let's try from a different angle. If I were to write a drop-in replacement for chr() based on pack(), it would look like this (sans the prototypes):

    sub chr {
      my $source = @_ ? $_[0] : $_;

      # some handwavy check for $source being a number

      if ($source < 0) {
        # handle negative numbers based on the current scope-state
        # handwave this away as well, as not relevant to discussion
      }
      elsif ($source < 256) {
        return pack 'C', $source;
      }
      else {
        return pack 'C0U', $source;
      }
    }

Please let us know if we are all on the same page as far as the *current behavior* is concerned. If we are, then you are more than welcome to make another attempt to make your case on why this behavior is wrong.

Cheers
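To make the "what sequence of octets" question concrete, here is a minimal sketch that is not part of the original message. It uses only the builtin chr() and an in-memory filehandle; the helper name `octets_written` is hypothetical. It assumes a stock perl with no `-C` switch, no `PERL_UNICODE` setting and no `open` pragma in effect, since any of those would change the default layers.

```perl
use strict;
use warnings;

# Print $char to an in-memory filehandle opened with the given layer string
# ('' for none, or e.g. ':encoding(UTF-8)') and return the octets that were
# actually written, as space-separated hex.
sub octets_written {
    my ($char, $layer) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer
        or die "cannot open in-memory handle: $!";
    print {$fh} $char;
    close $fh or die "close failed: $!";
    return join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
}

print octets_written(chr(233), ''), "\n";                  # E9
print octets_written(chr(233), ':encoding(UTF-8)'), "\n";  # C3 A9
```

Under those assumptions the same one-character string leaves the process as one octet or as two, depending only on the layer attached to the handle.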

p5pRT commented 12 years ago

From Mark@Overmeer.net

* Darin McBride (dmcbride@cpan.org) [120831 22:34]:

On Fri, Aug 31, 2012, at 14:58, Eric Brine wrote:

As for pipes... hmmm... character device?

    perl -MLWP::Simple -e 'getprint "http://cpan.metacpan.org/authors/id/R/RJ/RJBS/perl-5.16.1.tar.gz"' | \
      tar xvzf -

STDOUT is a pipe. But isn't spitting out characters.

Many "character devices" do transport characters, but the name is a misnomer. UNIX device drivers are split into two groups:

  1 - "character devices", where the bytes are always processed in order.
      It would be impractical for characters to get mixed up on the
      screen. Seeking back in a keyboard stream is also not possible.
      Tape drives belong here as well.

  2 - "block devices", which permit random access to the byte stream,
      for instance hard disks, whose blocks can be read and written in
      random order.

So, the same ambiguity applies here just as with files or sockets.

Pipes and sockets are not devices, but they come close: they behave similarly to character devices. A file is not a device either, but it lives on a "block device" and so inherits its random-access features.
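As an illustration that is not from the original message, the distinctions above can be probed from Perl with its file-test operators. This is a rough sketch; the function name `describe_handle` is hypothetical, and the answer says nothing about whether the bytes on the handle are text.

```perl
use strict;
use warnings;

# Classify a filehandle using Perl's file-test operators.
sub describe_handle {
    my ($fh) = @_;
    return 'terminal'         if -t $fh;   # a tty, itself a character device
    return 'character device' if -c $fh;
    return 'block device'     if -b $fh;
    return 'pipe'             if -p $fh;
    return 'socket'           if -S $fh;
    return 'regular file'     if -f $fh;
    return 'other';
}

# The result depends on how the script is started: a terminal when run
# interactively, a pipe when piped into another command, and so on.
printf "STDOUT: %s\n", describe_handle(\*STDOUT);
```

Running the script once directly and once piped into another command shows the classification change while the program itself stays the same.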

On Fri, Aug 31, 2012 at 3:55 PM, Linda W <perl-diddler@tlinx.org> wrote:

When perl writes to a character device, it encodes wide output into characters. When perl writes to a block device -- it doesn't.

No no no. Do not get confused by the misleading name of the device group. At the OS level, UNIX only has bytes: some of them mean something when sent to a screen driver, others are more useful when run as a program.

--
Regards,

  MarkOv

Mark Overmeer MSc, MARKOV Solutions
Mark@Overmeer.net, solutions@overmeer.net
http://Mark.Overmeer.net, http://solutions.overmeer.net

p5pRT commented 12 years ago

From perl-diddler@tlinx.org

Mark Overmeer via RT wrote:

On Fri, Aug 31, 2012 at 3:55 PM, Linda W <perl-diddler@tlinx.org> wrote:

When perl writes to a character device, it encodes wide output into characters. When perl writes to a block device -- it doesn't.

No no no. Do not get confused by the misleading name of the device group. At the OS level, UNIX only has bytes: some of them mean something when sent to a screen driver, others are more useful when run as a program.

It's not about the device name, but whether or not it's hooked up to STD<IN/OUT/ERR>.

Meaning if I really want a perl prog to do diskcopy w/STDIN/OUT, I'd better make sure I tell STDIN/OUT to use binary.

If you want binary from STDIN/STDOUT -- set a binary mode. Otherwise, I think it is more than fair to expect possible corruption -- as that's already the case on many OSes.
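The "set a binary mode" case can be sketched as follows. This is an illustration, not code from the thread: the function name `copy_raw` and the 64 KiB chunk size are arbitrary choices, and in-memory handles stand in for STDIN and STDOUT so the example runs on its own.

```perl
use strict;
use warnings;

# Copy everything from $in to $out as raw octets, with no translation.
sub copy_raw {
    my ($in, $out) = @_;
    binmode $in;    # no layers that would alter the bytes
    binmode $out;
    my $total = 0;
    while (1) {
        my $n = read $in, my $buf, 65536;
        die "read failed: $!" unless defined $n;
        last if $n == 0;                       # end of input
        print {$out} $buf or die "write failed: $!";
        $total += $n;
    }
    return $total;
}

# Demonstration with in-memory handles holding every possible octet value.
my $source = join '', map { chr($_) } 0 .. 255;
my $dest   = '';
open my $in,  '<', \$source or die "cannot open source: $!";
open my $out, '>', \$dest   or die "cannot open destination: $!";
my $copied = copy_raw($in, $out);
close $out or die "close failed: $!";
print "copied $copied bytes, ",
      ($dest eq $source ? 'identical' : 'corrupted'), "\n";
# prints: copied 256 bytes, identical
```

Used as a filter, the call would be `copy_raw(\*STDIN, \*STDOUT)`.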

If I do a pipe manually between processes -- using pipe -- no conversion -- if I'm using default STDIN/STDOUT... if I open my own file handle -- I need to specify.

Non-default locations, using explicit opens -- I wouldn't see as getting "default" encoding treatment. I'm trying to look only at the default simple case.

Does making that explicit abate or sufficiently minimize compatibility risks?
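For comparison, a program can already opt in to encoded standard output explicitly. The sketch below is not from the thread and the helper name `enable_utf8_output` is hypothetical. It attaches a UTF-8 encoding layer to STDOUT with binmode and reports the resulting layer stack on STDERR; the exact layer names reported may vary between perl versions.

```perl
use strict;
use warnings;
use PerlIO;

# Attach a UTF-8 encoding layer to a handle and return its output layers.
sub enable_utf8_output {
    my ($fh) = @_;
    binmode $fh, ':encoding(UTF-8)' or die "binmode failed: $!";
    return PerlIO::get_layers($fh, output => 1);
}

my @layers = enable_utf8_output(\*STDOUT);
print chr(233), "\n";                     # now written UTF-8 encoded
print STDERR "STDOUT layers: @layers\n";
```

The `-CS` command-line switch has a comparable effect on STDIN, STDOUT and STDERR without changing the program text.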