Perl / perl5


Silent encoding of filenames with UTF8 flag set #15305

Open p5pRT opened 8 years ago

p5pRT commented 8 years ago

Migrated from rt.perl.org#128083 (status was 'open')

Searchable as RT128083$

p5pRT commented 8 years ago

From @hakonhagland

According to perldoc "perlunicode", section: "When Unicode Does Not Happen"

https://metacpan.org/pod/distribution/perl/pod/perlunicode.pod#When-Unicode-Does-Not-Happen

Quote: "There are still many places where Unicode (in some encoding or another) could be given as arguments or received as results, or both in Perl, but it is not"

Then a set of interfaces is listed, including system() and mkdir(), where the above statement applies.

Still, it is my experience that the above statement is not strictly correct, in the sense that Perl will encode (as UTF-8) any Perl string with the UTF8 flag set that is passed to these interfaces. For example, consider system():

  use strict;
  use utf8;
  use warnings;
  use Encode ();

  # This sets the UTF8 flag on $str due to the "use utf8" pragma and makes
  # $str consist of one char with ordinal value E5
  my $str = 'å';

  # This clears the UTF8 flag on $str_utf8, makes it a binary string
  # of two bytes: C3 A5
  my $str_utf8 = Encode::encode_utf8( $str );

  # Argument to system(), UTF8 flag is set for $arg due to interpolation of $str
  my $arg = "echo -n '$str' | hexdump -C";

  # system() silently encodes $arg as UTF8
  system $arg;

  # Argument to system(), UTF8 flag is not set for $arg2
  my $arg2 = "echo -n '$str_utf8' | hexdump -C";

  # system() does nothing with $str_utf8 (since it has no UTF8 flag set)
  system $arg2;

The output of the above script is​:

  00000000  c3 a5  |..|
  00000002
  00000000  c3 a5  |..|
  00000002

This confirms that $arg was silently encoded as UTF-8 before being passed to /bin/sh. My concern is that this type of encoding seems to be undocumented (at least I have not found any reference to it in the docs), and I wonder what the official recommendation would be:

1. Always explicitly encode arguments passed to system(), mkdir(), chdir(), ..., or
2. It is not necessary to encode arguments; one can always trust that the arguments will be encoded correctly by the given function.

If 2) is the recommendation, then perhaps it should be documented somewhere (assuming I did not miss that part of the docs). The docs should mention that these interfaces will encode input arguments. However, note: I have come across one CPAN module that *requires* the user to encode the input argument, File::Find::Rule (and I therefore assume other such modules probably exist). So if 2) is recommended, there would be some "inconsistency", in the sense that mkdir $name would not require $name to be encoded, but File::Find::Rule->new->name( $name ) would require the user to encode $name first.
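As an illustration of option 1, here is a minimal sketch of encoding explicitly before handing a string to one of these interfaces. The helper name to_os_bytes is hypothetical (it is not part of Perl or Encode), and the sketch assumes the file system or shell expects UTF-8, which is not guaranteed on every platform.

```perl
use strict;
use warnings;
use utf8;
use Encode ();

# Hypothetical helper: turn a character string into UTF-8 bytes,
# dying on characters that cannot be encoded instead of guessing.
sub to_os_bytes {
    my ($text) = @_;
    return Encode::encode('UTF-8', $text, Encode::FB_CROAK | Encode::LEAVE_SRC);
}

my $name  = 'å';                  # one character, U+00E5
my $bytes = to_os_bytes($name);   # byte string, UTF8 flag off

print unpack('H*', $bytes), "\n";             # hex of the bytes that would be passed on
print utf8::is_utf8($bytes) ? "flag on\n" : "flag off\n";

# $bytes, not $name, would then be given to mkdir(), chdir(), system(), ...
```

With this approach the bytes reaching the operating system no longer depend on whether the original string happened to have the UTF8 flag set.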

Best regards,
Håkon Hægland

p5pRT commented 8 years ago

From @iabyn

On Fri, May 06, 2016 at 08:27:18AM -0700, Håkon Hægland wrote:

> # New Ticket Created by Håkon Hægland
> # Please include the string: [perl #128083]
> # in the subject line of all future correspondence about this issue.
> # <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=128083 >
>
> According to perldoc "perlunicode", section: "When Unicode Does Not Happen"
>
> https://metacpan.org/pod/distribution/perl/pod/perlunicode.pod#When-Unicode-Does-Not-Happen
>
> Quote: "There are still many places where Unicode (in some encoding or another) could be given as arguments or received as results, or both in Perl, but it is not"
>
> Then a set of interfaces are listed, including system() and mkdir(), where the above statement applies.
>
> Still, it is my experience that the above statement is not strictly correct, in the sense that Perl will encode (as UTF8) any Perl string with the UTF8 flag set that is input to these interfaces. For example, consider system() :
> [snip]
> I have not found any reference to it in the docs), and I wonder what the official recommendation would be:
>
> 1. Always encode explicitly arguments passed to system(), mkdir(), chdir(), ..., or
> 2. It is not necessary to encode arguments; one can always trust that the arguments will be encoded correctly by the given function.

Perl's system() etc. do not do any form of encoding. They just pass the physical bytes which make up the string directly to the underlying C library function as-is, without consideration as to whether the scalar's UTF8 flag is on or not. This:

  $s = "\x80";
  #utf8::upgrade($s);
  system "echo '$s' | hexdump -C";

outputs​:

  00000000 80 0a

while uncommenting the utf8::upgrade gives:

  00000000 c2 80 0a
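The same effect can be seen without a shell. The sketch below is only an illustration: internal_hex is a hypothetical helper, and it relies on Encode::_utf8_off, which the Encode documentation describes as an internals-level function, to expose the stored bytes of a scalar.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: hex dump of the bytes Perl stores for a scalar.
sub internal_hex {
    my ($copy) = @_;              # the copy keeps the caller's UTF8 flag
    Encode::_utf8_off($copy);     # drop the flag, leave the stored bytes alone
    return unpack 'H*', $copy;
}

my $s = "\x80";                   # one character, UTF8 flag off
my $t = $s;
utf8::upgrade($t);                # same character, now stored as UTF-8

print "same string\n" if $s eq $t;      # the two compare equal as strings
print internal_hex($s), "\n";           # stored bytes of the non-upgraded copy
print internal_hex($t), "\n";           # stored bytes of the upgraded copy
```

The two scalars compare equal as Perl strings, yet their stored bytes differ, and it is those stored bytes that reach the C library.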

-- 
Red sky at night - gerroff my land!
Red sky at morning - gerroff my land!
    -- old farmers' sayings #14

p5pRT commented 8 years ago

The RT System itself - Status changed from 'new' to 'open'