Open p5pRT opened 14 years ago
This snippet calls rename with two different paths, even though the same string is passed to rename.
perl -e 'my $x = chr 200; rename $x,0; utf8::encode $x; rename $x,0'
The fact that the internal (basically invisible to a perl program) encoding changes should not change semantics of I/O functions.
The solution is to use the equivalent of SvPVbyte, not SvPV, when passing paths (or other 8-bit data) to posix functions.
A cursory examination of pp_sys shows that at least backtick, open, dbmopen, sysopen, truncate, bind, setsockopt, getsockopt, getpeername, stat, chdir, chroot, link, readlink, mkdir, rmdir, opendir, system, exec, gethost*, getproto*, getserv* etc. are affected (I stopped looking).
All those functions silently throw away the crucial information of how bytes are encoded in a string. As modules and programs using Unicode become more common, this problem will become a major issue.
(When in doubt, it always helps to review the discussion about crypt() which was fixed during 5.006 times).
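To make the SvPV versus SvPVbyte distinction concrete, here is a minimal pure-Perl sketch. The helpers internal_octets and byte_octets are hypothetical names introduced only for illustration; they approximate what the two C macros would hand to a POSIX function and are not how pp_sys is written.

```perl
use strict;
use warnings;

# Hypothetical helper: the octets held in Perl's internal buffer,
# roughly what a SvPV-style read exposes.
sub internal_octets {
    my ($s) = @_;                            # works on a copy
    utf8::encode($s) if utf8::is_utf8($s);   # upgraded strings are stored as UTF-8
    return $s;
}

# Hypothetical helper: the octets a SvPVbyte-style read would yield.
sub byte_octets {
    my ($s) = @_;                            # works on a copy
    utf8::downgrade($s, 1)
        or die "Wide character: string has no byte representation\n";
    return $s;
}

my $down = chr 200;  utf8::downgrade($down);
my $up   = chr 200;  utf8::upgrade($up);

printf "same string:     %s\n", ($down eq $up ? "yes" : "no");
printf "internal, down:  %s\n", unpack("H*", internal_octets($down));
printf "internal, up:    %s\n", unpack("H*", internal_octets($up));
printf "byte view, down: %s\n", unpack("H*", byte_octets($down));
printf "byte view, up:   %s\n", unpack("H*", byte_octets($up));
```

The two strings compare equal, yet the internal view differs (c8 versus c388) while the byte view is c8 for both, which is why a byte-oriented read gives the same path either way.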
On Sun, Sep 12, 2010 at 5:15 AM, perlbug@plan9.de <perlbug-followup@perl.org> wrote:
# New Ticket Created by perlbug@plan9.de
# Please include the string: [perl #77798]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=77798 >
This is a bug report for perl from perlbug@plan9.de, generated with the help of perlbug 1.39 running under perl 5.10.1.
This snippet calls rename with two different paths, even though the same string is passed to rename.
perl -e 'my $x = chr 200; rename $x,0; utf8::encode $x; rename $x,0'
$x and $x after utf8::encode($x) are not the same string. (They're not even the same length.)
But there is a bug here. $x after utf8::upgrade and $x after utf8::downgrade are the same string, but they're not treated as such.
$ perl -e'$_=chr(0xE9); utf8::upgrade($_); rename "a",$_'
$ perl -e'$_=chr(0xE9); utf8::downgrade($_); rename "b",$_'
$ ls
é ?
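The difference between encoding and upgrading can be checked at the string level, without touching the filesystem. This is only an illustrative sketch; the variable names are arbitrary.

```perl
use strict;
use warnings;

my $orig = chr 200;

my $upgraded = $orig;
utf8::upgrade($upgraded);    # changes only the internal representation

my $encoded = $orig;
utf8::encode($encoded);      # replaces the character with its UTF-8 octets

printf "upgraded: length %d, equal to original: %s\n",
    length($upgraded), ($upgraded eq $orig ? "yes" : "no");
printf "encoded:  length %d, equal to original: %s\n",
    length($encoded), ($encoded eq $orig ? "yes" : "no");
```

The upgraded copy is still a one-character string equal to the original, while the encoded copy is a different, two-character string. Only the first case is the bug under discussion.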
The solution is to use the equivalent of SvPVbyte, not SvPV, when passing
Correct.
The RT System itself - Status changed from 'new' to 'open'
On Sun, Sep 12, 2010 at 01:23:42PM -0400, Eric Brine <ikegami@adaelis.com> wrote:
Sorry for the late reply, but, again, I never received your mail because it wasn't directed at me, so I just saw it "by accident" by looking at p5p.
This snippet calls rename with two different paths, even though the same string is passed to rename.
perl -e 'my $x = chr 200; rename $x,0; utf8::encode $x; rename $x,0'
$x and $x after utf8::encode($x) are not the same string. (They're not even the same length.)
Yes, while condensing the testcase as much as possible I accidentally swapped upgrade with encode. In any case, the problem remains the same, namely perl ignoring the UTF-8 flag for many of its system interfaces, and the ExtUtils typemap, which breaks many XS modules.
-- Marc Lehmann, schmorp@schmorp.de
What if this were solved by creating a sysbinmode built-in that served the same purpose as binmode for filehandles?
That way Perl applications could set an I/O layer for rename et al. And Perl's default would change to the same behaviour as filehandles, basically SvPVbyte.
What scope would that have?
Global, I guess? Could alternatively make it a pragma, e.g., use sysbinmode "UTF-8".
I think global would be wrong, because that means code can't make any assumptions of its own anymore. I immediately recall PHP code full of "if add_slashes is globally enabled do this, otherwise do that" code.
@Leont Global, yes, feels wrong.
But if I could:
use sysbinmode ':utf8';
my $foo = "é";
exec 'echo', $foo;
… and have that auto-encode the same way binmode $fh, ':utf8' does, that would seem a reasonable fix?
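Since sysbinmode is only a proposal and does not exist, the following sketch does by hand what such a pragma might automate: every argument is encoded to UTF-8 octets before it reaches the operating system. The helper utf8_args is a hypothetical name, and a child perl stands in for echo so the received octets can be inspected.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: encode each argument as UTF-8 octets before it is
# handed to the operating system.
sub utf8_args {
    return map { Encode::encode('UTF-8', $_) } @_;
}

my $foo = "\x{E9}";    # "é" as a one-character string

# A child perl prints its first argument as hex so the octets are visible.
my @cmd = ($^X, '-e', 'print unpack "H*", $ARGV[0]');
open my $child, '-|', @cmd, utf8_args($foo)
    or die "cannot start child: $!";
my $hex = do { local $/; <$child> };
close $child or die "child failed: $?";

print "child received octets: $hex\n";
```

Because the encoding step is explicit, the child sees the same octets whether $foo happens to be upgraded or downgraded internally.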
See also https://github.com/Perl/perl5/issues/17094#issuecomment-745762592 (the ticket is about win32, but tony's proposal is for all platforms).
@xenu For myself, I actually want to go the other way: SvPVbyte rather than SvPVutf8.
@Leont @xenu ^^ Thoughts on the above proof-of-concept?
On unix systems, file names are composed of arbitrary bytes, with two having specific values: 0x00, reserved to denote end of string, and 0x2F, the directory separator. ("/" is 0x2F in EBCDIC encodings too!) There's no guarantee of their being UTF-8 or some other encoding, no matter what the locale says.
On Windows file systems, file names are sequences of arbitrary 16-bit values expected to be UTF-16le, but it's surely possible to have unmatched surrogates and invalid characters such as 0xFFFF.
If we want Perl to be able to round-trip any file name (e.g. readdir -> rename), there are two options.
1. Return/accept arbitrary sequences of 8-bit values (unix) or 16-bit values (Windows), no matter how they are stored (upgraded or downgraded).
2. Decode/encode returned/accepted file names (using locale in unix) in such a way that any sequence can be created. See this for an example of such a system.
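The first option, treating names as opaque octets, can be sketched as a readdir -> rename round trip in a scratch directory. This assumes a unix-like filesystem; the file name and the .bak suffix are arbitrary choices for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# Create one file whose name contains non-ASCII octets
# (here the UTF-8 form of "é", written as raw bytes).
my $name = "\xC3\xA9";
open my $fh, '>', "$dir/$name" or die "create failed: $!";
close $fh;

# Read the name back and rename it without decoding or re-encoding anything.
opendir my $dh, $dir or die "opendir failed: $!";
my ($found) = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
closedir $dh;

rename "$dir/$found", "$dir/$found.bak" or die "rename failed: $!";

my $renamed_ok = -e "$dir/$found.bak" ? 1 : 0;
print "round trip ", ($renamed_ok ? "succeeded" : "failed"), "\n";
```

The round trip works here because the name returned by readdir is passed back untouched. It is only guaranteed as long as nothing upgrades the string on the way.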
Current status:
Unix: Returns/accepts arbitrary sequences of 8-bit values. This means that any file name can be round-tripped. However, the internal storage of the string is used rather than the string itself.
Windows: Returns/accepts the file name encoded using the system's Active/ANSI Code Page. Most file names can't be returned or accepted by Perl (without using modules instead of builtin functions).
I'm trying to get the ACP for Perl's process changed to 65001 (UTF-8) for Strawberry Perl. See the issue I raised.
use sysbinmode ':utf8';
I would love to see decoded file names (option 2 above), and a pragma would be required to do so, but having to provide an encoding is bad. The correct encoding should be used. The pragma could allow one to specify errors.
This deals with the problem of upgraded/downgraded strings meaning different filesystem paths: https://metacpan.org/pod/Sys::Binmode
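For comparison, here is a manual sketch of the same idea without the module: downgrade a copy of each path before handing it to a builtin. The helper path_bytes is a hypothetical name, this is not Sys::Binmode's implementation, and it assumes a filesystem that accepts arbitrary bytes in names (typical on Linux).

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: return a downgraded copy of a path, or die if the
# string contains characters above 0xFF and so has no byte form.
sub path_bytes {
    my ($path) = @_;
    utf8::downgrade($path, 1) or die "wide character in path\n";
    return $path;
}

my $dir = tempdir(CLEANUP => 1);

my $down = chr 0xE9;  utf8::downgrade($down);
my $up   = chr 0xE9;  utf8::upgrade($up);

# Create the file through the downgraded form of the name ...
open my $fh, '>', path_bytes("$dir/$down") or die "create failed: $!";
close $fh;

# ... and look it up through the upgraded form of the same string.
my $lookup    = path_bytes("$dir/$up");
my $same_file = -e $lookup ? 1 : 0;
print "upgraded form finds the file: ", ($same_file ? "yes" : "no"), "\n";
```

With the explicit downgrade, both forms of the string resolve to the same path, at the cost of having to remember the helper at every call site.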
It doesn't address Windows, but AFAIK it doesn't worsen the Windows situation, either.
Migrated from rt.perl.org#77798 (status was 'open')
Searchable as RT77798$