Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

rename, chroot etc. ignore internal encoding #10623

Open p5pRT opened 14 years ago

p5pRT commented 14 years ago

Migrated from rt.perl.org#77798 (status was 'open')

Searchable as RT77798$

p5pRT commented 14 years ago

From perlbug@plan9.de

Created by perlbug@plan9.de

This snippet calls rename with two different paths, even though the same string is passed to rename.

  perl -e 'my $x = chr 200; rename $x,0; utf8::encode $x; rename $x,0'

The fact that the internal (basically invisible to a perl program) encoding changes should not change the semantics of I/O functions.

The solution is to use the equivalent of SvPVbyte, not SvPV, when passing paths (or other 8-bit data) to POSIX functions.

A cursory examination of pp_sys shows that at least backtick, open, dbmopen, sysopen, truncate, bind, setsockopt, getsockopt, getpeername, stat, chdir, chroot, link, readlink, mkdir, rmdir, opendir, system, exec, gethost*, getproto*, getserv* etc. are affected (I stopped looking).

All those functions silently throw away the crucial information of how bytes are encoded in a string. As modules and programs using Unicode become more common, this problem will become a major issue.

(When in doubt, it always helps to review the discussion about crypt(), which was fixed during 5.006 times.)

Perl Info ``` Flags: category=core severity=medium Site configuration information for perl 5.10.1: Configured by Marc Lehmann at Wed May 5 10:53:04 CEST 2010. Summary of my perl5 (revision 5 version 10 subversion 1) configuration: Platform: osname=linux, osvers=2.6.26-2-amd64, archname=amd64-linux uname='linux cerebro 2.6.26-2-amd64 #1 smp thu nov 5 02:23:12 utc 2009 x86_64 gnulinux ' config_args='-Duselargefiles -Dxxxxuse64bitint -Uuse64bitall -Dusemymalloc=n -Dcc=gcc -Dccflags=-ggdb -gdwarf-2 -g3 -Dcppflags=-DPERL_ARENA_SIZE=65536 -D_GNU_SOURCE -I/opt/include -Doptimize=-O6 -funroll-loops -fno-strict-aliasing -Dcccdlflags=-fPIC -Dldflags=-L/opt/perl/lib -L/opt/lib -Dlibs=-ldl -lm -lcrypt -Darchname=amd64-linux -Dprefix=/opt/perl -Dprivlib=/opt/perl/lib/perl5 -Darchlib=/opt/perl/lib/perl5 -Dvendorprefix=/opt/perl -Dvendorlib=/opt/perl/lib/perl5 -Dvendorarch=/opt/perl/lib/perl5 -Dsiteprefix=/opt/perl -Dsitelib=/opt/perl/lib/perl5 -Dsitearch=/opt/perl/lib/perl5 -Dsitebin=/opt/perl/bin -Dman1dir=/opt/perl/man/man1 -Dman3dir=/opt/perl/man/man3 -Dsiteman1dir=/opt/perl/man/man1 -Dsiteman3dir=/opt/perl/man/man3 -Dman1ext=1 -Dman3ext=3 -Dpager=/usr/bin/less -Uafs -Uusesfio -Uusenm -Uuseshrplib -Ud_dosuid -Dusethreads=undef -Duse5005threads=undef -Duseithreads=undef -Dusemultiplicity=undef -Demail=perl-binary@plan9.de -Dcf_email=perl-binary@plan9.de -Dcf_by=Marc Lehmann -Dlocincpth=/opt/perl/include /opt/include -Dmyhostname=localhost -Dmultiarch=undef -Dbin=/opt/perl/bin -Dxxxusedevel -DxxxDEBUGGING -Dxxxuse_debugging_perl -Dxxxuse_debugmalloc -des' hint=recommended, useposix=true, d_sigaction=define useithreads=undef, usemultiplicity=undef useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef use64bitint=define, use64bitall=undef, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='gcc', ccflags ='-ggdb -gdwarf-2 -g3 -fno-strict-aliasing -pipe -fstack-protector -I/opt/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-O6 
-funroll-loops -fno-strict-aliasing', cppflags='-DPERL_ARENA_SIZE=65536 -D_GNU_SOURCE -I/opt/include -ggdb -gdwarf-2 -g3 -fno-strict-aliasing -pipe -fstack-protector -I/opt/include' ccversion='', gccversion='4.3.2', gccosandvers='' intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16 ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=8, prototype=define Linker and Libraries: ld='gcc', ldflags ='-L/opt/perl/lib -L/opt/lib -fstack-protector -L/usr/local/lib' libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64 libs=-ldl -lm -lcrypt perllibs=-ldl -lm -lcrypt libc=/lib/libc-2.10.2.so, so=so, useshrplib=false, libperl=libperl.a gnulibc_version='2.10.2' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E' cccdlflags='-fPIC', lddlflags='-shared -O6 -funroll-loops -fno-strict-aliasing -L/opt/perl/lib -L/opt/lib -L/usr/local/lib -fstack-protector' Locally applied patches: @INC for perl 5.10.1: /root/src/sex /opt/perl/lib/perl5 /opt/perl/lib/perl5 /opt/perl/lib/perl5 /opt/perl/lib/perl5 . Environment for perl 5.10.1: HOME=/root LANG (unset) LANGUAGE (unset) LC_CTYPE=en_US.UTF-8 LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/root/s2:/root/s:/opt/bin:/opt/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/X11/bin:/usr/games:/usr/local/bin:/usr/local/sbin:/root/pserv:. PERL5LIB=/root/src/sex PERL5_CPANPLUS_CONFIG=/root/.cpanplus/config PERLDB_OPTS=ornaments=0 PERL_ANYEVENT_DBI_TESTS=1 PERL_ANYEVENT_EDNS0=1 PERL_ANYEVENT_NET_TESTS=1 PERL_ANYEVENT_PROTOCOLS=ipv4,ipv6 PERL_ANYEVENT_STRICT=1 PERL_BADLANG (unset) PERL_UNICODE=E SHELL=/bin/bash ```
p5pRT commented 14 years ago

From @ikegami

On Sun, Sep 12, 2010 at 5:15 AM, perlbug@plan9.de <perlbug-followup@perl.org> wrote:

> # New Ticket Created by perlbug@plan9.de
> # Please include the string: [perl #77798]
> # in the subject line of all future correspondence about this issue.
> # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=77798 >
>
> This is a bug report for perl from perlbug@plan9.de, generated with the help of perlbug 1.39 running under perl 5.10.1.
>
> -----------------------------------------------------------------
> [Please describe your issue here]
>
> This snippet calls rename with two different paths, even though the same string is passed to rename.
>
>   perl -e 'my $x = chr 200; rename $x,0; utf8::encode $x; rename $x,0'

$x and $x after utf8::encode($x) are not the same string. (They're not even the same length.)

But there is a bug here. $x after utf8::upgrade and $x after utf8::downgrade are the same string, but they're not treated as such.

  $ perl -e'$_=chr(0xE9); utf8::upgrade($_); rename "a",$_'
  $ perl -e'$_=chr(0xE9); utf8::downgrade($_); rename "b",$_'
  $ ls
  é  ?

> The solution is to use the equivalent of SvPVbyte, not SvPV, when passing

Correct.
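
The upgrade/downgrade difference can be seen without touching the file system. The following is a minimal sketch, not code from the ticket: it assumes only core Perl, and the variable names are illustrative. It shows that the two scalars compare equal as strings while their internal buffers differ, and those buffers are what an SvPV-based call hands to the OS.

```perl
use strict;
use warnings;

my $down = chr(0xE9);      # one character, U+00E9
my $up   = $down;

utf8::downgrade($down);    # stored internally as the single byte 0xE9
utf8::upgrade($up);        # stored internally as UTF-8: 0xC3 0xA9

# At the Perl level the two scalars hold the same string.
my $same_string = ($up eq $down && length($up) == length($down)) ? 1 : 0;

# "use bytes" exposes the length of the internal buffer, which is what an
# SvPV-based system call would pass on.
my $up_octets   = do { use bytes; length($up) };
my $down_octets = do { use bytes; length($down) };

print "same string: $same_string\n";
print "internal octets: upgraded=$up_octets, downgraded=$down_octets\n";
```

The `bytes` pragma is used here only to inspect the internal representation; it is not a suggested fix.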

p5pRT commented 14 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 13 years ago

From schmorp@schmorp.de

On Sun, Sep 12, 2010 at 01:23:42PM -0400, Eric Brine <ikegami@adaelis.com> wrote:

Sorry for the late reply, but, again, I never received your mail because it wasn't directed at me, so I just saw it "by accident" by looking at p5p.

> > This snippet calls rename with two different paths, even though the same string is passed to rename.
> >
> >   perl -e 'my $x = chr 200; rename $x,0; utf8::encode $x; rename $x,0'
>
> $x and $x after utf8::encode($x) are not the same string. (They're not even the same length.)

Yes, while condensing the testcase as much as possible I accidentally swapped upgrade with encode. In any case, the problem remains the same, namely perl ignoring the UTF-8 flag for many of its system interfaces, and the ExtUtils typemap, which breaks many XS modules.

--
The choice of a GNU generation
Deliantra, the free code+content MORPG: http://www.deliantra.net
Marc Lehmann <schmorp@schmorp.de>

FGasper commented 3 years ago

What if this were solved by creating a sysbinmode built-in that served the same purpose as binmode for filehandles?

That way Perl applications could set an I/O layer for rename et al. And Perl’s default would change to the same behaviour as filehandles—basically SvPVbyte.

Leont commented 3 years ago

> What if this were solved by creating a sysbinmode built-in that served the same purpose as binmode for filehandles?

What scope would that have?

FGasper commented 3 years ago

> What scope would that have?

Global, I guess? Could alternatively make it a pragma, e.g., use sysbinmode "UTF-8".

Leont commented 3 years ago

I think global would be wrong, because that means code can't make any assumptions of its own anymore. I immediately recall PHP code full of "if add_slashes is globally enabled do this, otherwise do that" code.

FGasper commented 3 years ago

@Leont Global, yes, feels wrong.

But if I could:

use sysbinmode ':utf8';

my $foo = "é";
exec 'echo', $foo;


 and have that auto-encode the same way binmode $fh, ':utf8' does, that would seem a reasonable fix?
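
For comparison, this is roughly what has to be written by hand without such a pragma. It is a sketch under assumptions: `os_arg` is a hypothetical helper name, UTF-8 is assumed to be the encoding the OS side expects, and the `system` call is left commented out so the example has no side effects.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper standing in for what a ':utf8' system layer would do.
sub os_arg {
    my ($text) = @_;
    return encode('UTF-8', $text);   # always returns a byte string
}

my $foo  = "\x{E9}";                 # "é" as a character string
my @argv = ('echo', os_arg($foo));

# system @argv;                      # would hand the encoded octets to echo

# Show the octets that would be passed, in hex.
print join(' ', map { sprintf '%02X', ord } split //, $argv[1]), "\n";
```

Encoding explicitly makes the result independent of whether `$foo` happened to be upgraded or downgraded internally.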

xenu commented 3 years ago

See also https://github.com/Perl/perl5/issues/17094#issuecomment-745762592 (the ticket is about win32, but tony's proposal is for all platforms).

FGasper commented 3 years ago

@xenu For myself, I actually want to go the other way: SvPVbyte rather than SvPVutf8.

FGasper commented 3 years ago

@Leont @xenu ^^ Thoughts on the above proof-of-concept?

ikegami commented 3 years ago

On unix systems, file names are composed of arbitrary bytes, with two having special meanings: 0x00, reserved to denote the end of the string, and 0x2F, the directory separator. ("/" is 0x2F in EBCDIC encodings too!) There's no guarantee of being UTF-8 or some other encoding, no matter what the locale says.

On Windows file systems, file names are sequences of arbitrary 16-bit values expected to be UTF-16le, but it's surely possible to have unmatched surrogates and invalid characters such as 0xFFFF.

If we want Perl to be able to round-trip any file name (e.g. readdir -> rename), there are two options.

  1. Return/accept arbitrary sequences of 8-bit values (unix) or 16-bit values (Windows), no matter how they are stored (upgraded or downgraded).

  2. Decode/encode returned/accepted file names (using locale in unix) in such a way that any sequence can be created. See this for an example of such a system.
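
Option 1 on unix can be approximated in pure Perl. This is a minimal sketch, assuming SvPVbyte-like semantics are wanted: `path_bytes` is a hypothetical helper, not an existing API. It yields the same octets for upgraded and downgraded input and refuses strings that contain characters above 0xFF.

```perl
use strict;
use warnings;

# Hypothetical helper: return the octets a SvPVbyte-style call would see,
# independent of the scalar's internal representation.
sub path_bytes {
    my ($path) = @_;               # work on a copy; the caller's scalar is untouched
    utf8::downgrade($path, 1)      # true second argument: return false instead of dying
        or die "Wide character in path\n";
    return $path;
}

my $down = chr(0xE9);
my $up   = chr(0xE9);
utf8::upgrade($up);

my $from_up   = path_bytes($up);
my $from_down = path_bytes($down);

# A character above 0xFF cannot be expressed as a single octet.
my $wide_ok = eval { path_bytes("\x{263A}"); 1 } ? 1 : 0;

print "same octets: ", ($from_up eq $from_down ? "yes" : "no"), "\n";
print "wide character accepted: ", ($wide_ok ? "yes" : "no"), "\n";
```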

Current status:

> use sysbinmode ':utf8';

I would love to see decoded file names (option 2 above), and a pragma would be required to do so, but having to provide an encoding is bad. The correct encoding should be used. The pragma could allow one to specify how errors are handled.

FGasper commented 3 years ago

This deals with the problem of upgraded/downgraded strings meaning different filesystem paths: https://metacpan.org/pod/Sys::Binmode

It doesn’t address Windows, but AFAIK it doesn’t worsen the Windows situation, either.