Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

Unicode readdir bugs #11513

Open p5pRT opened 13 years ago

p5pRT commented 13 years ago

Migrated from rt.perl.org#95160 (status was 'open')

Searchable as RT95160$

p5pRT commented 13 years ago

From tchrist@perl.com

I'm really rather unhappy with the what you see isn't what you get approach Perl is taking here.

Consider this:

    #!/usr/bin/env perl
    use v5.12;
    use utf8;
    use strict;
    use autodie;
    use warnings;
    binmode(STDOUT, ":utf8");
    binmode(STDERR, ":utf8");
    END { close STDOUT }
    my @στιγματα = qw( ΣΤΙΓΜΑΣ στιγμασ στιγμας );
    for my $στιγμα (@στιγματα) {
        my $fh;
        open $fh, "> :utf8", $στιγμα;
        say $fh "στιγμα";
        close $fh;
    }
    opendir(my $dh, ".");
    while (readdir($dh)) {
        say if /\P{ASCII}/;
    }
    closedir($dh);

Run on Linux, I get this nonsense:

    στιγμας
    στιγμασ
    ΣΤΙΓΜΑΣ

Run on Darwin, I get this, which is even worse:

    στιγμας
    ΣΤΙΓΜΑΣ

*Who* told Perl it was ok to let me blithely use wide characters in creat but then forbad me from using them in readdir? That's stupid. Perl should forbid unencoded wide characters in syscalls. It already does in syswrite. Why not here?

Yes, if I make my loop

    while (my $enc = readdir($dh)) {
        use Encode qw(decode);
        $_ = decode "UTF-8", $enc;
        say if /\P{ASCII}/;
    }

Then I get

    στιγμας
    στιγμασ
    ΣΤΙΓΜΑΣ

on Linux and

    στιγμας
    ΣΤΙΓΜΑΣ

on Darwin.

But that's nutty, and in several ways.

First off, Darwin's case-insensitive filesystem is an idiot, and doesn't work correctly. Notice how it is not doing casefolding correctly: it let me create two files whose names are casefolds of each other, even though all three are such.

But secondly and of greater importance, I should be able to do something like:

    binmode($dh, ":utf8");

or even

    opendir(my $dh, ":utf8", ".");

And not have to deal with this really really stupid encoding business.

Is there a reason that this is not a bug that should be fixed?

And don't even get me started about glob(). It's broken, too. Have fun with HFS+'s quasi-NFD filesystem, eh?
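To make the HFS+ remark concrete, here is a minimal sketch of why a decomposed (NFD-style) name does not compare equal to the name as typed. It assumes the names have already been decoded into character strings, and uses the core Unicode::Normalize module; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# The same visible name in two normalization forms.
my $typed  = "caf\x{E9}";     # precomposed: e-acute is a single code point
my $stored = NFD($typed);     # decomposed: "e" followed by U+0301

# Compared code point by code point, the two strings differ...
my $raw_equal = ($typed eq $stored) ? 1 : 0;

# ...so both sides are normalized before comparing.
my $nfc_equal = (NFC($typed) eq NFC($stored)) ? 1 : 0;

printf "raw: %d, normalized: %d, lengths: %d vs %d\n",
    $raw_equal, $nfc_equal, length($typed), length($stored);
```

A script that matches names from readdir or glob against names it built itself would need this kind of normalization on both sides.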

--tom

Summary of my perl5 (revision 5 version 14 subversion 0) configuration:

    Platform:
      osname=openbsd, osvers=4.4, archname=OpenBSD.i386-openbsd
      uname='openbsd chthon 4.4 generic#0 i386 '
      config_args='-des'
      hint=recommended, useposix=true, d_sigaction=define
      useithreads=undef, usemultiplicity=undef
      useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
      use64bitint=undef, use64bitall=undef, uselongdouble=undef
      usemymalloc=y, bincompat5005=undef
    Compiler:
      cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
      optimize='-O2',
      cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
      ccversion='', gccversion='3.3.5 (propolice)', gccosandvers='openbsd4.4'
      intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
      d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
      ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
      alignbytes=4, prototype=define
    Linker and Libraries:
      ld='cc', ldflags ='-Wl,-E -fstack-protector -L/usr/local/lib'
      libpth=/usr/local/lib /usr/lib
      libs=-lgdbm -lm -lutil -lc
      perllibs=-lm -lutil -lc
      libc=/usr/lib/libc.so.48.0, so=so, useshrplib=false, libperl=libperl.a
      gnulibc_version=''
    Dynamic Linking:
      dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
      cccdlflags='-DPIC -fPIC ', lddlflags='-shared -fPIC -L/usr/local/lib -fstack-protector'

Characteristics of this binary (from libperl):

    Compile-time options: MYMALLOC PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP
                          PERL_PRESERVE_IVUV USE_LARGE_FILES USE_PERLIO
                          USE_PERL_ATOF
    Built under openbsd
    Compiled at Jun 11 2011 11:48:28
    %ENV:
      PERL_UNICODE="SA"
    @INC:
      /usr/local/lib/perl5/site_perl/5.14.0/OpenBSD.i386-openbsd
      /usr/local/lib/perl5/site_perl/5.14.0
      /usr/local/lib/perl5/5.14.0/OpenBSD.i386-openbsd
      /usr/local/lib/perl5/5.14.0
      /usr/local/lib/perl5/site_perl/5.12.3
      /usr/local/lib/perl5/site_perl/5.11.3
      /usr/local/lib/perl5/site_perl/5.10.1
      /usr/local/lib/perl5/site_perl/5.10.0
      /usr/local/lib/perl5/site_perl/5.8.7
      /usr/local/lib/perl5/site_perl/5.8.0
      /usr/local/lib/perl5/site_perl/5.6.0
      /usr/local/lib/perl5/site_perl/5.005
      /usr/local/lib/perl5/site_perl
      .

p5pRT commented 13 years ago

From @ikegami

On Tue, Jul 19, 2011 at 2:39 PM, tchrist1 <perlbug-followup@perl.org> wrote:

> I'm really rather unhappy with the what you see isn't what you get approach Perl is taking here.
>
> Consider this:
>
>     #!/usr/bin/env perl
>     use v5.12;
>     use utf8;
>     use strict;
>     use autodie;
>     use warnings;
>     binmode(STDOUT, ":utf8");
>     binmode(STDERR, ":utf8");
>     END { close STDOUT }
>     my @στιγματα = qw( ΣΤΙΓΜΑΣ στιγμασ στιγμας );
>     for my $στιγμα (@στιγματα) {
>         my $fh;
>         open $fh, "> :utf8", $στιγμα;
>         say $fh "στιγμα";
>         close $fh;
>     }
>     opendir(my $dh, ".");
>     while (readdir($dh)) {
>         say if /\P{ASCII}/;
>     }
>     closedir($dh);
>
> Run on Linux, I get this nonsense:
>
>     στιγμας
>     στιγμασ
>     ΣΤΙΓΜΑΣ

Just like:

  - Input from STDIN must be decoded.
  - Output to STDOUT and STDERR must be encoded.

This applies:

  - Input from @ARGV and file names from builtins must be decoded.
  - File names passed to builtins must be encoded.

You can get away with not doing the fourth because you have an UTF-8 locale and C\ suffers from The Unicode Bug.
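As a rough illustration of the last two rules, the sketch below encodes a character string before handing it to open and decodes what readdir returns. It assumes a Unix-like system whose file names are UTF-8 byte sequences; the temporary directory and variable names exist only for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempdir);
use File::Spec;

my $dir  = tempdir(CLEANUP => 1);
my $name = "\x{3C3}\x{3C4}\x{3B9}\x{3B3}\x{3BC}\x{3B1}";   # "στιγμα" as characters

# Rule 4: encode the character string to bytes before it reaches open().
my $path = File::Spec->catfile($dir, encode('UTF-8', $name));
open(my $fh, '>', $path) or die "open: $!";
close($fh) or die "close: $!";

# Rule 3: decode the bytes that readdir() hands back.
opendir(my $dh, $dir) or die "opendir: $!";
my @names = map  { decode('UTF-8', $_) }
            grep { $_ ne '.' && $_ ne '..' } readdir($dh);
closedir($dh);

print "round-trip ok\n" if @names == 1 && $names[0] eq $name;
```

With both steps in place, the comparison is between two character strings, independent of how perl happens to store either one internally.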

- Eric

p5pRT commented 13 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 12 years ago

From @cpansprout

On Tue Jul 19 11:39:04 2011, tom christiansen wrote:

> I'm really rather unhappy with the what you see isn't what you get approach Perl is taking here.
>
> Consider this:
>
>     #!/usr/bin/env perl
>     use v5.12;
>     use utf8;
>     use strict;
>     use autodie;
>     use warnings;
>     binmode(STDOUT, ":utf8");
>     binmode(STDERR, ":utf8");
>     END { close STDOUT }
>     my @στιγματα = qw( ΣΤΙΓΜΑΣ στιγμασ στιγμας );
>     for my $στιγμα (@στιγματα) {
>         my $fh;
>         open $fh, "> :utf8", $στιγμα;
>         say $fh "στιγμα";
>         close $fh;
>     }
>     opendir(my $dh, ".");
>     while (readdir($dh)) {
>         say if /\P{ASCII}/;
>     }
>     closedir($dh);
>
> Run on Linux, I get this nonsense:
>
>     στιγμας
>     στιγμασ
>     ΣΤΙΓΜΑΣ
>
> Run on Darwin, I get this, which is even worse:
>
>     στιγμας
>     ΣΤΙΓΜΑΣ
>
> *Who* told Perl it was ok to let me blithely use wide characters in creat but then forbad me from using them in readdir? That's stupid. Perl should forbid unencoded wide characters in syscalls. It already does in syswrite. Why not here?

Almost all (if not all?) Perl functions that take file names have this problem. They all ignore the UTF8 flag.
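A small sketch of what "ignoring the UTF8 flag" means in practice: two strings that compare equal in Perl can have different internal buffers, and a function that passes the raw buffer to the OS sees different bytes. The `use bytes` block here is only a way to peek at the internal length for illustration; it is not how file name functions are implemented.

```perl
use strict;
use warnings;

my $plain    = "caf\xE9";     # four characters, stored as four bytes
my $upgraded = $plain;
utf8::upgrade($upgraded);     # same characters, stored as UTF-8 internally

# As Perl strings, the two are identical.
my $same_string = ($plain eq $upgraded) ? 1 : 0;

# What code that ignores the UTF8 flag would see: the raw buffer.
my ($plain_bytes, $upgraded_bytes);
{
    use bytes;
    $plain_bytes    = length($plain);
    $upgraded_bytes = length($upgraded);
}

print "equal as strings: $same_string\n";
print "internal lengths: $plain_bytes vs $upgraded_bytes\n";
```

Passed to open, the first string would name a file with an E9 byte and the second a file with the bytes C3 A9, even though `eq` says they are the same name.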

I would suggest we use a ‘Wide character’ warning, as we have for print and warn.

Then we also need a pragma to enable Unicode filenames in -e, open, readdir, chdir, etc.

What should we call it?

What do we do on systems on which file names *are* just octet sequences and nothing more? Make loading the pragma die? Make it warn? Do nothing?

Also, what about systems that support Unicode, but for which no one has had the time to implement this? (I’m not going to do VMS, for instance.)

p5pRT commented 12 years ago

From @ikegami

On Sun, Sep 18, 2011 at 8:40 PM, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:

> Then we also need a pragma to enable Unicode filenames in -e, open, readdir, chdir, etc.
>
> What should we call it?
>
> What do we do on systems on which file names *are* just octet sequences and nothing more? Make loading the pragma die? Make it warn? Do nothing?

File names are meant to be read as text, so one can't really claim they're just octet sequences. So the real question is what should we do when readdir encounters a file name that doesn't cleanly decode using the encoding it's expected to be encoded with (e.g. a file name that's not valid UTF-8 on a box with a UTF-8 locale).

> Also, what about systems that support Unicode, but for which no one has had the time to implement this? (I’m not going to do VMS, for instance.)

Like open -| with multiple args on Windows? Like use open :locale on Windows? Croak.

p5pRT commented 12 years ago

From @ap

* Eric Brine <ikegami@adaelis.com> [2011-09-19 03:20]:

> File names are meant to be read as text, so one can't really claim they're just octet sequences. So the real question is what should we do when readdir encounters a file name that doesn't cleanly decode using the encoding it's expected to be encoded with (e.g. a file name that's not valid UTF-8 on a box with a UTF-8 locale).

One could take a page from Python here and use its surrogate escape error handling. There was a subthread about it a while ago: http://www.nntp.perl.org/group/perl.perl5.porters/;msgid=A8767ACF-E6A0-498A-B402-54A12D26523B@activestate.com

What this approach effectively does is allow strings to unambiguously represent a mixture of bytes and characters, which in a roundabout way essentially solves the problem that Perl only has a single string type. But do note the later message about the security implications. It will take some thought to get this clean, but there is a lot of potential in it.

I love the idea and it is one of my todos to add this to Encode should no one else get there first. The core could then use this method to provide clean and nice interfaces to any OS APIs which are textual in intent but binary in practice – as Python does.

It would be a major step forward for Perl.
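The idea can be sketched in plain Perl. This is only an illustration of the mapping Python uses (each undecodable byte 0xNN becomes the lone surrogate U+DC00+0xNN), not a proposal for how Encode or the core would do it; `decode_escaping` and `encode_escaping` are hypothetical names invented for this example.

```perl
use strict;
use warnings;
no warnings 'utf8';    # lone surrogates are used deliberately below

# One well-formed UTF-8 sequence (no overlongs, no surrogates, max U+10FFFF).
my $valid = qr/
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Bytes -> characters; each undecodable byte becomes U+DC80..U+DCFF.
sub decode_escaping {
    my ($bytes) = @_;
    my $out = '';
    while ($bytes =~ /\G(?:((?:$valid)+)|(.))/gs) {
        if (defined $1) {
            my $chunk = $1;
            utf8::decode($chunk);             # a run of valid UTF-8
            $out .= $chunk;
        }
        else {
            $out .= chr(0xDC00 + ord($2));    # a stray byte, escaped
        }
    }
    return $out;
}

# Characters -> bytes; escaped bytes are restored exactly.
sub encode_escaping {
    my ($str) = @_;
    my $out = '';
    for my $ch (split //, $str) {
        my $ord = ord($ch);
        if ($ord >= 0xDC80 && $ord <= 0xDCFF) {
            $out .= chr($ord - 0xDC00);
        }
        else {
            my $enc = $ch;
            utf8::encode($enc);
            $out .= $enc;
        }
    }
    return $out;
}

my $raw  = "caf\xC3\xA9\xFFz";          # valid UTF-8 plus one stray 0xFF byte
my $text = decode_escaping($raw);
my $back = encode_escaping($text);
printf "%d characters, round-trip %s\n",
    length($text), ($back eq $raw ? "ok" : "FAILED");
```

Because encoded surrogates are themselves treated as invalid input here, a decoded string can always be turned back into the exact bytes it came from. Printing such a string, or handing it to code that expects real text, is where the security concerns mentioned above come in.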

Regards,
Aristotle Pagaltzis // <http://plasmasturm.org/>

p5pRT commented 12 years ago

From @cpansprout

On Tue Oct 04 09:02:13 2011, aristotle wrote:

> * Eric Brine <ikegami@adaelis.com> [2011-09-19 03:20]:
>
> > File names are meant to be read as text, so one can't really claim they're just octet sequences. So the real question is what should we do when readdir encounters a file name that doesn't cleanly decode using the encoding it's expected to be encoded with

If that happens, then it’s not really text, is it?

> > (e.g. a file name that's not valid UTF-8 on a box with a UTF-8 locale).

No, no, please don’t start using the locale to determine what the file names are. That would mean that a change to an environment variable would cause configuration files to start referring to other ‘nonexistent’ files (which exist when the locale is set correctly). We should *only* support Unicode file names when the file system itself has encoding information.

Mac OS X, for instance, stores the encoding in the file system (so each volume could theoretically use a different encoding), but the low-level drivers that read the volume translate everything to UTF-8. If you try to create a file whose name is an invalid UTF-8 sequence, you get an ‘Invalid argument’ error.

On the other hand, if we keep things completely consistent on a given platform (treat Linux as UTF-8, for instance, regardless of any environment settings), then we could follow Aristotle’s suggestion below for platforms that do not have an inherent file name encoding system.

Also, nobody has answered my question: What do we call the pragma? unicode::filenames? I suppose we need to make a list first of which functions will be affected, so here goes:

dbmopen -X chdir chmod chown chroot fcntl glob link lstat mkdir open opendir readlink rename rmdir stat symlink sysopen umask unlink utime do require use

Those are all file name functions.

But what about user and group names?

exec, system, syscall, readpipe, bind, connect, getsockopt, shmwrite and the various network functions (e.g., getservbyname) should produce ‘Wide character’ warnings. (Someone who understands non-ASCII domain names should speak up now.)

> One could take a page from Python here and use its surrogate escape error handling. There was a subthread about it a while ago: http://www.nntp.perl.org/group/perl.perl5.porters/;msgid=A8767ACF-E6A0-498A-B402-54A12D26523B@activestate.com
>
> What this approach effectively does is allow strings to unambiguously represent a mixture of bytes and characters, which in a roundabout way essentially solves the problem that Perl only has a single string type. But do note the later message about the security implications. It will take some thought to get this clean, but there is a lot of potential in it.
>
> I love the idea and it is one of my todos to add this to Encode should no one else get there first. The core could then use this method to provide clean and nice interfaces to any OS APIs which are textual in intent but binary in practice – as Python does.
>
> It would be a major step forward for Perl.
>
> Regards,

p5pRT commented 12 years ago

From @Hugmeir

On Sun\, Oct 23\, 2011 at 7​:23 PM\, Father Chrysostomos via RT \< perlbug-followup@​perl.org> wrote​:

On Tue Oct 04 09​:02​:13 2011\, aristotle wrote​:

* Eric Brine \ikegami@&#8203;adaelis\.com [2011-09-19 03​:20]​:

File names are meant to be read as text\, so one can't really claim they're just octet sequences. So the real question is what should we do when readdir encounters a file name that doesn't cleanly decode using the encoding it's expected to be encoded with

If that happens\, then it’s not really text\, is it?

(e.g. a file name that's not valid UTF-8 on a box with a UTF-8 locale).

No\, no\, please don’t start using the locale to determine what the file names are. That would mean that a change to an environment variable would cause configuration files to start referring to other ‘nonexistent’ files (which exist when the locale is set correctly). We should *only* support Unicode file names when the file system itself has encoding information.

Mac OS X\, for instance\, stores the encoding in the file system (so each volume could theoretically use a different encoding)\, but the low-level drivers that read the volume translate everything to UTF-8. If you try to create a file whose name is an invalid UTF-8 sequence\, you get an ‘Invalid argument’ error.

On the other hand\, if we keep things completely consistent on a given platform (treat Linux as UTF-8\, for instance\, regardless of any environment settings)\, then we could follow Aristotle’s suggestion below for platforms that do not have an inherent file name encoding system.

Also\, nobody has answered my question​: What do we call the pragma? unicode​::filenames? I suppose we need to make a list first of which functions will be affected\, so here goes​:

dbmopen -X chdir chmod chown chroot fcntl glob link lstat mkdir open opendir readlink rename rmdir stat symlink sysopen umask unlink utime do require use

Those are all file name functions.

But what about user and group names?

exec\, system\, syscall\, readpipe\, bind\, connect\, getsockopt\, shmwrite and the various network functions (e.g.\, getservbyname) should produce ‘Wide character’ warnings. (Someone who understands non-ASCII domain names should speak up now.)

(Reading the Python thread is still on my TODO list, so I'm not commenting on that yet)

There's a couple of things here being grouped as one. Ignoring require/use/do for a moment, most of those functions already have bug reports on them because, let me quote tchrist here,

> *Who* told Perl it was ok to let me blithely use wide characters in creat but then forbad me from using them in readdir? That's stupid. Perl should forbid unencoded wide characters in syscalls. It already does in syswrite.

So, first thing: Be like syswrite. -All- syscalls, except for say/print/printf/warn/die which already have exceptions, should croak if passed non-downgradeable scalars. This needn't be a backwards-incompatible nightmare -- save for exec and system, Classic::Perl could override them to do something like

    require Encode;
    *CORE::GLOBAL::rename = sub ($$) {
        Encode::_utf8_off($_[0]);   # drop the UTF8 flag, keep the internal bytes
        goto &CORE::rename;
    };

And there you go. You get Perl's previous ultralax behavior.
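For comparison, the strict side of that proposal might look roughly like the following. This is a sketch of the behaviour being asked for, written as an ordinary subroutine rather than a core change; `strict_rename` is a hypothetical name, and the error text merely imitates the existing syswrite message.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper: refuse names that cannot be represented as bytes.
sub strict_rename {
    my ($old, $new) = @_;
    for my $name ($old, $new) {
        # With a true second argument, utf8::downgrade returns false
        # instead of dying when a character is above 0xFF. On success
        # it downgrades $old or $new in place.
        utf8::downgrade($name, 1)
            or croak "Wide character in rename";
    }
    return rename($old, $new);
}

# A name containing U+03C3 cannot be downgraded, so this croaks
# before any system call is made.
my $ok  = eval { strict_rename("\x{3C3}.txt", "sigma.txt"); 1 };
my $err = $@;
print "refused: $err" unless $ok;
```

Callers who really want a non-ASCII name then have to encode it explicitly first, which is exactly the point of the proposal.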

Second, there should be a way to avoid doing an encode/decode on every syscall. Since I haven't read the Python thread yet I can't say much on this, but for a while I've had a open-like pragma for this in mind, eg

    use syscalls IN => ":encoding(...)", OUT => ":encoding(...)";

or

    use syscalls :dir => { IN => ":encoding(...)", OUT => ":encoding(...)" }

Or somesuch, which won't solve problems in, say, Windows, but hopefully it won't make them any worse. Then you could implement unicode::filenames as a wrapper around that, and if you want to grab that layer from a locale setting, that's entirely up to you (just don't ask me to debug it later).

Third, require/use/do. I recall Python having some problems with this (if the thread that I've neglected reading touches this, I apologize) -- And actually, I don't know any language that supports it without issues, though pointers are of course welcome. Zefram had a great idea for this a while ago -- If a module has Unicode in its path, it should get an alias, reachable through some escaping scheme or another. So if I had a module Eeyup::\x{30cb}::Bothersome, Bothersome.pm would be reachable through Eeyup/\x{30cb}/, and, failing that, unialias/Eeyup/130cb/

Here's the nicest thing -- I implemented 1 and a prototype of 2 in a couple of hours, so it's certainly doable, though I haven't touched that in a while because I can't figure out a way to test 2 portably.

p5pRT commented 12 years ago

From @cpansprout

On Sun Oct 23 18​:26​:45 2011\, Hugmeir wrote​:

There's a couple of things here being grouped as one. Ignoring require/use/do for a moment\, most of those functions already have bug reports on them because\, let me quote tchrist here\,

*Who* told Perl it was ok to let me blithely use wide characters in

creat but then forbad me from using them in readdir? That's stupid. Perl should forbid unencoded wide characters in syscalls. It already does in syswrite.

So\, first thing​: Be like syswrite. -All- syscalls\, sans for say/print/printf/warn/die which already have exceptions\, should croak if passed non-downgradeable scalars.

(Please, don’t put -deable at the end of a Latin-based word. :-) It’s ‘downgradable’.)

syswrite seems to be the odd one out. It’s probably using SvPVbyte. print, die, and warn just warn (i.e., warn chr 256 produces two warnings). It’s a default warning, though.
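The difference is easy to see side by side. A minimal sketch, assuming a handle with no encoding layer (the `binmode` call makes that explicit in case the environment sets one up by default):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($fh, $filename) = tempfile(UNLINK => 1);
binmode($fh);                 # make sure no :utf8 layer is active
my $wide = chr(256);

# print: warns ("Wide character in print") and carries on.
my @warnings;
{
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    print {$fh} $wide;
}

# syswrite: refuses outright by dying.
my $died  = eval { syswrite($fh, $wide); 1 } ? 0 : 1;
my $error = $@;

print "print warned:  ", scalar(@warnings), " time(s)\n";
print "syswrite died: ", ($died ? "yes" : "no"), "\n";
```

So the question in this part of the thread is which of the two behaviours the file name functions should copy.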

With the new pragma\, I would suggest fixing the Unicode bug for those functions when the pragma is off (with a warning and fallback). If that causes CPAN breakage\, then the new behaviour should be enabled with ‘use

Second\, there should be a way to avoid doing an encode/decode on every syscall. Since I haven't read the Python thread yet I can't say much on this\, but for a while I've had a open-like pragma for this in mind\, eg

use syscalls IN => "​:encoding(...)"\, OUT => "​:encoding(...)";

or

use syscalls :dir => { IN => "​:encoding(...)"\, OUT => "​:encoding(...)" }

Or somesuch\, which won't solve problems in\, say\, Windows\, but hopefully it won't make them any worse.

I think it would make things worse, as we would have yet another non-portable interface that is unusable as a result. In this case it’s not even portable between Unix systems, because it cannot be used correctly on Mac OS X, which forces file names on *all* Unix interfaces to be in UTF-8.

On the other hand we could provide it with lots of caveats in the documentation. Maybe it could be part of the same pragma.

Then you could implement unicode​::filenames as a wrapper around that\, and if you want to grab that layer from a locale setting\, that's entirely up to you (just don't ask me to debug it later).

Third\, require/use/do. I recall Python having some problems with this (if the thread that I've neglected reading touches this\, I apologize) -- And actually\, I don't know any language that supports it without issues\, though pointers are of course welcome. Zefram had a great idea for this a while ago -- If a module has Unicode in its path\, it should get an alias\, reachable through some escaping scheme or another. So if I had a module Eeyup​::\x{30cb}​::Bothersome\, Bothersome.pm would be reachable through Eeyup/\x{30cb}/\, and\, failing that\, unialias/Eeyup/130cb/

Here's the nicest thing -- I implemented 1 and a prototype of 2 in a couple of hours\, so it's certainly doable\, though I haven't touched that in a while because I can't figure out a way to test 2 portably.

It sounds like a nice idea at first\, but I worry about modules ‘disappearing’ depending on what pragma is enabled.

p5pRT commented 12 years ago

From @Hugmeir

On Sun\, Oct 23\, 2011 at 11​:44 PM\, Father Chrysostomos via RT \< perlbug-followup@​perl.org> wrote​:

On Sun Oct 23 18​:26​:45 2011\, Hugmeir wrote​:

There's a couple of things here being grouped as one. Ignoring require/use/do for a moment\, most of those functions already have bug reports on them because\, let me quote tchrist here\,

*Who* told Perl it was ok to let me blithely use wide characters in

creat but then forbad me from using them in readdir? That's stupid. Perl should forbid unencoded wide characters in syscalls. It already does in syswrite.

So\, first thing​: Be like syswrite. -All- syscalls\, sans for say/print/printf/warn/die which already have exceptions\, should croak if passed non-downgradeable scalars.

(Please\, don’t put -deable at the end of a Latin-based word. :-) It’s ‘downgradable’.)

But I like my half-broken english..! Fine :P

syswrite seems to be the odd one out. It’s probably using SvPVbyte. print\, die\, and warn just warn (i.e.\, warn chr 256 produces two warnings). It’s a default warning\, though.

That's true, but consider which one of those has the actually useful behavior. How many times have you gotten a "Wide character" warning that left you with mostly worthless output, and had to rerun things by adding the layers?

Also, how often do you actually want to pass the internal form of UTF-8 to system calls? I'm not saying it can't happen, but it's certainly not the common use case. On nearly every other occasion it's a bug that Perl isn't reporting, and a warning in this case is twice as useless.

With the new pragma\, I would suggest fixing the Unicode bug for those functions when the pragma is off (with a warning and fallback). If that causes CPAN breakage\, then the new behaviour should be enabled with ‘use

I don't think it would cause any more breakage than when the Fcntl constant subs became actual ()-prototyped constants. The only things that "broke" were already broken, but Perl wasn't reporting it.

(I'd have few qualms if this were triggered by a 'use VERSION;' though)

Second\, there should be a way to avoid doing an encode/decode on every syscall. Since I haven't read the Python thread yet I can't say much on this\, but for a while I've had a open-like pragma for this in mind\, eg

use syscalls IN => "​:encoding(...)"\, OUT => "​:encoding(...)";

or

use syscalls :dir => { IN => "​:encoding(...)"\, OUT => "​:encoding(...)" }

Or somesuch\, which won't solve problems in\, say\, Windows\, but hopefully it won't make them any worse.

I think it would make things worse\, as we would have yet another non-portable interface that is unusable as a result. In this case it’s not even portable between Unix systems\, because it cannot be used correctly on Mac OS X\, which forces file names on *all* Unix interfaces to be in UTF-8.

On the other hand we could provide it with lots of caveats in the documentation. Maybe it could be part of the same pragma.

Um, I'm not sure I follow. Isn't it as portable as the encode/decode calls that you are forced to use right now? If so yeah, that's pretty bad, but you can abstract that with something like

    use PerlIO::fse;
    use syscalls :all => ":fse";

Then you could implement unicode​::filenames as a wrapper around that\, and if you want to grab that layer from a locale setting\, that's entirely up to you (just don't ask me to debug it later).

Third\, require/use/do. I recall Python having some problems with this (if the thread that I've neglected reading touches this\, I apologize) -- And actually\, I don't know any language that supports it without issues\, though pointers are of course welcome. Zefram had a great idea for this a while ago -- If a module has Unicode in its path\, it should get an alias\, reachable through some escaping scheme or another. So if I had a module Eeyup​::\x{30cb}​::Bothersome\, Bothersome.pm would be reachable through Eeyup/\x{30cb}/\, and\, failing that\, unialias/Eeyup/130cb/

Here's the nicest thing -- I implemented 1 and a prototype of 2 in a couple of hours\, so it's certainly doable\, though I haven't touched that in a while because I can't figure out a way to test 2 portably.

It sounds like a nice idea at first\, but I worry about modules ‘disappearing’ depending on what pragma is enabled.

I was thinking in terms of redefining how the core itself looks for the modules, that is, change pp_require and friends. If it's implemented as pragmata, then your worries are spot-on and that could certainly be troublesome. More boilerplate for the boilerplate god?

p5pRT commented 12 years ago

From @cpansprout

On Sun Oct 23 21​:00​:09 2011\, Hugmeir wrote​:

On Sun\, Oct 23\, 2011 at 11​:44 PM\, Father Chrysostomos via RT \< perlbug-followup@​perl.org> wrote​:

On Sun Oct 23 18​:26​:45 2011\, Hugmeir wrote​: (Please\, don’t put -deable at the end of a Latin-based word. :-) It’s ‘downgradable’.)

But I like my half-broken english..! Fine :P

Please don’t think I’m trying to pick on you. I just see this misuse so often I thought maybe mentioning it once would give others a hint, too.

Generally, only the consonants c g k m v m z can have -eable after them, but there are exceptions.

(You don’t know how long I’ve been wanting to bring this up--but now I’m *way* off topic.)

syswrite seems to be the odd one out. It’s probably using SvPVbyte. print\, die\, and warn just warn (i.e.\, warn chr 256 produces two warnings). It’s a default warning\, though.

That's true\, but consider which one of those has the actually useful behavior. How many times have you gotten a "Wide character" warning that left you with mostly worthless output\, and had to rerun things by adding the layers?

Several hundred. But those were one-time one-liners.

Also\, how often do you actually want to pass the internal form of UTF- 8 to system calls? I'm not saying it can't happen\, but it's certainly not the common use case. On nearly every other occasion it's a bug that Perl isn't reporting\, and a warning in this case is twice as useless.

I think we need to warn, for backward-compatibility. I know there have been times that I relied on UTF-8 interfaces accepting Unicode strings, without even realising what I was doing. My code worked, after all. Then module upgrades broke things, but only every tenth time or so that the code ran, so it remained buggy a long time.

With the new pragma\, I would suggest fixing the Unicode bug for those functions when the pragma is off (with a warning and fallback). If that causes CPAN breakage\, then the new behaviour should be enabled with ‘use

I don't think it wouldn't cause any more breakage than when the Fcntl constants subs became actual ()-prototyped constants. The only things that "broke" were already broken\, but Perl wasn't reporting it.

That’s my thought\, but actual smoke reports tend to sway me quickly.

(I'd have little qualms if this were triggered by a 'use VERSION;' though)

Second\, there should be a way to avoid doing an encode/decode on every syscall. Since I haven't read the Python thread yet I can't say much on this\, but for a while I've had a open-like pragma for this in mind\, eg

use syscalls IN => "​:encoding(...)"\, OUT => "​:encoding(...)";

or

use syscalls :dir => { IN => "​:encoding(...)"\, OUT => "​:encoding(...)" }

Or somesuch\, which won't solve problems in\, say\, Windows\, but hopefully it won't make them any worse.

I think it would make things worse\, as we would have yet another non-portable interface that is unusable as a result. In this case it’s not even portable between Unix systems\, because it cannot be used correctly on Mac OS X\, which forces file names on *all* Unix interfaces to be in UTF-8.

On the other hand we could provide it with lots of caveats in the documentation. Maybe it could be part of the same pragma.

Um\, I'm not sure I follow. Isn't it as portable as the encode/decode calls that you are forced to use right now? If so yeah\, that's pretty bad\, but you can abstract that with something like

use PerlIO​::fse; use syscalls :all => "​:fse";

The whole point of the unicode::filenames pragma is to eliminate the need to have to specify encodings everywhere, at least as I envision it. After all, Windows, VMS and Mac OS X all have character sequences for file names. I think some FreeBSDs might, too, but I’m not sure. So your explicit encoding suggestion just seems like a can of worms to me, which will doubtless be misused in CPAN modules by those who don’t really understand the issues.

Then you could implement unicode​::filenames as a wrapper around that\, and if you want to grab that layer from a locale setting\, that's entirely up to you (just don't ask me to debug it later).

Third\, require/use/do. I recall Python having some problems with this (if the thread that I've neglected reading touches this\, I apologize) -- And actually\, I don't know any language that supports it without issues\, though pointers are of course welcome. Zefram had a great idea for this a while ago -- If a module has Unicode in its path\, it should get an alias\, reachable through some escaping scheme or another. So if I had a module Eeyup​::\x{30cb}​::Bothersome\, Bothersome.pm would be reachable through Eeyup/\x{30cb}/\, and\, failing that\, unialias/Eeyup/130cb/

Here's the nicest thing -- I implemented 1 and a prototype of 2 in a couple of hours\, so it's certainly doable\, though I haven't touched that in a while because I can't figure out a way to test 2 portably.

It sounds like a nice idea at first\, but I worry about modules ‘disappearing’ depending on what pragma is enabled.

I was thinking in terms of redefining how the core itself looks for the modules\, that is\, change pp_require and friends. If it's implemented as pragmata\, then your worries are spot-on and that could certainly be troublesome.

My initial train of thought was a little muddled. In any case\, if perl is to make multiple attempts to load the file\, using different methods\, ignoring any pragmata\, then that concern is irrelevant. But how many attempts should perl be making?

If some OSes use Aristotle’s approach\, then we only need *two* attempts\, and Zefram’s plan\, although it would have been wonderful if 5.8 had implemented it\, will have to be discarded.

There are already people using ‘use Mödule’ on OS X. We shouldn’t break their code.

More boilerplate for the boilerplate god?

???

p5pRT commented 12 years ago

From @khwilliamson

On 10/23/2011 10​:25 PM\, Father Chrysostomos via RT wrote​:

On Sun Oct 23 21​:00​:09 2011\, Hugmeir wrote​:

On Sun, Oct 23, 2011 at 11:44 PM, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:

On Sun Oct 23 18​:26​:45 2011\, Hugmeir wrote​: (Please\, don’t put -deable at the end of a Latin-based word. :-) It’s ‘downgradable’.)

But I like my half-broken english..! Fine :P

Please don’t think I’m trying to pick on you. I just see this misuse so often I thought maybe mentioning it once would give others a hint\, too.

Generally\, only the consonants c g k m v m z can have -eable after them\, but there are exceptions.

(You don’t know how long I’ve been wanting to bring this up--but now I’m *way* off topic.)

The macro UTF8_IS_DOWNGRADEABLE_START has been in the core since: df84a23b01be600297e1e5268d9351b807f107f6 Jarkko Hietaniemi <jhi@iki.fi> Wed, 31 Jan 2001

It's understandable that this spelling has become enshrined as valid. FWIW\, it's never bothered me\, a native English speaker.

p5pRT commented 12 years ago

From @cpansprout

On Mon Oct 24 07​:48​:27 2011\, public@​khwilliamson.com wrote​:

On 10/23/2011 10​:25 PM\, Father Chrysostomos via RT wrote​:

On Sun Oct 23 21​:00​:09 2011\, Hugmeir wrote​:

On Sun, Oct 23, 2011 at 11:44 PM, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:

On Sun Oct 23 18​:26​:45 2011\, Hugmeir wrote​: (Please\, don’t put -deable at the end of a Latin-based word. :-) It’s ‘downgradable’.)

But I like my half-broken english..! Fine :P

Please don’t think I’m trying to pick on you. I just see this misuse so often I thought maybe mentioning it once would give others a hint\, too.

Generally\, only the consonants c g k m v m z can have -eable after them\, but there are exceptions.

(You don’t know how long I’ve been wanting to bring this up--but now I’m *way* off topic.)

The macro UTF8_IS_DOWNGRADEABLE_START has been in the core since: df84a23b01be600297e1e5268d9351b807f107f6 Jarkko Hietaniemi <jhi@iki.fi> Wed, 31 Jan 2001

It's understandable that this spelling has become enshrined as valid. FWIW\, it's never bothered me\, a native English speaker.

I’m a native English speaker\, too\, and it bothers me whenever I see it\, just like ‘referer’.

p5pRT commented 12 years ago

From tchrist@perl.com

The macro UTF8_IS_DOWNGRADEABLE_START has been in the core since​: df84a23b01be600297e1e5268d9351b807f107f6

It's understandable that this spelling has become enshrined as valid. FWIW\, it's never bothered me\, a native English speaker.

I’m a native English speaker\, too\, and it bothers me whenever I see it\, just like ‘referer’.

Now you know how I feel about “numify”. :(

--tom

p5pRT commented 12 years ago

From @khwilliamson

On 10/24/2011 09​:37 AM\, Tom Christiansen wrote​:

The macro UTF8_IS_DOWNGRADEABLE_START has been in the core since​: df84a23b01be600297e1e5268d9351b807f107f6

It's understandable that this spelling has become enshrined as valid. FWIW\, it's never bothered me\, a native English speaker.

I’m a native English speaker\, too\, and it bothers me whenever I see it\, just like ‘referer’.

Now you know how I feel about “numify”. :(

--tom

numify rhymes (the way I pronounce it) with mummify\, which is what happens when you have some Académie dictating what goes into a language and what doesn't.

My grandmother (born 1885\, raised on a Wisconsin farm) hated the term 'kid' when applied to a human child instead of a goat. I found that surprising\, and when I look it up just now\, I see her meaning down the list\, and the 'human' meaning at the top.

I cringe when I hear 'less' when the 'proper' term is 'fewer'. I recently had occasion to use 'pluralize'; I cringed every time I wrote it\, but it got the job done.

We are powerless over the vicissitudes of English\, whose polyglot mutations are\, I believe\, a major reason why it has supplanted French as the required international language that everyone has to learn.

Vive le sandwich!

p5pRT commented 12 years ago

From tchrist@perl.com

My grandmother (born 1885\, raised on a Wisconsin farm)

How odd​: so was mine. 1919-2010.

--tom

p5pRT commented 12 years ago

From @ikegami

On Sun, Oct 23, 2011 at 9:26 PM, Brian Fraser <fraserbn@gmail.com> wrote:

Second\, there should be a way to avoid doing an encode/decode on every syscall. Since I haven't read the Python thread yet I can't say much on this\, but for a while I've had a open-like pragma for this in mind\, eg

use syscalls IN => "​:encoding(...)"\, OUT => "​:encoding(...)";

or

use syscalls :dir => { IN => "​:encoding(...)"\, OUT => "​:encoding(...)" }

When does it make sense to use two different encodings?

Are you saying that non-Windows system can't tell you which encoding it is using?

p5pRT commented 12 years ago

From @Leont

On Mon, Oct 24, 2011 at 10:07 PM, Eric Brine <ikegami@adaelis.com> wrote:

Are you saying that non-Windows system can't tell you which encoding it is using?

Most unices (pretty much all of them except OS X) do not have an inherent encoding at all. Filenames are blobs.

Leon

p5pRT commented 12 years ago

From tchrist@perl.com

Karl\, it isn't about shifting word-use. That's a red herring. Rather\, it's about either rank ignorance or willful disregard of the phonologic–orthographic texture of the *written* language.

That is not the way English has ever worked before in any existing precedent. Mummify\, mummification are the precedent you're looking for here\, *not* numen\, numina\, numinal\, numinous\, numinosity.

And somebody goofed. That doesn't make it right\, or good.

It's just like children who get catachrestically named Marybeht because their parents didn't know that you spell the theta sound with a th in English\, not with an ht.

Sure\, you can do it. You can do anything. But it looks stupid and it saddles the poor thing with a lifelong curse.

See also HTTP_REFERER.

--tom

p5pRT commented 12 years ago

From @ikegami

On Mon, Oct 24, 2011 at 4:12 PM, Leon Timmermans <fawaka@gmail.com> wrote:

On Mon, Oct 24, 2011 at 10:07 PM, Eric Brine <ikegami@adaelis.com> wrote:

Are you saying that non-Windows system can't tell you which encoding it is using?

Most unices (pretty much all of them except OS X) do not have an inherent encoding at all. Filenames are blobs.

Then how come I can read the file names in file selection dialogs on this Debian box?

p5pRT commented 12 years ago

From @ilmari

Eric Brine <ikegami@adaelis.com> writes:

On Mon, Oct 24, 2011 at 4:12 PM, Leon Timmermans <fawaka@gmail.com> wrote:

On Mon, Oct 24, 2011 at 10:07 PM, Eric Brine <ikegami@adaelis.com> wrote:

Are you saying that non-Windows system can't tell you which encoding it is using?

Most unices (pretty much all of them except OS X) do not have an inherent encoding at all. Filenames are blobs.

Then how come I can read the file names in file selection dialogs on this Debian box?

Because the toolkit assumes an encoding, usually UTF-8. See <http://www.gtk.org/api/2.6/glib/glib-Character-Set-Conversion.html#g-get-filename-charsets> for how GTK+ determines it.

-- ilmari "A disappointingly low fraction of the human race is\, at any given time\, on fire." - Stig Sandbeck Mathisen
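To make the "filenames are blobs, the toolkit assumes an encoding" point concrete, here is a minimal Perl sketch of the same strategy: assume UTF-8, but keep the raw bytes when that assumption fails. The helper name `decode_filename` is hypothetical and not part of any module mentioned in this thread; `decode`, `FB_CROAK` and `LEAVE_SRC` come from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

binmode(STDOUT, ":encoding(UTF-8)");

# Hypothetical helper: try to decode a raw file name as strict UTF-8.
# Returns (characters, 1) on success, or (original bytes, 0) if the
# name is not valid UTF-8, so nothing is silently mangled.
sub decode_filename {
    my ($bytes) = @_;
    my $chars = eval { decode("UTF-8", $bytes, FB_CROAK | LEAVE_SRC) };
    return defined $chars ? ($chars, 1) : ($bytes, 0);
}

opendir(my $dh, ".") or die "opendir failed: $!";
while (defined(my $raw = readdir $dh)) {
    my ($name, $is_text) = decode_filename($raw);
    if ($is_text) {
        print "text:  $name\n";
    }
    else {
        # Show undecodable names as escaped bytes rather than guessing.
        (my $escaped = $raw) =~ s/([^\x20-\x7E])/sprintf("\\x%02X", ord $1)/ge;
        print "bytes: $escaped\n";
    }
}
closedir($dh);
```

Returning the untouched bytes on failure matters because on a blob-based file system a name that is not valid UTF-8 is still a perfectly usable name, and only the original bytes can be passed back to open or unlink.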

p5pRT commented 12 years ago

From @ap

* Tom Christiansen <tchrist@perl.com> [2011-10-24 22:35]:

Sure\, you can do it. You can do anything. But it looks stupid and it saddles the poor thing with a lifelong curse.

See also HTTP_REFERER.

creat

p5pRT commented 12 years ago

From tchrist@perl.com

* Tom Christiansen <tchrist@perl.com> [2011-10-24 22:35]:

Sure\, you can do it. You can do anything. But it looks stupid and it saddles the poor thing with a lifelong curse.

See also HTTP_REFERER.

creat

creat was not caused by not knowing how to spell the word create.

But you're right that it is something its inventors came to regret having done.

--tom

p5pRT commented 12 years ago

From @Hugmeir

On Mon, Oct 24, 2011 at 1:25 AM, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:

That's true\, but consider which one of those has the actually useful behavior. How many times have you gotten a "Wide character" warning that left you with mostly worthless output\, and had to rerun things by adding the layers?

Several hundred. But those were one-time one-liners.

Also, how often do you actually want to pass the internal form of UTF-8 to system calls? I'm not saying it can't happen, but it's certainly not the common use case. On nearly every other occasion it's a bug that Perl isn't reporting, and a warning in this case is twice as useless.

I think we need to warn\, for backward-compatibility. I know there have been times that I relied on UTF-8 interfaces accepting Unicode strings\, without even realising what I was doing. My code worked\, after all. Then module upgrades broke things\, but only every tenth time or so that the code ran\, so it remained buggy a long time.

I don't think it would cause any more breakage than when the Fcntl constant subs became actual ()-prototyped constants. The only things that "broke" were already broken, but Perl wasn't reporting it.

That’s my thought\, but actual smoke reports tend to sway me quickly.

Actually\, how about a CPAN smoke of this? If the extent of the breakage is reasonable\, I'll personally send patches to all the affected modules : ) And as an added bonus\, even if the core doesn't change to croak\, it'll improve the overall robustness of CPAN!
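For illustration only, the croak-instead-of-warn behaviour discussed above can be approximated in user code today. This is a sketch, not what the core does: `bytes_or_croak` is a hypothetical helper name, and it relies on the documented behaviour of `utf8::downgrade` returning false when given a true FAIL_OK argument and a string that holds a code point above 0xFF.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper: return a byte-string copy of a file name, or croak
# if the name contains a code point above 0xFF and therefore cannot be a
# byte string at all (the caller forgot to encode it).
sub bytes_or_croak {
    my ($name) = @_;
    my $copy = $name;
    utf8::downgrade($copy, 1)
        or croak "Wide character in file name; encode it before the syscall";
    return $copy;
}

# Demo: only the name with Greek letters is rejected.
for my $name ("plain.txt", "caf\x{E9}.txt", "\x{3C3}\x{3C4}.txt") {
    my $ok = eval { bytes_or_croak($name); 1 };
    printf "%-9s %s\n", ($ok ? "accepted" : "rejected"), length($name) . " chars";
}
```

Note that a name such as "caf\x{E9}.txt" passes this check, because every character fits in a byte. The check therefore only catches definite mistakes and cannot tell an unencoded Latin-1-range string from intentional bytes.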

The whole point of the unicode​::filenames pragma is to eliminate the need to have to specify encodings everywhere\, at least as I envision it. After all\, Windows\, VMS and Mac OS X all have character sequences for file names. I think some FreeBSDs might\, too\, but I’m not sure. So your explicit encoding suggestion just seems like a can of worms to me\, which will doubtless be misused in CPAN modules by those who don’t really understand the issues.

Hm.. That's true enough. I was a bit wary of something automatically picking the fs encoding for me\, but then I noticed that the most common use case of a pragma that had you explicitly set the encodings would be to load a module to do exactly that! (e.g. the PerlIO​::fse example in my previous mail). Having that as the default seems reasonable. Though it would be swell if it provided a way to override those defaults.

(Would you consider calling it unicode​::syscalls or somesuch? :​:filenames implies it wouldn't affect\, say\, qx//)

My initial train of thought was a little muddled. In any case\, if perl is to make multiple attempts to load the file\, using different methods\, ignoring any pragmata\, then that concern is irrelevant. But how many attempts should perl be making?

If some OSes use Aristotle’s approach\, then we only need *two* attempts\, and Zefram’s plan\, although it would have been wonderful if 5.8 had implemented it\, will have to be discarded.

Yeah\, you are right. I don't think I fully understand Aristotle's proposal (though many thanks to him for taking time to explain it to me on IRC)\, but it seems pretty good. Now someone just has to write it : )

There are already people using ‘use Mödule’ on OS X. We shouldn’t break their code.

That probably won't work for the latin-1 range though, and the lack of normalization on our side, while the OS does it, is and will be troublesome. But personally, I was thinking of exempting use/require/do for the time being, for two main reasons: first, properly overriding/encoding those is non-trivial, and second, it's not an issue that should matter to people writing Perl. How perl finds its stuff should concern only (mostly) perl.
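The normalization problem mentioned here can be shown with a small sketch. It assumes a file system that hands names back in a decomposed form, as HFS+ on OS X is generally described as doing, while the source code spells the name with a precomposed character. `same_name` is a hypothetical helper; `NFC` is from the core Unicode::Normalize module.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Two spellings of the same visible name "Mödule".
my $composed   = "M\x{F6}dule";     # U+00F6, one precomposed code point
my $decomposed = "Mo\x{308}dule";   # "o" followed by U+0308 COMBINING DIAERESIS

# Hypothetical helper: compare names after normalizing both to NFC.
sub same_name {
    my ($x, $y) = @_;
    return NFC($x) eq NFC($y);
}

printf "raw eq:        %s\n", $composed eq $decomposed ? "match" : "no match";
printf "normalized eq: %s\n", same_name($composed, $decomposed) ? "match" : "no match";
```

A plain `eq` treats the two spellings as different strings, which is why a name read back from such a file system may fail to match the name the program originally used unless both sides are normalized first.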

More boilerplate for the boilerplate god?

???

Sorry\, in-joke.