Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

Perl's open() has broken Unicode file name support #15883

Open p5pRT opened 7 years ago

p5pRT commented 7 years ago

Migrated from rt.perl.org#130831 (status was 'open')

Searchable as RT130831$

p5pRT commented 7 years ago

From @pali

Perl's open() function has broken processing of non-ASCII file names.

Look at these two examples:

$ perl -e 'open my $file, ">", "\N{U+FF}"'

$ perl -e 'open my $file, ">", "\xFF"'

The first one creates a file named 0xc3 0xbf (ÿ), the second one a file named 0xff.

And because the two strings "\N{U+FF}" and "\xFF" are equal, they must create the same file, not two different ones.

$ perl -e '"\xFF" eq "\N{U+FF}" && print "equal\n"'
equal

The bug is in the open() implementation, in PP(pp_open) in file pp_sys.c.

The file name is read from the perl scalar into a C char* as:

  tmps = SvPV_const(sv, len);

But after that, SvUTF8(sv) is *not* used to check whether the char* tmps is encoded in UTF-8 or Latin-1. tmps is passed directly to the do_open6() function without the SvUTF8 information.

So to fix this bug, we need to define how open() should process the filename: either as binary octets, in which case SvPVbyte() should be used instead of SvPV(), or as a Unicode string, in which case SvPVutf8() should be used instead of SvPV().

It also means we need to define what Perl_do_open6() should expect. Its file-name argument is of type const char *oname; it should be either binary octets or UTF-8.
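As a user-level illustration of the ambiguity (not a fix), pinning the scalar to a definite internal representation before the call makes the result deterministic; utf8::downgrade() is a core function:

  my $name  = "\N{U+FF}";
  my $bytes = $name;
  utf8::downgrade($bytes);            # same string, forced into the bytes (Latin-1) representation
  open my $fh, '>', $bytes or die $!; # now reliably creates the one-byte name 0xff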

There are basically two problems with this:

1) On some systems (e.g. on Linux) a file name can be an arbitrary sequence of bytes. It does not have to be a valid UTF-8 representation.

2) Perl modules probably already use perl Unicode scalars as arguments for file names.

Any decision should still allow opening any file on the VFS from 1), and probably should not break 2). And I'm not sure it is possible to have both 1) and 2) together.

The current state is worse, as both 1) and 2) are broken.

p5pRT commented 7 years ago

From @jkeenan

On Tue, 21 Feb 2017 20:58:03 GMT, pali@cpan.org wrote:

> Perl's open() function has broken processing of non-ASCII file names.
>
> Look at these two examples:
>
> $ perl -e 'open my $file, ">", "\N{U+FF}"'
>
> $ perl -e 'open my $file, ">", "\xFF"'
>
> The first one creates a file named 0xc3 0xbf (ÿ), the second one a file named 0xff.
>
> And because the two strings "\N{U+FF}" and "\xFF" are equal, they must create the same file, not two different ones.
>
> $ perl -e '"\xFF" eq "\N{U+FF}" && print "equal\n"'
> equal
>
> The bug is in the open() implementation, in PP(pp_open) in file pp_sys.c.
>
> The file name is read from the perl scalar into a C char* as:
>
>   tmps = SvPV_const(sv, len);
>
> But after that, SvUTF8(sv) is *not* used to check whether the char* tmps is encoded in UTF-8 or Latin-1. tmps is passed directly to the do_open6() function without the SvUTF8 information.
>
> So to fix this bug, we need to define how open() should process the filename: either as binary octets, in which case SvPVbyte() should be used instead of SvPV(), or as a Unicode string, in which case SvPVutf8() should be used instead of SvPV().
>
> It also means we need to define what Perl_do_open6() should expect. Its file-name argument is of type const char *oname; it should be either binary octets or UTF-8.
>
> There are basically two problems with this:
>
> 1) On some systems (e.g. on Linux) a file name can be an arbitrary sequence of bytes. It does not have to be a valid UTF-8 representation.
>
> 2) Perl modules probably already use perl Unicode scalars as arguments for file names.
>
> Any decision should still allow opening any file on the VFS from 1), and probably should not break 2). And I'm not sure it is possible to have both 1) and 2) together.
>
> The current state is worse, as both 1) and 2) are broken.

ISTR seeing a fair amount of discussion of this issue on #p5p. Would anyone care to summarize this discussion?

Thank you very much.

-- James E Keenan (jkeenan@​cpan.org)

p5pRT commented 7 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 7 years ago

From @pali

Some more information:

Windows has two sets of functions for accessing files: functions with an -A suffix, which take file names in the encoding of the current 8-bit codepage, and functions with a -W suffix, which take file names in Unicode (more precisely, in the Windows variant of UTF-16). With the -A functions it is possible to access only those files whose names contain only characters available in the current 8-bit codepage. Internally, all file names are stored in Unicode, so the -W functions must be used to be able to access any file name. Therefore, on Windows we need a Unicode file name in perl's open() function to have access to any file stored on disk.
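For illustration, the wide-string buffer that a -W function expects can be built from a perl string with the core Encode module (a sketch only; the actual WinAPI call is omitted):

  use Encode qw(encode);
  my $wname = encode('UTF-16LE', "\x{FF}.txt\0");  # NUL-terminated UTF-16 file name for a -W call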

Linux stores file names as binary octets; there is no encoding or requirement for Unicode. Therefore, to access any file on Linux, Perl's open() function should take a downgraded/non-Unicode file name.

Which means there is no way to have uniform multiplatform support for file access without hacks.

I'm thinking that for Linux we could specify some (hint) variable which would contain an encoding name (it could be hidden in some pragma module...). Then Perl's open() function could take a Unicode file name and convert it to that encoding. The default value for that variable could be taken from the locale, or default to UTF-8 (which is probably the most used and the sanest default).

This would allow us to have a uniform open() function which takes a Unicode file name on (probably) any platform. I think this is the only sane approach if Perl wants to support Unicode file names.

But a problem is how Perl's open() function is currently implemented. Does it expect bytes or a Unicode string?

p5pRT commented 7 years ago

From zefram@fysh.org

pali@cpan.org wrote:

> Which means there is no way to have uniform multiplatform support for file access without hacks.

Depends what you're trying to do "uniformly". If you want to be able to open any file, then each platform has an obvious way of representing any filename as a Perl string (as a full Unicode string on Windows and as an octet string on Unix), so using Perl strings for filenames could be a uniform interface. The format of filename strings does vary between platforms, but we already have such variation in the directory separators, and we have File::Spec to provide a uniform interface to it.

The thing that can't be done uniformly is to generate a filename from an arbitrary Unicode string in accordance with the platform's conventions. We could of course add a File::Spec method that attempts to do this, but there's a fundamental problem that Unix doesn't actually have a consistent convention for it. But this isn't really a big problem. We don't need to use arbitrary Unicode strings, that weren't intended to be filenames, as filenames. It's something to avoid: a lot of security problems have arisen from programs that tried to use arbitrary data strings in this way.

The strings that we should be using as filenames are strings that are explicitly specified by the user as filenames. The user, at runtime, can be expected to be aware of platform conventions and to supply locally-appropriate filenames.

> I'm thinking that for Linux we could specify some (hint) variable which would contain an encoding name (it could be hidden in some pragma module...).

Ugh. If the `hint' is lexically scoped, this loses as soon as a filename crosses a module boundary. If global, that would be saner; it's effectively part of the interface to the OS. But you then have a backcompat issue that you have to handle encoding failures in code paths that currently never generate exceptions. There's also a terrible problem with OS interfaces that return filenames (readdir(3), readlink(2), et al): you have to *decode* the filename, and if it doesn't decode then you've lost the ability to work with arbitrary existing files.

> The default value for that variable could be taken from the locale, or default to UTF-8 (which is probably the most used and the sanest default).

These are both crap as defaults. The locale's nominal encoding is quite likely to be ASCII, and both ASCII and UTF-8 are incapable of generating certain octet strings as output. Thus if filenames are subjected to either of these encodings then it is impossible for the user to specify some filenames that are valid at the syscall interface, and if such a filename actually exists then you run into the above-mentioned decoding problem. For example, the one-octet string "\xc0" doesn't decode as either ASCII or UTF-8. The only sane default, if you want to offer this encoding system, is Latin-1, which behaves as a null encoding on Perl octet strings.
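The "null encoding" property is easy to check with the core Encode module (a sketch):

  use Encode qw(encode decode);
  my $octets = "\xc0";                      # valid neither as ASCII nor as UTF-8
  my $str    = decode('Latin-1', $octets);  # never fails: one character per octet
  my $back   = encode('Latin-1', $str);     # the identical octets come back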

The trouble here really arises because the scheme effectively uses the encoding in reverse. Normally we use a character encoding to encode a character string as an octet string, so that we can store those octets and later read them to recover the original character string. With Unix filenames, however, the thing that we want to represent and store, which is the filename as it appears at the OS interface, is an octet string. The encoding layer, if there is one, is concerned with representing that octet string as a character string. An encoding that can't handle all octet strings is a problem, just as in normal circumstances a character encoding that can't handle all character strings is a problem. Most character encodings are just not designed to be used in reverse, and don't have a design goal of encoding to all octet strings or of decode-then-encode round-tripping.

> But a problem is how Perl's open() function is currently implemented. Does it expect bytes or a Unicode string?

The current behaviour is broken on any platform. To get to anything sane we will need a change that breaks some backcompat. In that situation we are not constrained by the present arrangement of the open() internals.

-zefram

p5pRT commented 7 years ago

From @Leont

On Mon, Feb 27, 2017 at 10:21 PM, pali@cpan.org wrote:

> Windows has two sets of functions for accessing files: functions with an -A suffix, which take file names in the encoding of the current 8-bit codepage, and functions with a -W suffix, which take file names in Unicode (more precisely, in the Windows variant of UTF-16). With the -A functions it is possible to access only those files whose names contain only characters available in the current 8-bit codepage. Internally, all file names are stored in Unicode, so the -W functions must be used to be able to access any file name. Therefore, on Windows we need a Unicode file name in perl's open() function to have access to any file stored on disk.
>
> Linux stores file names as binary octets; there is no encoding or requirement for Unicode. Therefore, to access any file on Linux, Perl's open() function should take a downgraded/non-Unicode file name.
>
> Which means there is no way to have uniform multiplatform support for file access without hacks.

Correct observations. Except OS X makes this more complicated still: it uses UTF-8 encoded bytes, normalized using a non-standard variation of NFD.

> I'm thinking that for Linux we could specify some (hint) variable which would contain an encoding name (it could be hidden in some pragma module...). Then Perl's open() function could take a Unicode file name and convert it to that encoding. The default value for that variable could be taken from the locale, or default to UTF-8 (which is probably the most used and the sanest default).
>
> This would allow us to have a uniform open() function which takes a Unicode file name on (probably) any platform. I think this is the only sane approach if Perl wants to support Unicode file names.

I would welcome a 'unicode_filenames' feature. I don't think any value other than binary is sane on Linux though. I think we learned from perl 5.8.0.

> But a problem is how Perl's open() function is currently implemented. Does it expect bytes or a Unicode string?

Both. Neither. Welcome to The Unicode Bug.

Leon

p5pRT commented 7 years ago

From @tonycoz

On Tue, 21 Feb 2017 12:58:03 -0800, pali@cpan.org wrote:

> So to fix this bug, we need to define how open() should process the filename: either as binary octets, in which case SvPVbyte() should be used instead of SvPV(), or as a Unicode string, in which case SvPVutf8() should be used instead of SvPV().
>
> It also means we need to define what Perl_do_open6() should expect. Its file-name argument is of type const char *oname; it should be either binary octets or UTF-8.

This sounds like something that could be prototyped on CPAN by replacing CORE::GLOBAL::open, CORE::GLOBAL::readdir etc.

Tony

p5pRT commented 7 years ago

From @pali

On Monday 27 February 2017 15:27:32 Zefram via RT wrote:

>> I'm thinking that for Linux we could specify some (hint) variable which would contain an encoding name (it could be hidden in some pragma module...).
>
> Ugh. If the `hint' is lexically scoped, this loses as soon as a filename crosses a module boundary. If global, that would be saner;

Yes, global. Ideally something which can be set when starting perl (e.g. a command-line parameter) or via an environment variable.

> it's effectively part of the interface to the OS.

Yes. And for this reason, modules should not normally change the value of that variable.

> But you then have a backcompat issue that you have to handle encoding failures in code paths that currently never generate exceptions. There's also a terrible problem with OS interfaces that return filenames (readdir(3), readlink(2), et al): you have to *decode* the filename, and if it doesn't decode then you've lost the ability to work with arbitrary existing files.

We can use the Encode::encode() function in non-croak mode, which replaces invalid characters with some replacement and throws a warning about it.

This could be the default behaviour, so that all those OS-related functions do not die. Maybe there could be some switch (a feature?) which changes the mode of the encode function to die, and new code can then handle and deal with it.
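For reference, the Encode API already provides both behaviours via its CHECK argument (a sketch; the target encoding and $filename are placeholders):

  use Encode qw(encode FB_DEFAULT FB_CROAK);
  my $lossy  = encode('ISO-8859-1', $filename, FB_DEFAULT); # substitutes unencodable characters
  my $strict = encode('ISO-8859-1', $filename, FB_CROAK);   # dies on anything unencodable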

>> The default value for that variable could be taken from the locale, or default to UTF-8 (which is probably the most used and the sanest default).
>
> These are both crap as defaults. The locale's nominal encoding is quite likely to be ASCII, and both ASCII and UTF-8 are incapable of generating certain octet strings as output.

It is not crap as a default. The locale encoding is what is currently used for such actions: it is what other applications use for converting multibyte characters into octets and vice versa.

So if your locale encoding is set to ASCII, then most applications are unable to print non-ASCII characters on your terminal.

But as there are many mappings from the Unicode space to bytes, and in some cases more than one is "correct" and in use, there is no single one that must be used. So whichever one you choose, you still get problems.

Therefore the locale encoding is what we can use, as it is the only piece of information we get from the operating system here.

> Thus if filenames are subjected to either of these encodings then it is impossible for the user to specify some filenames that are valid at the syscall interface, and if such a filename actually exists then you run into the above-mentioned decoding problem. For example, the one-octet string "\xc0" doesn't decode as either ASCII or UTF-8. The only sane default, if you want to offer this encoding system, is Latin-1, which behaves as a null encoding on Perl octet strings.

Latin-1 is not sane, as it is unable to handle Unicode strings with characters above U+0000FF. It is as wrong as ASCII or UTF-8.

> The trouble here really arises because the scheme effectively uses the encoding in reverse. [...] Most character encodings are just not designed to be used in reverse, and don't have a design goal of encoding to all octet strings or of decode-then-encode round-tripping.

If we want to handle any Unicode string created in perl and passed to Perl's open() function, we need to use some Unicode transformation function.

If we want to open an arbitrary file stored on disk (in bytes), then we need an encoding which maps the whole space of byte strings to some set of Unicode strings.

Both cannot be achieved at once. And even if there were such a function, it would still not be useful, as file names on disk are already stored in some encoding; the kernel just does not care about it and does not even know that encoding.

So the user or application (or library or system) must know in which encoding file names are stored. And this should be present in the current locale.

Therefore I suggest using the default encoding from the locale, with the ability to change it. If a user has stored files in a different encoding than specified in the locale, then that user already has problems handling such files in applications which use wchar_t, and probably already knows how to deal with them...

Either temporarily change the locale encoding, or pass some argument to perl (or an environment variable or a perl variable) to specify the correct one.

>> But a problem is how Perl's open() function is currently implemented. Does it expect bytes or a Unicode string?
>
> The current behaviour is broken on any platform. To get to anything sane we will need a change that breaks some backcompat. In that situation we are not constrained by the present arrangement of the open() internals.

We can define a new use feature 'unicode_filenames' or something like that, and then Perl's open() function can be "fixed".

p5pRT commented 7 years ago

From @pali

On Tuesday 28 February 2017 00:35:45 Leon Timmermans wrote:

>> Windows has two sets of functions for accessing files: functions with an -A suffix, which take file names in the encoding of the current 8-bit codepage, and functions with a -W suffix, which take file names in Unicode (more precisely, in the Windows variant of UTF-16). [...]
>>
>> Linux stores file names as binary octets; there is no encoding or requirement for Unicode. [...]
>>
>> Which means there is no way to have uniform multiplatform support for file access without hacks.
>
> Correct observations. Except OS X makes this more complicated still: it uses UTF-8 encoded bytes, normalized using a non-standard variation of NFD.

It is not a problem or a complicated issue. It just means that OS X also uses a Unicode API, the same as Windows; it just uses a different representation of Unicode, say an OS X variant of UTF-8. We have no problem generating the OS X representation from a perl string and vice versa. It just needs platform-specific code, the same as Windows needs for its variant of UTF-16.

>> I'm thinking that for Linux we could specify some (hint) variable which would contain an encoding name (it could be hidden in some pragma module...). [...]
>>
>> This would allow us to have a uniform open() function which takes a Unicode file name on (probably) any platform. I think this is the only sane approach if Perl wants to support Unicode file names.
>
> I would welcome a 'unicode_filenames' feature. I don't think any value other than binary is sane on Linux though. I think we learned from perl 5.8.0.
>
>> But a problem is how Perl's open() function is currently implemented. Does it expect bytes or a Unicode string?
>
> Both. Neither. Welcome to The Unicode Bug.

So it is time for a unicode_filenames feature, to fix that bug.

p5pRT commented 7 years ago

From zefram@fysh.org

pali@cpan.org wrote:

> We can use the Encode::encode() function in non-croak mode, which replaces invalid characters with some replacement

No, that fucks up the filenames. After such a substituting decode, re-encoding the result will produce some octet string different from the original. So if you read a filename from a directory, attempting to use that filename to address the file will at best fail because it's a non-existent name. (If you're unlucky then it'll address a *different* file.)
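The broken round trip is easy to show with Encode (a sketch; 'UTF-8' stands in for whatever encoding would be configured):

  use Encode qw(decode encode);
  my $raw  = "\xC0";                 # a filename byte as returned by readdir()
  my $name = decode('UTF-8', $raw);  # default CHECK substitutes U+FFFD
  my $back = encode('UTF-8', $name); # yields "\xEF\xBF\xBD", not the original "\xC0"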

> So if your locale encoding is set to ASCII, then most applications are unable to print non-ASCII characters on your terminal.

I don't follow your argument here. You don't seem to be addressing the crapness of making it impossible to deal with arbitrary filenames at the syscall interface.

> Latin-1 is not sane, as it is unable to handle Unicode strings with characters above U+0000FF. It is as wrong as ASCII or UTF-8.

My objective isn't to make every Unicode string represent a filename. My objective is to have every filename represented by some Perl string. Latin-1 would be a poor choice in situations where it is desired to represent arbitrary Unicode strings, but it's an excellent choice for the job of representing filenames. Different jobs have different requirements, leading to different design choices.

> So the user or application (or library or system) must know in which encoding file names are stored. And this should be present in the current locale.

Impossible. The locale model of character encoding (as you treat it here) is fundamentally broken. The model is that every string in the universe (every file content, filename, command line argument, etc.) is encoded in the same way, and the locale environment variable tells you which universe you're in. But in the real universe, files, filenames, and so on turn up encoded how their authors liked to encode them, and that's not always the same. In the real universe we have to cope with data that is not encoded in our preferred way.

The locale encoding is OK if one treats it strictly as a user *preference*. What one can do with such a preference without risking running into uncooperative data is quite limited.

> If a user has stored files in a different encoding than specified in the locale, then that user already has problems handling such files

I run in the C locale, which on this system has nominally ASCII encoding (which is in fact my preferred encoding), and yet I occasionally run into filenames that are derived from UTF-8 or Latin-1 encoding. Do you realise how much difficulty I have in dealing with such files? None at all. For my shell is 8-bit clean, and every program I use just passes the octet string straight through (e.g., from argv to syscalls). This is a healthy system.

The only programs I've encountered that have any difficulty with non-ASCII filenames are two language implementations (Rakudo Perl 6 and GNU Guile 2.0) that I don't use for real work. Both of them have decided, independently, that filenames must be encodings of arbitrary Unicode strings. Interestingly, they've reached different conclusions about what encoding is used: Guile considers it to be the locale's nominal encoding, whereas Rakudo reckons it's UTF-8 regardless of locale. (Rakudo is making an attempt to augment its concept of Unicode strings to be able to represent arbitrary Unicode strings in a way compatible with UTF-8, but that's not fully working yet, and I'm not convinced that it can ever work satisfactorily.) Don't make the same mistake as these projects.

> We can define a new use feature 'unicode_filenames' or something like that, and then Perl's open() function can be "fixed".

That would be a lexically-scoped effect, which (as mentioned earlier) loses as soon as a filename crosses a module boundary.

-zefram

p5pRT commented 7 years ago

From zefram@fysh.org

I wrote:

> (Rakudo is making an attempt to augment its concept of Unicode strings to be able to represent arbitrary Unicode strings in a way compatible with UTF-8,

Oops, I meant "arbitrary octet strings" there.

-zefram

p5pRT commented 7 years ago

From @pali

On Thursday 02 March 2017 04:23:35 Zefram via RT wrote:

> pali@cpan.org wrote:
>
>> We can use the Encode::encode() function in non-croak mode, which replaces invalid characters with some replacement
>
> No, that fucks up the filenames. After such a substituting decode, re-encoding the result will produce some octet string different from the original. So if you read a filename from a directory, attempting to use that filename to address the file will at best fail because it's a non-existent name. (If you're unlucky then it'll address a *different* file.)
>
>> So if your locale encoding is set to ASCII, then most applications are unable to print non-ASCII characters on your terminal.
>
> I don't follow your argument here. You don't seem to be addressing the crapness of making it impossible to deal with arbitrary filenames at the syscall interface.

Understood. As I wrote in my first email, we probably cannot have both the ability to access an arbitrary file and uniform access to files represented by perl Unicode strings.

>> Latin-1 is not sane, as it is unable to handle Unicode strings with characters above U+0000FF. It is as wrong as ASCII or UTF-8.
>
> My objective isn't to make every Unicode string represent a filename.

Basically, the output of ordinary applications is Unicode file names, not bytes, and that is what is shown to users.

Likewise, what a user enters into a file-open dialog or on console stdin is a sequence of key presses which represents characters (which map fully to Unicode), not a sequence of bytes.

Also, I want to create a file named "ÿ" with perl in the same way on Windows and Linux.

So to have a fixed open(), we need to be able to represent every perl Unicode string as a file name (with the possibility of failure if the underlying system is not able to store the given file name).

> My objective is to have every filename represented by some Perl string.

I understand... and in the current model with perl strings it is impossible.

> Latin-1 would be a poor choice in situations where it is desired to represent arbitrary Unicode strings,

Right!

> but it's an excellent choice for the job of representing filenames. Different jobs have different requirements, leading to different design choices.
>
>> So the user or application (or library or system) must know in which encoding file names are stored. And this should be present in the current locale.
>
> Impossible. The locale model of character encoding (as you treat it here) is fundamentally broken.

Yes, it is broken. But the problem is that it is used by system applications... :-(

> The locale encoding is OK if one treats it strictly as a user *preference*. What one can do with such a preference without risking running into uncooperative data is quite limited.
>
>> If a user has stored files in a different encoding than specified in the locale, then that user already has problems handling such files
>
> I run in the C locale, which on this system has nominally ASCII encoding (which is in fact my preferred encoding), and yet I occasionally run into filenames that are derived from UTF-8 or Latin-1 encoding. Do you realise how much difficulty I have in dealing with such files? None at all. For my shell is 8-bit clean, and every program I use just passes the octet string straight through (e.g., from argv to syscalls). This is a healthy system.

Probably some programs, like "ls", are not able to print UTF-8 encoded file names on your terminal...

> The only programs I've encountered that have any difficulty with non-ASCII filenames are two language implementations (Rakudo Perl 6 and GNU Guile 2.0) that I don't use for real work. [...] Don't make the same mistake as these projects.
>
>> We can define a new use feature 'unicode_filenames' or something like that, and then Perl's open() function can be "fixed".
>
> That would be a lexically-scoped effect, which (as mentioned earlier) loses as soon as a filename crosses a module boundary.

We need to store the "unicode filename" information in the perl scalar itself, and make sure it won't be lost when doing assignment or other string operations...

Another idea:

Can't we create a new kind of magic, like for vstrings, which would contain additional information for a file name? Functions like readdir() could create such magic scalars, and when one is passed to open() it would be handled correctly. And like a vstring it could contain some string representation in the PV slot, so it would be possible to pass such a scalar to print/warn or to any XS function which is not aware of that new magic. The magic property could store platform/system dependent settings, like which encoding is used.

This could fix the problem of accessing an arbitrary file: you just compose a magic scalar (maybe via some function or pragma) in the system-dependent representation and then pass it to open(). And it would also fix the problem of passing any Unicode file name: you compose a normal perl Unicode string and, based on some settings, it would be converted by open() to the system-dependent representation. open() would first try to use the magic properties, and if they are not present it would fall back to Encode on the content of the string. Maybe usage of Encode would need to be enabled globally (or locally).

Is this usable? Or are there problems with it too?
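A rough pure-Perl approximation of the idea, modelled as an object rather than real scalar magic (all names hypothetical):

  package Hypothetical::Filename;
  use overload '""' => sub { $_[0]->{display} }, fallback => 1;

  sub new {
      my ($class, %args) = @_;
      # display: printable form for print/warn; os_name: exact platform representation
      return bless { display => $args{display}, os_name => $args{os_name} }, $class;
  }

  sub os_name { $_[0]->{os_name} }   # what an aware open() would hand to the OS

An aware open() would use os_name() (or the magic property) directly and skip any re-encoding; unaware code would only ever see the printable stringification.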

p5pRT commented 7 years ago

From zefram@fysh.org

pali@cpan.org wrote:

> On Thursday 02 March 2017 04:23:35 Zefram via RT wrote:
>
>> My objective is to have every filename represented by some Perl string.
>
> I understand... and in the current model with perl strings it is impossible.

No, it *is* possible, and easy. What's not possible is to do that and simultaneously achieve your other goal of having almost all Unicode strings represent some filename in a manner that's conventional for the platform. One of these goals is more important than the other.

> Probably some programs, like "ls", are not able to print UTF-8 encoded file names on your terminal...

It can't print them *literally*, and it handles that issue quite well. GNU ls(1) pays attention to the locale encoding in a sensible manner, mainly looking at the character repertoire. In the ASCII locale, by default it displays a question mark in place of high-half octets, which clues me in that there's a problematic octet. With the -b option it represents them as backslash escapes, which if need be I can copy into a shell $'' construct. Actually tab completion is almost always the solution to entering the filename at the shell, and the completion that it generates uses $''. This is a healthy system: I have no difficulty in examining and using awkward filenames through my preferred medium of ASCII.

> Can't we create a new kind of magic, like for vstrings, which would contain additional information for a file name?

No. This would be the octet-vs-character distinction all over again; see several previous discussions on p5p. vstrings kinda work, though not fully, because we hardly ever perform string operations on version numbers with an expectation of producing a version number as output. But we manipulate filenames by string means all the time.

-zefram

p5pRT commented 7 years ago

From @xenu

On Sat, 4 Mar 2017 05:21:37 +0000 Zefram zefram@fysh.org wrote:

> pali@cpan.org wrote:
>
>>> My objective is to have every filename represented by some Perl string.
>>
>> I understand... and in the current model with perl strings it is impossible.
>
> No, it *is* possible, and easy.

Is it? Remember that we're also talking about Windows.

p5pRT commented 7 years ago

From zefram@fysh.org

Tomasz Konojacki wrote:

> Is it? Remember that we're also talking about Windows.

See upthread. The easy way to do it is different on Windows from how it is on Unix, but in both cases there's an obvious and simple way to represent all native filenames as Perl strings. The parts that would be platform-dependent are reasonably well localised within the core; programs written in Perl wouldn't need to be aware of the difference.

An issue that we haven't yet considered is passing filenames as command-line arguments. Before Unicode, we could expect something like open(H, "<", $ARGV[0]) to work. (Well, pre-SvUTF8 Perl didn't have three-arg open, but apart from the syntax that would work.) Currently $ENV{PERL_UNICODE} means that a program can't fully predict how argv[] will be mapped into @ARGV, but as it happens the Unicode bug in open() papers over that, so feeding an @ARGV element directly into open() like this will still work. (You lose if you perform any string operation on the way, though.)

In any system with a fixed open(), this probably ought to continue to work: a filename supplied as a command-line argument, in the platform's conventional manner, should yield an @ARGV element which, if fed to open() et al, functions as that filename. Unlike the question of encoding character strings as filenames, Unix does have well-defined conventions for this, with argv elements and filenames in the syscall API both being nul-terminated octet strings, and an identity mapping expected between them.
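Concretely, the property being asked for is that this stays working for any native filename supplied as an argument (sketch):

  open my $fh, '<', $ARGV[0] or die "can't open $ARGV[0]: $!\n";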

What about on Windows? What form does argv[] take, in its most native version? How does one conventionally encode a Unicode filename as a command-line argument?

-zefram

p5pRT commented 7 years ago

From @pali

On Saturday 04 March 2017 06:22:18 you wrote:

> pali@cpan.org wrote:
>
>> On Thursday 02 March 2017 04:23:35 Zefram via RT wrote:
>>
>>> My objective is to have every filename represented by some Perl string.
>>
>> I understand... and in the current model with perl strings it is impossible.
>
> No, it *is* possible, and easy. What's not possible is to do that and simultaneously achieve your other goal of having almost all Unicode strings represent some filename in a manner that's conventional for the platform. One of these goals is more important than the other.

So it is not possible (at least not easily). See the first post which I wrote to this bug. For you it is just not important, but it is important for me and for other people too. And what I wrote in the first post is a bug which I would like to see fixed.

As I wrote, I want to create a file named "ÿ" which is stored in a perl string. And I should be able to do it via a uniform perl function, without hacks like checking $^O.

>> Can't we create a new kind of magic, like for vstrings, which would contain additional information for a file name?
>
> No.

Why?

> This would be the octet-vs-character distinction all over again;

But this is your argument. On Linux it is necessary to use octets as the file name to support an arbitrary file stored on disk.

> see several previous discussions on p5p.

Any pointers?

> vstrings kinda work, though not fully, because we hardly ever perform string operations on version numbers with an expectation of producing a version number as output. But we manipulate filenames by string means all the time.

Yes, but what is the problem? It would be a magic scalar, where all get/set operations on it could be implemented in a platform-dependent manner.

Also, functions like readdir() can correctly prepare such a scalar, so whether you modify it or pass it directly to open(), you will open any file correctly.

So what is the problem with this idea?

p5pRT commented 7 years ago

From @pali

On Saturday 04 March 2017 15:28:02 you wrote:

> Tomasz Konojacki wrote:
>
>> Is it? Remember that we're also talking about Windows.
>
> See upthread. The easy way to do it is different on Windows from how it is on Unix, but in both cases there's an obvious and simple way to represent all native filenames as Perl strings.

You suggest that on Linux we should use only binary octets for the file name. Such a thing will not work on Windows, where you need to pass a Unicode string as the file name.

So if a user wants to create a file named "ÿ", it would be necessary to do something like this:

  use utf8;
  my $filename = "ÿ";
  utf8::encode($filename) if $^O ne "MSWin32";
  open my $file, ">", $filename or die;

(resp. replace utf8::encode with another function which converts a perl Unicode string to byte octets).

So this approach of yours is not useful, as a script for creating a file named "ÿ" would need to deal with every platform and its dependent behaviour.

To solve this problem, you need to be able to pass a Unicode string as a file name into open().

> What about on Windows? What form does argv[] take, in its most native version? How does one conventionally encode a Unicode filename as a command-line argument?

Like other winapi functions, for argv you also have -A and -W variants. -A is encoded in the current locale and -W in modified UTF-16. So if you want, you can take a Unicode string.

p5pRT commented 7 years ago

From zefram@fysh.org

pali@cpan.org wrote:

> So if a user wants to create a file named "ÿ",

You can't do this, because, at the level you're specifying it, this isn't a well-defined action on Unix. Some encoding needs to be used to turn the character into an octet string, and there isn't anything intrinsic to the platform that determines which encoding to use.

The code that you then give is a bit more specific. I think the effect you're trying to specify in the code is that you use the octet string "\xc3\xbf" on Unix and the character string "\x{ff}" on Windows. If this lower-level description is actually what you want to achieve, then you should expect to need platform-dependent code to do it, because this is by definition a platform-dependent effect.

You *could* make the top-level program cleaner by hiding the platform dependence, and on Unix the choice of encoding, in a module. Your program could then look like

  open my $file, ">", pali_filename_encode("\xff") or die;

The filename encoder translates an arbitrary Unicode string into a filename in a manner that is conventional for the platform, and represents the filename as a Perl string in the manner required for open(). It could well become part of File::Spec. Note that the corresponding decoder must fail on some inputs.
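One plausible body for the hypothetical pali_filename_encode(), under the assumptions above (with UTF-8 as the chosen Unix encoding):

  use Encode qw(encode FB_CROAK);
  sub pali_filename_encode {
      my ($name) = @_;
      return $name if $^O eq 'MSWin32';         # Windows: keep the character string
      return encode('UTF-8', $name, FB_CROAK);  # Unix: commit to a concrete octet encoding
  }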

-zefram

p5pRT commented 7 years ago

From @pali

On Sunday 05 March 2017 11:44:40 you wrote:

> pali@cpan.org wrote:
>
>> So if a user wants to create a file named "ÿ",
>
> You can't do this, because, at the level you're specifying it, this isn't a well-defined action on Unix. Some encoding needs to be used to turn the character into an octet string, and there isn't anything intrinsic to the platform that determines which encoding to use.
>
> The code that you then give is a bit more specific. I think the effect you're trying to specify in the code is that you use the octet string "\xc3\xbf" on Unix and the character string "\x{ff}" on Windows. If this lower-level description is actually what you want to achieve, then you should expect to need platform-dependent code to do it, because this is by definition a platform-dependent effect.
>
> You *could* make the top-level program cleaner by hiding the platform dependence, and on Unix the choice of encoding, in a module. Your program could then look like
>
>   open my $file, ">", pali_filename_encode("\xff") or die;
>
> The filename encoder translates an arbitrary Unicode string into a filename in a manner that is conventional for the platform, and represents the filename as a Perl string in the manner required for open(). It could well become part of File::Spec. Note that the corresponding decoder must fail on some inputs.
>
> -zefram

Exactly! This is what high-level programs want to do and achieve. They really should not have to care about low-level OS differences.

The decoder does not have to always fail on non-encodable input. It can e.g. directly use the Encode module and allow the caller to specify what to do with bad input: https://metacpan.org/pod/Encode#Handling-Malformed-Data

But before we can start implementing such a thing (e.g. in the File::Spec module), we need a defined API for open() and a resolution of this bug ("\xFF" eq "\N{U+FF}") which I described in the first post. Because right now it is not specified whether open() takes a Unicode string or byte octets...

p5pRT commented 6 years ago

From @pali

On Tuesday 28 February 2017 00:35:45 Leon Timmermans wrote:

>> Windows has two sets of functions for accessing files: functions with an -A suffix, which take file names in the encoding of the current 8-bit codepage, and functions with a -W suffix, which take file names in Unicode (more precisely, in the Windows variant of UTF-16). [...]
>>
>> Linux stores file names as binary octets; there is no encoding or requirement for Unicode. [...]
>>
>> Which means there is no way to have uniform multiplatform support for file access without hacks.
>
> Correct observations. Except OS X makes this more complicated still: it uses UTF-8 encoded bytes, normalized using a non-standard variation of NFD.

For completeness:

Windows uses UCS-2 for file names, and also in the corresponding WinAPI -W functions which operate on file names. It is not UTF-16, as file names really may contain unpaired surrogates.

OS X uses a non-standard variant of Unicode NFD, encoded in UTF-8.

Linux uses just binary octets.

An idea for how to handle file names in Perl:

Store file names in extended Perl Unicode (with code points above U+1FFFFF). Non-extended code points would represent normal Unicode code points, and code points above U+1FFFFF would represent the parts of a file name which cannot be unambiguously represented in Unicode.

On Linux, take the file name (which is a char*) and start decoding it from UTF-8. Any sequence of bytes which cannot be decoded as UTF-8 would be decoded as a sequence of extended code points (e.g. U+200000 - U+2000FF). This operation has an inverse, and can therefore convert any file name stored on a Linux system. Plus it is UTF-8 friendly: if file names on the VFS are stored in UTF-8 (which is now common), then perl's say function can print them correctly.
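A rough pure-Perl sketch of that decoding step, using Encode's FB_QUIET mode (which consumes the longest valid UTF-8 prefix and leaves the remainder in the source variable); the U+200000 offset is the proposal's, the rest is illustrative:

  use Encode qw(decode FB_QUIET);
  no warnings 'non_unicode';   # we deliberately create super-Unicode code points
  sub decode_filename {
      my ($bytes) = @_;
      my $out = '';
      while (length $bytes) {
          $out .= decode('UTF-8', $bytes, FB_QUIET);  # eat the valid prefix
          # escape one undecodable byte into the extended range and continue
          $out .= chr(0x200000 + ord(substr($bytes, 0, 1, ''))) if length $bytes;
      }
      return $out;
  }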

On OS X, take the file name (which is a char*, but in UTF-8) and just decode it from UTF-8. For the conversion from Perl Unicode to char*, just apply that non-standard NFD normalization and encode to UTF-8.

On Windows, take the file name (a wchar_t*, which is a uint16_t*) compatible with the -W WinAPI functions, representing a UCS-2 sequence, and decode it to Unicode. There can be unpaired surrogates; represent them either as Unicode surrogate code points, or via extended Perl code points (above U+1FFFFF). The reverse process (from perl Unicode to wchar_t*/uint16_t*) is obvious.

p5pRT commented 6 years ago

From @dur-randir

On Mon, 20 Aug 2018 01:48:07 -0700, pali@cpan.org wrote:

> Store file names in extended Perl Unicode (with code points above U+1FFFFF). Non-extended code points would represent normal Unicode code points, and code points above U+1FFFFF would represent the parts of a file name which cannot be unambiguously represented in Unicode.

And then someone passes such a string to an API call that expects well-formed UTF-8, and everything crashes. Perl core has recently taken a lot of steps to allow only well-formed UTF-8 strings to be visible to the user, and now you suggest taking a step back - I don't think that's a good idea.

It could work if you could separate such strings into their own namespace - but that'd require an API change for all filesystem-related functions.

p5pRT commented 6 years ago

From @pali

On Tuesday 21 August 2018 02:02:18 Sergey Aleynikov via RT wrote:

>> Store file names in extended Perl Unicode (with code points above U+1FFFFF). [...]
>
> And then someone passes such a string to an API call that expects well-formed UTF-8, and everything crashes. Perl core has recently taken a lot of steps to allow only well-formed UTF-8 strings to be visible to the user, and now you suggest taking a step back - I don't think that's a good idea.
>
> It could work if you could separate such strings into their own namespace - but that'd require an API change for all filesystem-related functions.

Yesterday on IRC I presented the following idea, which could solve the above problem.

Introduce a new qf operator which takes a Unicode string and returns a perl object representing a file name. Internally the object can store the file name however it needs to (e.g. as a sequence of integer code points, if storing code points above U+1FFFFF in a UTF-8 string is bad), and every perl filesystem function (like open()) would interpret these file name objects specially -- without The Unicode Bug, etc...

Functions like readdir() would also return these file name objects instead of regular strings.

Those file name objects could have a proper stringification operator, to always produce a printable string for the file name. For the non-representable code points above U+1FFFFF, the stringification function could escape them via some ASCII sequences.

This would allow module ABC to create a file name via the qf operator and pass it to module CDE, which calls open() on the argument passed from module ABC.

All those fs functions (like open()) would work as before, so there would not be any regression for existing code. Just when the passed argument is such a special object, it would be handled differently.
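Usage might look something like this (qf is the proposed operator, so this is pseudocode, not runnable today):

  my $name = qf"ÿ";                # a file-name object, not a plain string
  open my $fh, '>', $name or die;  # open() consults the object's native form
  my @entries = readdir $dh;       # entries would arrive as file-name objects too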

p5pRT commented 6 years ago

From @dur-randir

On Tue, 21 Aug 2018 02:11:41 -0700, pali@cpan.org wrote:

> Introduce a new qf operator which takes a Unicode string and returns a perl object representing a file name. [...]
>
> Functions like readdir() would also return these file name objects instead of regular strings.

Yeah, that's the path of changing the API.

plk commented 3 years ago

Did anything happen about this? There is a general issue with Unicode from the environment/command line on Windows which is really causing problems with a large app I have, as its usage has exploded and we have to support more and more languages. The issue seems to be discussed, with a suggested fix, here:

https://www.nu42.com/2017/02/perl-unicode-windows-trilogy-one.html

I did test this fix and it does solve the issues I have, but I'm not in a position to say whether it's enough in general. Also, since I use PAR::Packer to provide pseudo-binaries, and this uses a custom runperl() and not perl.exe, the fix above needs to be ported to PAR::Packer, and the maintainer of that doesn't want to change anything in there until upstream perl addresses this. It's a bit unfortunate that Perl seems almost alone in having this sort of issue on Windows ...

tonycoz commented 3 years ago

It's complex, and if we want to address all of the goals discussed above, it's not going to be a simple change.

Leont commented 3 years ago

The additional difficulty is that we would probably need a solution that makes sense on both Unix and on Windows, despite the two having wildly different handling of character encodings in the file-system (and other system APIs).

plk commented 3 years ago

Given how long these issues have been extant, can I assume then that they won't ever be fixed? It's just somewhat sad that this is going to provide more fuel to the general ascendancy of python which, as far as I know, doesn't have these issues ...

xenu commented 3 years ago

Unicode on Windows almost certainly won't be fixed in 5.34, but no one said it won't happen in a later release. We have a pretty good understanding of those issues, they were discussed many times on various channels and there's definitely the will to fix them.

It's an extremely complicated issue and we have yet to decide exactly how it should be fixed, but I'm sure we will get there eventually.

xenu commented 3 years ago

BTW, there's a workaround for those issues. If you're using Windows 10 1803 or newer, enabling the "Use Unicode UTF-8 for worldwide language support" checkbox in Region Settings will magically make UTF-8 filenames work in perl.

Keep in mind that this switch is global and it may break some legacy applications.

tonycoz commented 3 years ago

> BTW, there's a workaround for those issues. If you're using Windows 10 1803 or newer, enabling the "Use Unicode UTF-8 for worldwide language support" checkbox in Region Settings will magically make UTF-8 filenames work in perl.
>
> Keep in mind that this switch is global and it may break some legacy applications.

It won't fix upgraded vs downgraded SVs referring to different filenames.

xenu commented 3 years ago

Sure, but that issue exists on the other platforms (like Linux) too.

plk commented 3 years ago

Very useful to know about that beta Windows option, and that MS is finally joining everyone else on UTF-8. I tried it and it indeed worked nicely, and presumably it reduces the future messing about needed to solve this.

khwilliamson commented 3 years ago

Shouldn't we document this somehow?

FGasper commented 3 years ago

I wrote a module that I think fixes at least the “upgraded vs downgraded SVs referring to different filenames” problem: https://metacpan.org/pod/Sys::Binmode

salva commented 3 years ago

There is an "easy" work-around for handling filenames that are not valid UTF-8 or UTF-16 in OSs where those encodings are the default.

Perl utf8 is able to encode characters in the range 0-0x7FFFFFFFFFFFFFFF but currently Unicode defines less than 300000 symbols. That means that most of that space is unused and is going to remain unused for the foreseeable future.

We can create an encoding (butf8 - bijective utf8) that uses some of those unused codes (for instance, the last 128) to represent invalid utf-8 sequences..

For example, in an OS where LC_TYPE is set to en_US.UTF-8, a filename just containing the byte 0xC0 would be butf8-decoded as the string "\x{7FFFFFFFFFFFFFC0}". That string is valid utf8 and can be manipulated in Perl without worries.

The user may prepend the string foo to generate the file name "foo\x{7FFFFFFFFFFFFFC0}" that when passed back to the OS is butf8-encoded to the byte string "foo\xc0".

Characters with codes in the reserved range appearing in file names should also be handled as a special case. For instance, if the filename contains the sequence of bytes 0xff, 0x80, 0x87, 0xbf, 0xbf, 0xbf, 0xbf, 0xbf, 0xbf, 0xbf, 0xbf, 0xbf, 0x80 (utf8 for "\x{7FFFFFFFFFFFFFC0}"), butf8-decoding it would result in the string "\x{7FFFFFFFFFFFFFff}\x{7FFFFFFFFFFFFF80}\x{7FFFFFFFFFFFFF87}\x{7FFFFFFFFFFFFFbf}\x{7FFFFFFFFFFFFFbf}\x{7FFFFFFFFFFFFFbf}\x{7FFFFFFFFFFFFFbf}\x{7FFFFFFFFFFFFFbf}\x{7FFFFFFFFFFFFFbf}\x{7FFFFFFFFFFFFFbf}\x{7FFFFFFFFFFFFFbf}\x{7FFFFFFFFFFFFFbf}\x{7FFFFFFFFFFFFF80}". That's required in order to guarantee the bijectiveness of the encoding.
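The encode direction of butf8 might look like this (a sketch; butf8 is this comment's proposal, not an existing Encode encoding, and the reserved-range special case described above is omitted; 64-bit perl assumed):

  use Encode qw(encode);
  sub butf8_encode {
      my ($str) = @_;
      my $bytes = '';
      for my $ch (split //, $str) {
          my $cp = ord $ch;
          if ($cp >= 0x7FFFFFFFFFFFFF80) {
              $bytes .= chr($cp & 0xFF);       # escaped code point: emit the original raw byte
          } else {
              $bytes .= encode('UTF-8', $ch);  # ordinary character: normal UTF-8
          }
      }
      return $bytes;
  }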

On Windows, invalid UTF-16 sequences can be represented as pairs of bytes as in the previous case, or a 16-bit space can simply be reserved.

pali commented 3 years ago

@salva: This is something which I have already suggested in this discussion. See my comment https://github.com/Perl/perl5/issues/15883#issuecomment-544087843 and the following discussion (as the approach has some issues).

salva commented 3 years ago

@pali, I had missed your comment.

I don't see any issue with that approach. Code points up to 0x7FFFFFFFFFFFFFFF can already be generated in Perl, and so they can already be passed to external functions. If that is not desirable, then what's required is an SvPVutf8_strict() set of functions that check that the contents of the SV can actually be converted to UTF-8.

That's also just a hypothetical issue. In practice, most libraries know that bad data exists and handle it in some way (for instance, ignoring it, or signalling an error).

In the end, what I see is that it is 2021 and Perl still doesn't know how to access my file system correctly. This is a very critical issue for people outside the ASCII bubble!!!

Just blocking any proposed solution because of minor issues is the wrong thing to do.

salva commented 3 years ago

Just for reference:

salva@opti:/tmp$ mkdir test-ñ
salva@opti:/tmp$ cd test-ñ/
salva@opti:/tmp/test-ñ$ python3 -c 'open("python-\xf1", "w")'
salva@opti:/tmp/test-ñ$ scala -e 'import java.io._; new FileOutputStream(new File("jvm-\u00f1"))'
salva@opti:/tmp/test-ñ$ ruby -e 'File.open("ruby-\xf1", "w")'
salva@opti:/tmp/test-ñ$ perl -e 'open F, ">", "perl-\xf1"'
salva@opti:/tmp/test-ñ$ ls
 jvm-ñ  'perl-'$'\361'   python-ñ  'ruby-'$'\361'

and...

salva@opti:/tmp/test-ñ$ python3 -c 'open("python-arg-ñ", "w")'
salva@opti:/tmp/test-ñ$ ruby -e 'File.open("ruby-u-\u00f1", "w")'
salva@opti:/tmp/test-ñ$ ruby -e 'File.open("ruby-arg-ñ", "w")'
salva@opti:/tmp/test-ñ$ perl -e 'open F, ">", "perl-arg-ñ"'
salva@opti:/tmp/test-ñ$ scala -e 'import java.io._; new FileOutputStream(new File("jvm-arg-ñ"))'
salva@opti:/tmp/test-ñ$ ls
 jvm-arg-ñ   jvm-ñ  'perl-'$'\361'   perl-arg-ñ   python-arg-ñ   python-ñ  'ruby-'$'\361'   ruby-arg-ñ   ruby-u-ñ

So, it seems all of those languages but perl do the right thing :-(

FGasper commented 3 years ago

@salortiz What do you think of the notion of using extra flags in the SV to indicate that the string is text? Then Perl could implement the semantics you envision:

my $path = "path-\xf1";
text::set($path);
open my $fh, '>', $path; # encodes as per the environment

This would even facilitate working Windows filesystem operations. :)

pali commented 3 years ago

@FGasper: This is also something which I proposed in this ticket. See my comment https://github.com/Perl/perl5/issues/15883#issuecomment-544087846 about the qf operator for this purpose, and the following discussion, which revealed that this approach also has issues.

FGasper commented 3 years ago

@pali Sort of … my proposal is to use extra flags on the SV to solve the more general problem of differentiating text strings from byte strings. The qf// idea seems specifically geared to filesystem interaction, but would it not be better to solve the filesystem stuff in the broader context of text-vs-byte strings?

salva commented 3 years ago

> What do you think of the notion of using extra flags in the SV to indicate that the string is text?

IMO that is the wrong approach. You are just pushing onto the developer the responsibility to encode/decode the data before/after calling any file-system related builtin.

Using the right encoding at the file system level is not something optional that you do when you know you have some non-ASCII data. On the contrary, it must be the default, every piece of code must take it into account always, and the sensible way to make that happen is to teach Perl how to do it transparently using sane defaults.

In practice that means doing what every other language is already doing.

And then, decide what is more important to you: absolute backward compatibility, so that the feature is only available when explicitly activated (use feature 'sane_filesystem';); or globally (for instance, via the -C flag); or by default.

Finally, add the required machinery to let the user/developer disable it when, for whatever reason, they want to do otherwise.

FGasper commented 3 years ago

I certainly agree that an encode-for-the-system-by-default workflow makes the most sense. As long as it also preserves an easy way to express any arbitrary filename that the system supports, it sounds good to me.