Open p5pRT opened 6 years ago
Parsing non-ASCII command-lines on Win32 is moderately broken.
Whether in codepage 65001 or not, non-ASCII (or at least, non-system codepage) characters are not preserved. For example:

J:\dev\perl\git\perl\win32>..\perl -I..\lib -MDevel::Peek -Mutf8 -e "Dump(qq(αω) )"
SV = PV(0x34f6f8) at 0x67c578
  REFCNT = 1
  FLAGS = (POK,IsCOW,READONLY,PROTECT,pPOK)
  PV = 0x688998 "a?"\0
  CUR = 2
  LEN = 10
  COW_REFCNT = 0

J:\dev\perl\git\perl\win32>..\perl -I..\lib -MDevel::Peek -Mutf8 -e "Dump(shift) " αω
SV = PV(0x52efb8) at 0x5bbfa0
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK)
  PV = 0x5abfe8 "a?"\0
  CUR = 2
  LEN = 10

The -CA switch makes no difference.
While this is caused by the interaction between the C runtime and the Win32 API, should we be bypassing that to get some sensible behaviour out of it?
Simplest (for the user) would be for -CA to retrieve the UTF-16 command-line arguments and convert them to UTF-8 before @ARGV is populated, though this is complicated by us parsing argv while we're updating it.
Relying only on -CA might also have some backcompat issues, since if there's Win32 code that blindly sets -CA, it might break if we start returning the real value. This change could be considered a bug fix though.
To simplify the implementation we might depend on an environment variable instead, but that has the same sort of problem that setting PERL_UNICODE= by default had on file handling.
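
As a rough illustration of the conversion step proposed above (UTF-16 arguments re-encoded as UTF-8 before @ARGV is populated), here is a minimal Python sketch. It is not the Perl or C implementation: the helper name `utf16_units_to_utf8` and the hard-coded code units are hypothetical, and on Windows the wide arguments would presumably come from an API such as `GetCommandLineW`/`CommandLineToArgvW`, which is an assumption rather than something stated in this ticket.

```python
def utf16_units_to_utf8(units):
    """Re-encode a sequence of UTF-16 code units as UTF-8 bytes."""
    # Pack each 16-bit code unit little-endian, as Windows stores wide strings.
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    # Decoding as UTF-16LE also combines surrogate pairs into one code point.
    return raw.decode("utf-16-le").encode("utf-8")


# Hypothetical wide argument: the two code units of "αω" from the report.
wide_arg = [0x03B1, 0x03C9]
print(utf16_units_to_utf8(wide_arg))  # b'\xce\xb1\xcf\x89'
```

With bytes like these in argv, the existing -CA behaviour of treating arguments as UTF-8 would yield the two intended characters instead of "a?".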
On Sun, 02 Sep 2018 18:26:01 -0700 "Tony Cook (via RT)" <perlbug-followup@perl.org> wrote:
> Parsing non-ASCII command-lines on Win32 is moderately broken.
> Whether in codepage 65001 or not, non-ASCII (or at least, non-system codepage) characters are not preserved. For example:
>
> J:\dev\perl\git\perl\win32>..\perl -I..\lib -MDevel::Peek -Mutf8 -e "Dump(qq(??) )"
> SV = PV(0x34f6f8) at 0x67c578
>   REFCNT = 1
>   FLAGS = (POK,IsCOW,READONLY,PROTECT,pPOK)
>   PV = 0x688998 "a?"\0
>   CUR = 2
>   LEN = 10
>   COW_REFCNT = 0
>
> J:\dev\perl\git\perl\win32>..\perl -I..\lib -MDevel::Peek -Mutf8 -e "Dump(shift) " ??
> SV = PV(0x52efb8) at 0x5bbfa0
>   REFCNT = 1
>   FLAGS = (TEMP,POK,pPOK)
>   PV = 0x5abfe8 "a?"\0
>   CUR = 2
>   LEN = 10
>
> The -CA switch makes no difference.
> While this is caused by the interaction between the C runtime and the Win32 API, should we be bypassing that to get some sensible behaviour out of it?
> Simplest (for the user) would be for -CA to retrieve the UTF-16 command-line arguments and convert them to UTF-8 before @ARGV is populated, though this is complicated by us parsing argv while we're updating it.
> Relying only on -CA might also have some backcompat issues, since if there's Win32 code that blindly sets -CA, it might break if we start returning the real value. This change could be considered a bug fix though.
> To simplify the implementation we might depend on an environment variable instead, but that has the same sort of problem that setting PERL_UNICODE= by default had on file handling.
While this ticket is about @ARGV and not filenames, I consider it a duplicate of https://rt.perl.org/Public/Bug/Display.html?id=130831 (see also a recent thread on p5p: https://www.nntp.perl.org/group/perl.perl5.porters/2018/08/msg251899.html).
It's *exactly* the same problem. In this case, command line arguments are being converted from the console codepage to the system codepage (usually they're not the same!). If a character is invalid or doesn't exist in the system codepage, it is being replaced with either '?' (most locales) or REPLACEMENT CHARACTER (when *system* codepage is set to 65001, which is possible only on Windows 10).
While this brings *horrible* experience to perl users, I don't think it's a bug. Also, changing this behaviour would obviously break stuff.
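
The lossy step described above can be imitated in Python with a replacing encoder. This is only a sketch under assumptions: the helper name `to_codepage` is hypothetical, cp1252 stands in for a typical Western system codepage, and Python's codec does not reproduce Windows' own conversion exactly. The report shows `a?` rather than `??`, which suggests Windows applied a best-fit mapping for α that Python does not emulate.

```python
def to_codepage(text, codepage):
    """Encode text into a legacy codepage, substituting '?' for anything unmappable."""
    return text.encode(codepage, errors="replace")


# cp1252 has no Greek letters, so both characters are lost.
print(to_codepage("αω", "cp1252"))  # b'??'

# A Greek codepage can represent them, so the round trip is lossless.
greek = to_codepage("αω", "cp1253")
print(greek.decode("cp1253") == "αω")  # True
```

Once the bytes have been replaced there is nothing left for -CA or any later decoding step to recover.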
The RT System itself - Status changed from 'new' to 'open'
On Sun, 02 Sep 2018 19:49:57 -0700, me@xenu.pl wrote:
> While this ticket is about @ARGV and not filenames, I consider it a duplicate of https://rt.perl.org/Public/Bug/Display.html?id=130831 (see also a recent thread on p5p: https://www.nntp.perl.org/group/perl.perl5.porters/2018/08/msg251899.html).
> It's *exactly* the same problem. In this case, command line arguments are being converted from the console codepage to the system codepage (usually they're not the same!). If a character is invalid or doesn't exist in the system codepage, it is being replaced with either '?' (most locales) or REPLACEMENT CHARACTER (when *system* codepage is set to 65001, which is possible only on Windows 10).
It's certainly a strongly related problem, but it's not the same problem.
It came up while I was diagnosing code that accepted a filename (the name in the test case) and called the Win32-specific APIs to open it and failed.
> While this brings *horrible* experience to perl users, I don't think it's a bug. Also, changing this behaviour would obviously break stuff.
I think it's something we can improve.
The main issue right now is that code that accepts strings from the command-line gets nonsensical results - unless the caller does nonsensical things.
The attached patch modifies perl to re-generate argv from the UTF-16 command-line if it sees a -CA switch, and it works for me for commands run from the command prompt.
However\, one test fails:

run/switchC.t ..
1..15
ok 1 - -CO: no warning on UTF-8 output
ok 2 - -C2: no warning on UTF-8 output
ok 3 - -CI: read in UTF-8 input
ok 4 - -CE: UTF-8 stderr
ok 5 - -Co: auto-UTF-8 open for output
ok 6 - -Ci: auto-UTF-8 open for input
ok 7 - -Ci: auto-UTF-8 open for input affects the current file
ok 8 - -Ci: auto-UTF-8 open for input has file scope
# Failed test 9 - -CA: @ARGV at run/switchC.t line 78
#      got '196'
not ok 9 - -CA: @ARGV
# expected /(?^s:^256(?:\r?\n)?$)/
ok 10 - #!perl -C
ok 11 - #!perl -C followed by another switch
ok 12 - #!perl -C\

This fails because backticks have the same type of problem - backticks don't know about Unicode either. By the time the chr(256) gets to win32_popen() all it sees is "\xc4\x80", and it can't tell whether that was Latin-1 or UTF-8, so it's passed through as 00C4 0080 rather than 0100.
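
The ambiguity can be reproduced outside perl. The Python sketch below is only an illustration, not the code path in win32_popen(): it takes the UTF-8 bytes of chr(256) and decodes them both ways, giving the two interpretations mentioned above.

```python
# The two bytes that arrive once chr(256) has been UTF-8 encoded.
raw = chr(256).encode("utf-8")
print(raw)  # b'\xc4\x80'

# Read as Latin-1, every byte becomes its own character: U+00C4 U+0080.
as_latin1 = raw.decode("latin-1")
print([hex(ord(c)) for c in as_latin1])  # ['0xc4', '0x80']

# Read as UTF-8, the same bytes form the single character U+0100.
as_utf8 = raw.decode("utf-8")
print([hex(ord(c)) for c in as_utf8])  # ['0x100']

# The first character of the Latin-1 reading is the 196 seen in the failed test.
print(ord(as_latin1[0]))  # 196
```

Both decodings succeed, so the bytes alone carry no hint about which one was intended.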
Tony
On Tue, 04 Sep 2018 20:40:16 -0700, tonyc wrote:
> > While this brings *horrible* experience to perl users, I don't think it's a bug. Also, changing this behaviour would obviously break stuff.
> I think it's something we can improve.
> The main issue right now is that code that accepts strings from the command-line gets nonsensical results - unless the caller does nonsensical things.
> The attached patch modifies perl to re-generate argv from the UTF-16 command-line if it sees a -CA switch, and it works for me for commands run from the command prompt.
Also, it breaks embedding, so don't apply this patch.
Maybe an alternative is to not make it depend on the -CA switch, but on the current code page.
If the current code page is 65001 then main() (in win32/runperl.c) could do the conversion to UTF-8 I do in my patch.
The program then depends on the normal -CA behaviour to treat that as UTF-8, so perl code sees Unicode in @ARGV.
It does mean that a user has to do something unusual (chcp 65001) to get reasonable behaviour.
Tony
On 09/05/2018 05:20 PM, Tony Cook via RT wrote:
> On Tue, 04 Sep 2018 20:40:16 -0700, tonyc wrote:
> > > While this brings *horrible* experience to perl users, I don't think it's a bug. Also, changing this behaviour would obviously break stuff.
> > I think it's something we can improve.
> > The main issue right now is that code that accepts strings from the command-line gets nonsensical results - unless the caller does nonsensical things.
> > The attached patch modifies perl to re-generate argv from the UTF-16 command-line if it sees a -CA switch, and it works for me for commands run from the command prompt.
> Also, it breaks embedding, so don't apply this patch.
> Maybe an alternative is to not make it depend on the -CA switch, but on the current code page.
> If the current code page is 65001 then main() (in win32/runperl.c) could do the conversion to UTF-8 I do in my patch.
> The program then depends on the normal -CA behaviour to treat that as UTF-8, so perl code sees Unicode in @ARGV.
> It does mean that a user has to do something unusual (chcp 65001) to get reasonable behaviour.
> Tony
> ---
> via perlbug: queue: perl5 status: open
> https://rt-archive.perl.org/perl5/Ticket/Display.html?id=133496
I haven't looked at this thread in detail, but script runs can be used to disambiguate some things. Perhaps that would help, so think about that possibility. I have unreleased code for Pod::Simple that improves CP1252 vs UTF-8 detection that might be instructive.
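
For comparison, here is a much cruder detection heuristic than the one suggested above, sketched in Python. It is neither the script-run technique nor the unreleased Pod::Simple code, neither of which is shown in this thread: it simply tries strict UTF-8 first and falls back to CP1252. The function name `guess_decode` is hypothetical.

```python
def guess_decode(raw):
    """Return (text, encoding), preferring strict UTF-8 and falling back to CP1252."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # CP1252 leaves a few bytes undefined, so replace rather than fail.
        return raw.decode("cp1252", errors="replace"), "cp1252"


print(guess_decode("αω".encode("utf-8")))  # ('αω', 'utf-8')
print(guess_decode(b"caf\xe9"))            # ('café', 'cp1252')

# Limitation: bytes valid in both encodings are always taken as UTF-8.
print(guess_decode(b"\xc4\x80"))           # ('Ā', 'utf-8')
```

The last call shows why a simple fallback does not settle the chr(256) case from the failing test: those bytes are well-formed in both encodings, which is presumably where extra context such as script runs would have to come in.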
On Wed, 05 Sep 2018 16:20:03 -0700 "Tony Cook via RT" <perlbug-followup@perl.org> wrote:
> Also, it breaks embedding, so don't apply this patch.
> Maybe an alternative is to not make it depend on the -CA switch, but on the current code page.
> If the current code page is 65001 then main() (in win32/runperl.c) could do the conversion to UTF-8 I do in my patch.
> The program then depends on the normal -CA behaviour to treat that as UTF-8, so perl code sees Unicode in @ARGV.
> It does mean that a user has to do something unusual (chcp 65001) to get reasonable behaviour.
> Tony
You mean the console codepage? There are some problems with that approach.
Console codepages don't exist in windows subsystem applications (like wperl.exe), GetConsoleCP() returns 0 in them:

C:\Users\xenu>wperl -MWin32 -E "open my($fh), '>', 'a.txt'; print {$fh} Win32::GetConsoleCP()"
C:\Users\xenu>type a.txt
0

Another problem is that it won't cover situations where it's impossible to change the console codepage, for example when perl.exe is launched via explorer.exe (e.g. via a .lnk shortcut or when some file extension is associated with a perl script).
I think that the only reasonable way to fix the Win32 Unicode bug is to introduce a way to globally force UTF-8 everywhere, i.e. @ARGV, filenames and env variables. The -C flag used to serve this exact purpose[1], but this functionality was removed in 5.8.1.
IMO we should reintroduce that switch.
On second thought, I think, in the long run, we should enable Unicode handling by default and add a switch which would restore the old behaviour for scripts that rely on it. IMO that would be the most reasonable approach, because the current behaviour is *completely* broken and I'm pretty sure that changing it would fix more code than it would break.
On Thu, 06 Sep 2018 11:44:09 -0700, me@xenu.pl wrote:
> On Wed, 05 Sep 2018 16:20:03 -0700 "Tony Cook via RT" <perlbug-followup@perl.org> wrote:
> > Also, it breaks embedding, so don't apply this patch.
> > Maybe an alternative is to not make it depend on the -CA switch, but on the current code page.
> > If the current code page is 65001 then main() (in win32/runperl.c) could do the conversion to UTF-8 I do in my patch.
> > The program then depends on the normal -CA behaviour to treat that as UTF-8, so perl code sees Unicode in @ARGV.
> > It does mean that a user has to do something unusual (chcp 65001) to get reasonable behaviour.
> > Tony
> You mean the console codepage? There are some problems with that approach.
> Console codepages don't exist in windows subsystem applications (like wperl.exe), GetConsoleCP() returns 0 in them:
>
> C:\Users\xenu>wperl -MWin32 -E "open my($fh), '>', 'a.txt'; print {$fh} Win32::GetConsoleCP()"
> C:\Users\xenu>type a.txt
> 0
>
> Another problem is that it won't cover situations where it's impossible to change the console codepage, for example when perl.exe is launched via explorer.exe (e.g. via a .lnk shortcut or when some file extension is associated with a perl script).
> I think that the only reasonable way to fix the Win32 Unicode bug is to introduce a way to globally force UTF-8 everywhere, i.e. @ARGV, filenames and env variables. The -C flag used to serve this exact purpose[1], but this functionality was removed in 5.8.1.
The argv handling looks similar to what my patch does - with the same problem for embedding.
The wide system calls handling appears to assume all SVs are UTF-8 encoded, even without the SVf_UTF8 flag set.
> IMO we should reintroduce that switch.
> On second thought, I think, in the long run, we should enable Unicode handling by default and add a switch which would restore the old behaviour for scripts that rely on it. IMO that would be the most reasonable approach, because the current behaviour is *completely* broken and I'm pretty sure that changing it would fix more code than it would break.
I think fixing @ARGV would be reasonably painless for backcompat, but the rest of the wide-character support is too likely to break things, I think.
I wonder how much CPAN testing was done with -C for perl 5.8.0.
Tony
Migrated from rt.perl.org#133496 (status was 'open')
Searchable as RT133496$