@ARGV, -CA and Win32 - Githubissues

p5pRT commented 6 years ago

Migrated from rt.perl.org#133496 (status was 'open')

Searchable as RT133496$

p5pRT commented 6 years ago

From @tonycoz

Created by @tonycoz

Parsing non-ASCII command-lines on Win32 is moderately broken.

Whether in codepage 65001 or not\, non-ASCII (or at least\, non-system codepage) characters are not preserved. For example:

J:\dev\perl\git\perl\win32>..\perl -I..\lib -MDevel::Peek -Mutf8 -e "Dump(qq(αω) )" SV = PV(0x34f6f8) at 0x67c578 REFCNT = 1 FLAGS = (POK\,IsCOW\,READONLY\,PROTECT\,pPOK) PV = 0x688998 "a?"\0 CUR = 2 LEN = 10 COW_REFCNT = 0

J:\dev\perl\git\perl\win32>..\perl -I..\lib -MDevel::Peek -Mutf8 -e "Dump(shift) " αω SV = PV(0x52efb8) at 0x5bbfa0 REFCNT = 1 FLAGS = (TEMP\,POK\,pPOK) PV = 0x5abfe8 "a?"\0 CUR = 2 LEN = 10

The -CA switch makes no difference.

While this is caused by the interaction between the C runtime and the Win32 API\, should we be bypassing that to get some sensible behaviour out of it?

Simplest (for the user) would be for -CA to retrieve the UTF-16 command-line arguments and convert them to UTF-8 before @ARGV is populated\, though this is complicated by us parsing argv while we're updating it.

Relying only on -CA might also have some backcompat issues\, since if there's Win32 code that blindly sets -CA\, it might break if we start returning the real value. This change could be considered a bug fix though.

To simplify the implementation we might depend on an environment variable instead\, but that has the same sort of problem that setting PERL_UNICODE= by default had on file handling.

Perl Info

``` Flags: category=core severity=low Site configuration information for perl 5.29.3: Configured by tony at Mon Sep 3 10:54:31 2018. Summary of my perl5 (revision 5 version 29 subversion 3) configuration: Platform: osname=MSWin32 osvers=6.1.7601 archname=MSWin32-x64-multi-thread uname='' config_args='undef' hint=recommended useposix=true d_sigaction=undef useithreads=define usemultiplicity=define use64bitint=define use64bitall=undef uselongdouble=undef usemymalloc=n default_inc_excludes_dot=define bincompat5005=undef Compiler: cc='gcc' ccflags =' -s -O2 -DWIN32 -DWIN64 -DCONSERVATIVE -DPERL_TEXTMODE_SCRIPTS -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -fwrapv -fno-strict-aliasing -mms-bitfields' optimize='-s -O2' cppflags='-DWIN32' ccversion='' gccversion='4.9.2' gccosandvers='' intsize=4 longsize=4 ptrsize=8 doublesize=8 byteorder=12345678 doublekind=3 d_longlong=define longlongsize=8 d_longdbl=define longdblsize=16 longdblkind=3 ivtype='long long' ivsize=8 nvtype='double' nvsize=8 Off_t='long long' lseeksize=8 alignbytes=8 prototype=define Linker and Libraries: ld='g++' ldflags ='-s -L"c:\perl\lib\CORE" -L"C:\MinGW\lib"' libpth=C:\MinGW\lib libs= -lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32 perllibs= -lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32 libc= so=dll useshrplib=true libperl=libperl529.a gnulibc_version='' Dynamic Linking: dlsrc=dl_win32.xs dlext=dll d_dlsymun=undef ccdlflags=' ' cccdlflags=' ' lddlflags='-mdll -s -L"c:\perl\lib\CORE" -L"C:\MinGW\lib"' @INC for perl 5.29.3: J:/dev/perl/git/perl/lib Environment for perl 5.29.3: HOME (unset) LANG (unset) LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=c:\sperl-5.22.0.1-64bit-portable\perl\site\bin;c:\sperl-5.22.0.1-64bit-portable\perl\bin;c:\sperl-5.22.0.1-64bit-portable\c\bin;C:\Program Files (x86)\Common Files\Oracle\Java\javapath;C:\ProgramData\Oracle\Java\javapath;C:\Program Files (x86)\Common Files\Intel\Shared Files\cpp\bin\Intel64;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\bin\;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\libnvvp\;C:\Program Files (x86)\Common Files\MICROSOFT SHARED\WINDOWS LIVE;C:\PROGRAM FILES (X86)\MIKTEX 2.8\MIKTEX\BIN;C:\Windows\SYSTEM32;C:\Windows;C:\Windows\SYSTEM32\WBEM;C:\APPS\GIT\GIT\CMD;C:\PROGRAM FILES (X86)\ORACLE\BERKELEY DB 11GR2 5.3.15\BIN;C:\Program Files\WIDCOMM\Bluetooth Software\;C:\Program Files\WIDCOMM\Bluetooth Software\syswow64;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files (x86)\Microsoft SDKs\TypeScript\1.0\;C:\Program Files\Microsoft SQL Server\120\Tools\Binn\;C:\Users\tony\.dnx\bin;C:\Program Files\Microsoft DNX\Dnvm\;C:\Program Files (x86)\nodejs\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;J:\postgresql\pg96\bin;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\Git\cmd;C:\Users\tony\.dnx\bin;C:\Users\tony\AppData\Roaming\npm;C:\Program Files (x86)\Dr. Memory\bin\ PERL_BADLANG (unset) SHELL (unset) ```

p5pRT commented 6 years ago

From @xenu

On Sun\, 02 Sep 2018 18:26:01 -0700 "Tony Cook $via RT$" \perlbug\-followup@perl\.org wrote:

Parsing non-ASCII command-lines on Win32 is moderately broken.

Whether in codepage 65001 or not\, non-ASCII (or at least\, non-system codepage) characters are not preserved. For example:

J:\dev\perl\git\perl\win32>..\perl -I..\lib -MDevel::Peek -Mutf8 -e "Dump(qq(??) )" SV = PV(0x34f6f8) at 0x67c578 REFCNT = 1 FLAGS = (POK\,IsCOW\,READONLY\,PROTECT\,pPOK) PV = 0x688998 "a?"\0 CUR = 2 LEN = 10 COW_REFCNT = 0

J:\dev\perl\git\perl\win32>..\perl -I..\lib -MDevel::Peek -Mutf8 -e "Dump(shift) " ?? SV = PV(0x52efb8) at 0x5bbfa0 REFCNT = 1 FLAGS = (TEMP\,POK\,pPOK) PV = 0x5abfe8 "a?"\0 CUR = 2 LEN = 10

The -CA switch makes no difference.

While this is caused by the interaction between the C runtime and the Win32 API\, should we be bypassing that to get some sensible behaviour out of it?

Simplest (for the user) would be for -CA to retrieve the UTF-16 command-line arguments and convert them to UTF-8 before @ARGV is populated\, though this is complicated by us parsing argv while we're updating it.

Relying only on -CA might also have some backcompat issues\, since if there's Win32 code that blindly sets -CA\, it might break if we start returning the real value. This change could be considered a bug fix though.

To simplify the implementation we might depend on an environment variable instead\, but that has the same sort of problem that setting PERL_UNICODE= by default had on file handling.

While this ticket is about @ARGV and not filenames\, I consider it a duplicate of https://rt.perl.org/Public/Bug/Display.html?id=130831 (see also a recent thread on p5p: https://www.nntp.perl.org/group/perl.perl5.porters/2018/08/msg251899.html).

It's *exactly* the same problem. In this case\, command line arguments are being converted from the console codepage to the system codepage (usually they're not the same!). If a character is invalid or doesn't exist in the system codepage\, it is being replaced with either '?' (most locales) or REPLACEMENT CHARACTER (when *system* codepage is set to 65001\, which is possible only on Windows 10).

While this brings *horrible* experience to perl users\, I don't think it's a bug. Also\, changing this behaviour would obviously break stuff.

p5pRT commented 6 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 6 years ago

From @tonycoz

On Sun\, 02 Sep 2018 19:49:57 -0700\, me@xenu.pl wrote:

While this ticket is about @ARGV and not filenames\, I consider it a duplicate of https://rt.perl.org/Public/Bug/Display.html?id=130831 (see also a recent thread on p5p: https://www.nntp.perl.org/group/perl.perl5.porters/2018/08/msg251899.html).

It's *exactly* the same problem. In this case\, command line arguments are being converted from the console codepage to the system codepage (usually they're not the same!). If a character is invalid or doesn't exist in the system codepage\, it is being replaced with either '?' (most locales) or REPLACEMENT CHARACTER (when *system* codepage is set to 65001\, which is possible only on Windows 10).

It's certainly a strongly related problem\, but it's not the same problem.

It came up while I was diagnosing code that accepted a filename (the name in the test case) and called the Win32 specific APIs to open it and failed.

While this brings *horrible* experience to perl users\, I don't think it's a bug. Also\, changing this behaviour would obviously break stuff.

I think it's something we can improve.

The main issue right now is code that accepts strings from the command-line get nonsensical results - unless the caller does nonsensical things.

The attached patch modifies perl to re-generate argv from the UTF-16 command-line if it sees a -CA switch\, and it works for me for commands run from the command prompt.

However\, one test fails:

run/switchC.t .. 1..15 ok 1 - -CO: no warning on UTF-8 output ok 2 - -C2: no warning on UTF-8 output ok 3 - -CI: read in UTF-8 input ok 4 - -CE: UTF-8 stderr ok 5 - -Co: auto-UTF-8 open for output ok 6 - -Ci: auto-UTF-8 open for input ok 7 - -Ci: auto-UTF-8 open for input affects the current file ok 8 - -Ci: auto-UTF-8 open for input has file scope # Failed test 9 - -CA: @ARGV at run/switchC.t line 78 # got '196' not ok 9 - -CA: @ARGV# expected /(?^s:^256(?:\r?\n)?$)/

ok 10 - \#!perl -C ok 11 - \#!perl -C followed by another switch ok 12 - \#!perl -C\ followed by another switch ok 13 - \#!perl -C with different -C on command line ok 14 - \#!perl -C but not command line ok 15 - perl -C00 [perl \#123991] Failed 1/15 subtests

This fails because backticks have the same type of problem - backticks don't know about unicode either. By the time the chr(256) gets to win32_popen() all it sees is "\xc4\c80"\, and it can't tell if that was Latin1 or UTF-8\, so it's passed through as 00C4 0080 rather than 0100.

Tony

p5pRT commented 6 years ago

From @tonycoz

0001-perl-133496-convert-utf-16-arguments-to-utf-8-if-CA-.patch

```diff From 3c90673de0140195fa62034ce7b49f7fd9949d8c Mon Sep 17 00:00:00 2001 From: Tony Cook Date: Wed, 5 Sep 2018 13:24:30 +1000 Subject: (perl #133496) convert utf-16 arguments to utf-8 if -CA set perl generally uses the ANSI/multi-byte APIs on Win32, which means that the Unicode values that Windows keeps can be corrupted when accepted by perl. When the -CA switch is supplied this change refetches and splits the UTF-16 the command line from the system and converts each argument to UTF-8, replacing the original argv. This has two effects: - any other following arguments, especially including -[eE] code, are encoded as UTF-8, so -Mutf8 works sensibly - @ARGV is properly populated with Unicode arguments instead of the ANSI-fied version of the UTF-16 arguments. With this change, the test for -CA in run/switchC.t fails, since it ends up passes "\xc4\x80" to system, which ends up as UTF-16 00c4 0080 rather than 0100. --- perl.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 61 insertions(+), 1 deletion(-) diff --git a/perl.c b/perl.c index 30ee577483..10a3510c11 100644 --- a/perl.c +++ b/perl.c @@ -42,6 +42,10 @@ #include "nwutil.h" #endif +#ifdef WIN32 +#include +#endif + #ifdef DEBUG_LEAKING_SCALARS_FORK_DUMP # ifdef I_SYSUIO # include @@ -1557,6 +1561,55 @@ Perl_call_atexit(pTHX_ ATEXIT_t fn, void *ptr) ++PL_exitlistlen; } +#ifdef WIN32 + +#define win32_upgrade_argv(pargv) S_win32_upgrade_argv(aTHX_ (pargv)) + +STATIC void +S_win32_upgrade_argv(pTHX_ char ***pargv) { + LPWSTR cmdline = GetCommandLineW(); + int num_args; + LPWSTR *argsw = CommandLineToArgvW(cmdline, &num_args); + STRLEN argsc_sz = + WideCharToMultiByte(CP_UTF8, 0, cmdline, -1, + NULL, 0, 0, 0); + /* maybe combine these into one block */ + /* FIXME: leaking this memory for now */ + LPSTR *argvc = safemalloc(sizeof(LPSTR) * (num_args+1) + argsc_sz); + LPSTR argsc = (LPSTR)(argvc + num_args+1); + LPSTR argsc_end = argsc + argsc_sz; + LPSTR dest = argsc; + int i; + + for (i = 0; i < num_args; ++i) { + argvc[i] = dest; + dest += WideCharToMultiByte(CP_UTF8, 0, argsw[i], -1, + dest, argsc_end - dest, 0, 0); + } + argvc[i] = NULL; + + LocalFree(argsw); + + /* this assumes both the CRT and CommandLineToArgW() split + the command-line in the same way + */ + assert(num_args == PL_origargc); + *pargv = argvc + (*pargv - PL_origargv); + + SAVEFREEPV(argvc); +} + +#define WIN32_UPGRADE_ARGV \ + STMT_START { \ + if (!replaced_args && (PL_unicode & PERL_UNICODE_ARGV_FLAG)) { \ + win32_upgrade_argv(&argv); \ + replaced_args = TRUE; \ + } \ + } STMT_END +#else +#define WIN32_UPGRADE_ARGV NOOP +#endif + /* =for apidoc Am|int|perl_parse|PerlInterpreter *my_perl|XSINIT_t xsinit|int argc|char **argv|char **env @@ -2034,6 +2087,9 @@ S_parse_body(pTHX_ char **env, XSINIT_t xsinit) SV *linestr_sv = NULL; bool add_read_e_script = FALSE; U32 lex_start_flags = 0; +#ifdef WIN32 + bool replaced_args = FALSE; +#endif PERL_SET_PHASE(PERL_PHASE_START); @@ -2072,8 +2128,9 @@ S_parse_body(pTHX_ char **env, XSINIT_t xsinit) case 'W': case 'X': case 'w': - if ((s = moreswitches(s))) + if ((s = moreswitches(s))) { goto reswitch; + } break; case 't': @@ -2192,6 +2249,7 @@ S_parse_body(pTHX_ char **env, XSINIT_t xsinit) default: Perl_croak(aTHX_ "Unrecognized switch: -%s (-h will show valid options)",s); } + WIN32_UPGRADE_ARGV; } } @@ -2265,6 +2323,7 @@ S_parse_body(pTHX_ char **env, XSINIT_t xsinit) #endif } else { moreswitches(d); + WIN32_UPGRADE_ARGV; } } } @@ -2387,6 +2446,7 @@ S_parse_body(pTHX_ char **env, XSINIT_t xsinit) linestr_sv = newSV_type(SVt_PV); lex_start_flags |= LEX_START_COPIED; find_beginning(linestr_sv, rsfp); + WIN32_UPGRADE_ARGV; if (cddir && PerlDir_chdir( (char *)cddir ) < 0) Perl_croak(aTHX_ "Can't chdir to %s",cddir); } -- 2.14.1.windows.1 ```

p5pRT commented 6 years ago

From @tonycoz

On Tue\, 04 Sep 2018 20:40:16 -0700\, tonyc wrote:

While this brings *horrible* experience to perl users\, I don't think it's a bug. Also\, changing this behaviour would obviously break stuff.

I think it's something we can improve.

The main issue right now is code that accepts strings from the command-line get nonsensical results - unless the caller does nonsensical things.

The attached patch modifies perl to re-generate argv from the UTF-16 command-line if it sees a -CA switch\, and it works for me for commands run from the command prompt.

Also\, it breaks embedding\, so don't apply this patch.

Maybe an alternative is to not make it depend on the -CA switch\, but on the current code page.

If the current code page is 65001 then main() (in win32/runperl.c) could do the conversion to utf-8 I do in my patch.

The program then depends on the normal -CA behaviour to treat that as UTF-8\, so perl code sees Unicode in @ARGV.

It does mean that a user has to do something unusual (chcp 65001) to get reasonable behaviour.

Tony

p5pRT commented 6 years ago

From @khwilliamson

On 09/05/2018 05:20 PM\, Tony Cook via RT wrote:

On Tue\, 04 Sep 2018 20:40:16 -0700\, tonyc wrote:

While this brings *horrible* experience to perl users\, I don't think it's a bug. Also\, changing this behaviour would obviously break stuff.

I think it's something we can improve.

The main issue right now is code that accepts strings from the command-line get nonsensical results - unless the caller does nonsensical things.

The attached patch modifies perl to re-generate argv from the UTF-16 command-line if it sees a -CA switch\, and it works for me for commands run from the command prompt.

Also\, it breaks embedding\, so don't apply this patch.

Maybe an alternative is to not make it depend on the -CA switch\, but on the current code page.

If the current code page is 65001 then main() (in win32/runperl.c) could do the conversion to utf-8 I do in my patch.

The program then depends on the normal -CA behaviour to treat that as UTF-8\, so perl code sees Unicode in @ARGV.

It does mean that a user has to do something unusual (chcp 65001) to get reasonable behaviour.

Tony

--- via perlbug: queue: perl5 status: open https://rt-archive.perl.org/perl5/Ticket/Display.html?id=133496

I haven't looked at this thread in detail\, but using script runs can be used to disambiguate some things. Perhaps that would help\, so think about that possibility. I have unreleased code for Pod::Simple that improves CP1252 vs UTF-8 detection that might be instructive.

p5pRT commented 6 years ago

From @xenu

On Wed\, 05 Sep 2018 16:20:03 -0700 "Tony Cook via RT" \perlbug\-followup@perl\.org wrote:

Also\, it breaks embedding\, so don't apply this patch.

Maybe an alternative is to not make it depend on the -CA switch\, but on the current code page.

If the current code page is 65001 then main() (in win32/runperl.c) could do the conversion to utf-8 I do in my patch.

The program then depends on the normal -CA behaviour to treat that as UTF-8\, so perl code sees Unicode in @ARGV.

It does mean that a user has to do something unusual (chcp 65001) to get reasonable behaviour.

Tony

You mean the console codepage? There are some problem with that approach.

Console codepages don't exist in windows subsystem applications (like wperl.exe)\, GetConsoleCP() returns 0 in them:

C:\Users\xenu>wperl -MWin32 -E "open my($fh)\, '>'\, 'a.txt'; print {$fh} Win32::GetConsoleCP()" C:\Users\xenu>type a.txt 0

Another problem is that it won't cover situations where it's impossible to change console codepage\, for example when perl.exe is launched via explorer.exe (e.g. via .lnk shortcut or when some file extension is associated with a perl script).

I think that the only reasonable way to fix the win32 unicode bug is to introduce a way to globally force utf-8 everywhere\, i.e. @ARGV\, filenames and env variables. -C flag used to serve this exact purpose[1]\, but this functionality was removed in 5.8.1.

IMO we should reintroduce that switch.

On second thought\, I think\, in the long run\, we should enable unicode handling by default and add a switch which would restore the old behaviour for scripts that rely on it. IMO that would be the most reasonable approach\, because the current behaviour is *completely* broken and I'm pretty sure that changing it would fix more code than it would break.

[1] - https://perldoc.pl/5.8.0/perlrun#-C

p5pRT commented 6 years ago

From @tonycoz

On Thu\, 06 Sep 2018 11:44:09 -0700\, me@xenu.pl wrote:

On Wed\, 05 Sep 2018 16:20:03 -0700 "Tony Cook via RT" \perlbug\-followup@perl\.org wrote:

Also\, it breaks embedding\, so don't apply this patch.

Maybe an alternative is to not make it depend on the -CA switch\, but on the current code page.

If the current code page is 65001 then main() (in win32/runperl.c) could do the conversion to utf-8 I do in my patch.

The program then depends on the normal -CA behaviour to treat that as UTF-8\, so perl code sees Unicode in @ARGV.

It does mean that a user has to do something unusual (chcp 65001) to get reasonable behaviour.

Tony

You mean the console codepage? There are some problem with that approach.

Console codepages don't exist in windows subsystem applications (like wperl.exe)\, GetConsoleCP() returns 0 in them:

C:\Users\xenu>wperl -MWin32 -E "open my($fh)\, '>'\, 'a.txt'; print {$fh} Win32::GetConsoleCP()" C:\Users\xenu>type a.txt 0

Another problem is that it won't cover situations where it's impossible to change console codepage\, for example when perl.exe is launched via explorer.exe (e.g. via .lnk shortcut or when some file extension is associated with a perl script).

I think that the only reasonable way to fix the win32 unicode bug is to introduce a way to globally force utf-8 everywhere\, i.e. @ARGV\, filenames and env variables. -C flag used to serve this exact purpose[1]\, but this functionality was removed in 5.8.1.

The argv handling looks similar to what my patch does - with the same problem for embedding.

The wide system calls handling appears to assume all SVs are UTF-8 encoded\, even without the SVf_UTF8 flag set.

IMO we should reintroduce that switch.

On second thought\, I think\, in the long run\, we should enable unicode handling by default and add a switch which would restore the old behaviour for scripts that rely on it. IMO that would be the most reasonable approach\, because the current behaviour is *completely* broken and I'm pretty sure that changing it would fix more code than it would break.

I think fixing @ARGV would be reasonably painless for backcompat\, but the rest wide-character support is too likely to break things\, I think.

I wonder how much CPAN testing was done with -C for perl 5.8.0.

Tony

Perl / perl5

@ARGV, -CA and Win32 #16681

From @tonycoz

Created by @tonycoz

From @xenu

From @tonycoz

From @tonycoz

From @tonycoz

From @khwilliamson

From @xenu

From @tonycoz