Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

can't read pinyin characters from terminal #13668

Open p5pRT opened 10 years ago

p5pRT commented 10 years ago

Migrated from rt.perl.org#121450 (status was 'open')

Searchable as RT121450$

p5pRT commented 10 years ago

From ntysdd@gmail.com

Created by ntysdd@gmail.com

Using Strawberry Perl portable under a Simplified Chinese environment (CP936), I found that perl can't read pinyin characters properly from a terminal.

Example:

> perl -ne "print"
> nǐtàiyánsù
n t iy ns

Chinese characters are OK. Reading from a file using redirection is also OK. Only the combination of terminal input and pinyin goes wrong.

Perl Info

```
Flags:
    category=core
    severity=low

Site configuration information for perl 5.18.2:

Configured by strawberry-perl at Tue Jan 7 16:32:09 2014.

Summary of my perl5 (revision 5 version 18 subversion 2) configuration:

  Platform:
    osname=MSWin32, osvers=6.2, archname=MSWin32-x86-multi-thread-64int
    uname='Win32 strawberry-perl 5.18.2.1 #1 Tue Jan 7 16:30:36 2014 i386'
    config_args='undef'
    hint=recommended, useposix=true, d_sigaction=undef
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags =' -s -O2 -DWIN32 -DPERL_TEXTMODE_SCRIPTS -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -fno-strict-aliasing -mms-bitfields',
    optimize='-s -O2',
    cppflags='-DWIN32'
    ccversion='', gccversion='4.7.3', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long long', ivsize=8, nvtype='double', nvsize=8, Off_t='long long', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='g++.exe', ldflags ='-s -L"F:\mono\perl\perl\lib\CORE" -L"F:\mono\perl\c\lib"'
    libpth=F:\mono\perl\c\lib F:\mono\perl\c\i686-w64-mingw32\lib F:\mono\perl\c\lib\gcc\i686-w64-mingw32\4.7.3
    libs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
    perllibs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
    libc=, so=dll, useshrplib=true, libperl=libperl518.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags='-mdll -s -L"F:\mono\perl\perl\lib\CORE" -L"F:\mono\perl\c\lib"'

Locally applied patches:

@INC for perl 5.18.2:
    F:/mono/perl/perl/site/lib
    F:/mono/perl/perl/vendor/lib
    F:/mono/perl/perl/lib
    .

Environment for perl 5.18.2:
    HOME (unset)
    LANG=zh_CN
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=F:\mono\perl\perl\site\bin;F:\mono\perl\perl\bin;F:\mono\perl\c\bin;C:\Program Files\Broadcom\Broadcom 802.11 Network Adapter;;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\Windows Kits\8.1\Windows Performance Toolkit\;C:\Program Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files\GNU\GnuPG\pub
    PERL_BADLANG (unset)
    SHELL (unset)
```

p5pRT commented 10 years ago

From @jkeenan

Can anyone familiar with CP936 reproduce this?

p5pRT commented 10 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 10 years ago

From @khwilliamson

I'm trying to understand this report. I am not familiar with CP936, but I looked it up, and it is a one- and two-byte encoding. Perl supports internally only single-byte encodings, plus, starting in 5.20, UTF-8. So this encoding shouldn't be expected to work in Perl. What one is supposed to do is to use the Encode module to translate the encoding into Perl's internal form on input, and transform back on output. An example I found is http://www.perlmonks.org/?node_id=537416
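
For illustration only, the decode-on-input, encode-on-output approach described above could look roughly like the following. This is a minimal sketch, not code from the ticket: `read_cp936_line` is a hypothetical helper, and it reads from an in-memory filehandle rather than a real terminal, so it shows only the decoding step and says nothing about whether the console delivers the right bytes in the first place.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: read one line of CP936 bytes through a decoding
# PerlIO layer and return it as a Perl character string (newline removed).
sub read_cp936_line {
    my ($bytes) = @_;
    # In-memory filehandle stands in for STDIN in this sketch.
    open(my $in, '<:encoding(cp936)', \$bytes) or die "open failed: $!";
    my $line = <$in>;
    close $in;
    chomp $line;
    return $line;
}

# 6E = "n", A8 AB = the two-byte CP936 sequence quoted later in this thread.
my $text = read_cp936_line("\x6E\xA8\xAB\x0A");

# Report the decoded characters as code points.
printf "%d characters: %s\n", length($text),
    join(' ', map { sprintf 'U+%04X', ord($_) } split //, $text);

# Transform back to CP936 bytes on output and show them as hex.
print sprintf('%v02X', encode('cp936', $text)), "\n";
```

With a real terminal the same layer would normally be applied with `binmode(STDIN, ':encoding(cp936)')`.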

p5pRT commented 10 years ago

From @ikegami

I'll see what I can find out tonight. Can you please provide the output of the following in the meantime?

chcp & perl -MWin32 -MWin32::Console -E"say for Win32::GetACP(), Win32::GetOEMCP(), Win32::Console->new(STD_INPUT_HANDLE)->InputCP(), Win32::Console->new(STD_OUTPUT_HANDLE)->OutputCP();"

p5pRT commented 10 years ago

From @ikegami

I haven't found anything that helps you. Still waiting on your feedback. Would also like to see the output of perl -ne"printf qq{%v02X\n}, $_" for that same input.

p5pRT commented 10 years ago

From ntysdd@gmail.com

Active code page: 936
936
936
936
936

p5pRT commented 10 years ago

From @tonycoz

On Sun Mar 16 00:41:07 2014, ntysdd@gmail.com wrote:

> Using strawberryperl portable under a simplified Chinese env.(CP936) Found perl can't read pinyin chars properly from a terminal.
>
> Example:
>
> > perl -ne "print"
> > nǐtàiyánsù
> n t iy ns
>
> Chinese characters are OK. Reading from a file using redirection is also OK. Only terminal plus pinyin will get wrong.

I wonder if this is related to #13794

Tony

p5pRT commented 10 years ago

From @ikegami

On Mon, Jul 7, 2014 at 5:14 AM, Tony Cook via RT <perlbug-followup@perl.org> wrote:

> I wonder if this is related to https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121783

No. The non-ASCII chars are filtered out on or before input. It's not an output issue.

The program is getting a NUL where the non-ASCII chars are supposed to be (6E.00.74.00.69.79.00.6E.73.00.0A). I have no idea why.

p5pRT commented 10 years ago

From @khwilliamson

On 07/07/2014 09:25 AM, Eric Brine wrote:

> No. The non-ASCII chars are filtered out on or before input. It's not an output issue.
>
> The program is getting a NUL where the non-ASCII chars are supposed to be (6E.00.74.00.69.79.00.6E.73.00.0A). I have no idea why.

I'm still having trouble grokking this issue. According to http://msdn.microsoft.com/en-US/goglobal/cc305153, CP936 is ASCII plus 0x80 for the EURO SIGN. 0xFF is undefined, and 0x81 - 0xFE each start a two-byte sequence that gives various ideographs.

I don't understand what it might mean to input an accented Latin character when it appears to me that the terminal is not set up to understand them.

p5pRT commented 10 years ago

From @ikegami

On Tue, Jul 8, 2014 at 2:57 PM, Karl Williamson <public@khwilliamson.com> wrote:

> I'm still having trouble grokking this issue.

If I enter "nitàiyánsù" into my cp850 terminal, I expect to get the cp850 encoding of those characters from STDIN, and I do.

perl -MEncode -Mcharnames=:full -nlE"say sprintf '%v02X', $_; say charnames::viacode(ord) for split //, decode('cp850', $_);"
nitàiyánsù
6E.69.74.85.69.79.A0.6E.73.97
LATIN SMALL LETTER N
LATIN SMALL LETTER I
LATIN SMALL LETTER T
LATIN SMALL LETTER A WITH GRAVE
LATIN SMALL LETTER I
LATIN SMALL LETTER Y
LATIN SMALL LETTER A WITH ACUTE
LATIN SMALL LETTER N
LATIN SMALL LETTER S
LATIN SMALL LETTER U WITH GRAVE
^Z

He enters "nǐtàiyánsù" into his cp936 terminal. He expects to get the cp936 encoding of those characters from STDIN. He doesn't.

6E.A8.AB.74.A8.A4.69.79.A8.A2.6E.73.A8.B4 is what he expects to get
6E.00.   74.00.   69.79.00.   6E.73.00    is what he gets
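
To make the difference between the two dumps concrete, here is a small sketch that turns the expected bytes into the observed ones by replacing each two-byte sequence with a single NUL. The helper names `parse_hex` and `collapse_double_bytes` are made up for this illustration, and the lead-byte range 0x81-0xFE is the one given earlier in the thread. It only describes the pattern in this one sample; it is not a claim about what Windows or perl does internally, and since the reporter says Chinese characters (also two-byte sequences) arrive intact, the lead byte alone cannot be the real rule.

```perl
use strict;
use warnings;

# Parse a dotted hex dump such as "6E.A8.AB" into a list of byte values.
sub parse_hex {
    return map { hex($_) } split /\./, $_[0];
}

# Replace each two-byte sequence (lead byte 0x81-0xFE plus its trail byte)
# with a single NUL; pass every other byte through unchanged.
sub collapse_double_bytes {
    my @in = @_;
    my @out;
    while (@in) {
        my $byte = shift @in;
        if ($byte >= 0x81 && $byte <= 0xFE && @in) {
            shift @in;          # discard the trail byte
            push @out, 0x00;    # one NUL per two-byte character
        }
        else {
            push @out, $byte;
        }
    }
    return @out;
}

my @expected = parse_hex('6E.A8.AB.74.A8.A4.69.79.A8.A2.6E.73.A8.B4');
my @observed = collapse_double_bytes(@expected);
print join('.', map { sprintf '%02X', $_ } @observed), "\n";
# prints 6E.00.74.00.69.79.00.6E.73.00
```

So in this sample, exactly one NUL arrives per pinyin character, and the ASCII bytes are untouched.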

p5pRT commented 10 years ago

From @khwilliamson

On 07/08/2014 02:26 PM, Eric Brine wrote:

> He enters "nǐtàiyánsù" into his cp936 terminal. He expects to get the cp936 encoding of those characters from STDIN. He doesn't.
>
> 6E.A8.AB.74.A8.A4.69.79.A8.A2.6E.73.A8.B4 is what he expects to get
> 6E.00.   74.00.   69.79.00.   6E.73.00    is what he gets

What I'm saying is there is no encoding in cp936 for those characters.

p5pRT commented 10 years ago

From @ikegami

On Tue, Jul 8, 2014 at 4:48 PM, Karl Williamson <public@khwilliamson.com> wrote:

> What I'm saying is there is no encoding in cp936 for those characters.

$ perl -MEncode -E'use utf8; $_="nǐtàiyánsù"; say sprintf "%v02X", encode "cp936", $_;'
6E.A8.AB.74.A8.A4.69.79.A8.A2.6E.73.A8.B4

Encode seems to think so?

p5pRT commented 10 years ago

From @ikegami

On Tue, Jul 8, 2014 at 5:58 PM, Eric Brine <ikegami@adaelis.com> wrote:

> $ perl -MEncode -E'use utf8; $_="nǐtàiyánsù"; say sprintf "%v02X", encode "cp936", $_;'
> 6E.A8.AB.74.A8.A4.69.79.A8.A2.6E.73.A8.B4
>
> Encode seems to think so?

And so does the page you linked earlier. Lead byte A8: http://msdn.microsoft.com/en-US/goglobal/gg675289

p5pRT commented 6 years ago

From zefram@fysh.org

The encoding should be pretty irrelevant for the test program given. If this were Unix I'd ask to compare perl's behaviour against cat for the same input, using strace to see what the programs actually get. But being Windows, that kind of debugging isn't available. I think the weird behaviour seen must be specific to Windows; it doesn't look like Perl behaviour at all.

-zefram

toddr commented 4 years ago

> From @tonycoz
>
> I wonder if this is related to #13794
>
> Tony

Which was just closed.

ikegami commented 4 years ago

> > From @tonycoz
> >
> > I wonder if this is related to #13794
> >
> > Tony
>
> Which was just closed.

As previously stated, it's not related to #13794.

13794 was fixed in Win10.

This problem still happens.

C:\Users\ikegami>chcp 936
Active code page: 936

C:\Users\ikegami>echo nǐtàiyánsù
nǐtàiyánsù

C:\Users\ikegami>echo nǐtàiyánsù | perl -ne"print"
nǐtàiyánsù

C:\Users\ikegami>perl -ne"print"
nǐtàiyánsù     <- pasted in
n t iy ns
^Z

C:\Users\ikegami>echo nǐtàiyánsù | perl -ne"printf qq{%v02X\n}, $_"
6E.C7.90.74.C3.A0.69.79.C3.A1.6E.73.C3.B9.20.0A

C:\Users\ikegami>perl -ne"printf qq{%v02X\n}, $_"
nǐtàiyánsù
6E.00.74.00.69.79.00.6E.73.00.0A
^Z
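
One detail in the session above: the bytes from the piped case (6E.C7.90...) are not the CP936 bytes quoted earlier in this thread (6E.A8.AB...). They appear to be UTF-8. The following is a hedged sketch for checking that; `bytes_from_dump` and `codepoints_utf8` are hypothetical helper names, and strict decoding is used so that invalid UTF-8 would die rather than pass silently. It says nothing about why the pipe delivers UTF-8 here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Turn a dotted hex dump (as printed by %v02X) back into a byte string.
sub bytes_from_dump {
    return pack 'C*', map { hex($_) } split /\./, $_[0];
}

# Decode the dump strictly as UTF-8 and return its code points as U+XXXX.
sub codepoints_utf8 {
    my ($dump) = @_;
    my $bytes = bytes_from_dump($dump);
    my $text  = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    return map { sprintf 'U+%04X', ord($_) } split //, $text;
}

# Dump taken from the piped (echo ... | perl) case above.
print join(' ', codepoints_utf8('6E.C7.90.74.C3.A0.69.79.C3.A1.6E.73.C3.B9.20.0A')), "\n";
```

The dump decodes cleanly, and the second code point is U+01D0 (ǐ), so the piped case and the pasted case differ in more than just the NULs.
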
khwilliamson commented 4 years ago

Thanks for this example.

What happens if in your paste example, you instead set a $scalar to it, and Devel::Peek Dump that scalar?

ikegami commented 4 years ago

> Thanks for this example.
>
> What happens if in your paste example, you instead set a $scalar to it, and Devel::Peek Dump that scalar?

As you would expect based on the printf %vX:

C:\Users\ikegami>perl -MDevel::Peek -wne"Dump($_)"
nǐtàiyánsù       <-- pasted in
SV = PV(0x114b8d8) at 0x27bbab0
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x27b55a8 "n\0t\0iy\0ns\0\n"\0
  CUR = 11
  LEN = 81
^Z