Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

Locale 'Chinese (Simplified)_China.936' is unsupported #21562

Closed jzy-chitong56 closed 2 weeks ago

jzy-chitong56 commented 7 months ago

Perl 5.38.0.1 on Windows 11 23H2.

Running this batch file prints "Locale 'Chinese (Simplified)_China.936' is unsupported": https://github.com/SMUnlimited/AMAI/blob/master/MakeTFTBase.bat

I never saw this message with the older version I had installed before.

khwilliamson commented 7 months ago

I'm unsure what you mean since you added the "documentation" label.

The only multi-byte encoding perl supports is UTF-8. CP936 is a variable-length encoding of 1 or 2 bytes. Early in the perllocale document, it says:

> Perl supports single-byte locales that are supersets of ASCII, such as the ISO 8859 ones, and one multi-byte-type locale, UTF-8. Perl doesn't support any other multi-byte locales, such as the ones for East Asian languages.

Previously, perl did not detect that such a locale was being used. Depending on what one is doing with the encoding, it might work for limited subsets of inputs, or it could lead to otherwise unexplained crashes. That's why the warning was added.

There is a way to use these kinds of encodings with perl, and that is to translate them to UTF-8 on input, and translate back on output. The Encode::CN module understands 936 and is documented to be usable for this purpose.
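
A minimal sketch of that decode-on-input, encode-on-output pattern, written in Python purely for illustration (in Perl the equivalent round trip would go through the Encode module). The helper name `process_cp936` and the sample text are made up for the example; Python's built-in "cp936" codec stands in for Encode::CN.

```python
# Sketch: handle CP936 only at the program boundary and work on Unicode text inside.
# process_cp936 and the sample string are hypothetical, chosen for illustration.

def process_cp936(raw: bytes) -> bytes:
    """Decode CP936 input, operate on characters, re-encode as CP936."""
    text = raw.decode("cp936")    # bytes -> Unicode characters
    text = text.upper()           # character-level operation, safe on decoded text
    return text.encode("cp936")   # Unicode characters -> CP936 bytes

# Stand-in for bytes read from a CP936-encoded file or console.
sample = "perl 语言".encode("cp936")
result = process_cp936(sample)
print(result.decode("cp936"))
```

The point of the design is that every case-mapping or matching operation sees whole characters, so the two-byte sequences are never split.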

In order to get perl to natively handle this encoding, someone would have to care enough to submit an extensive patch, which would be too large to go in without lots of back-and-forth communication as it was being developed. I doubt it would be looked on favorably, as UTF-8 is the international standard going forward. CP936, according to Wikipedia, was used by 1% of web servers in China as of a year ago. It looks like UTF-8 is used by over 90%.

jzy-chitong56 commented 7 months ago

Sorry, the web version is a bit difficult for Chinese users. Please forgive the label error. I just want to find a solution, because this message has never appeared before and the compilation is normal.

tonycoz commented 7 months ago

Do we need to support other multi-byte locales beyond treating them as a sequence of bytes?

One problem with this warning is that on East Asian installations of Windows the default system encoding is going to be one of these non-UTF-8 MBCS encodings, and for normal users changing this is going to be obscure (and possibly make the UI unreadable, since it changes the display language too).

It's possible to set the system encoding to UTF-8, but at least in Windows 10 this is labelled as a beta.

Normal users in an enterprise setting likely won't have access to this setting since it requires administrative access.

khwilliamson commented 7 months ago

It would be a lot of work to support such locales, and they are being replaced by UTF-8 in the wild.

The warning message is legitimate. But it only applies within the scope of 'use locale', or when one is using one of the locale-dependent POSIX:: functions, or a module whose XS code uses a locale-dependent libc function. The module I'm aware of that ships with core and does this is Time::Piece.

One option that is starting to appeal to me is to suppress this warning on initialization. But it would be nice to be warned before actually trying to use such a locale. We can't police XS code's use, but we could keep a hash of all such locales that have been warned about (typically empty), and when a function is about to be used that is problematic warn once and add to the hash.

A future commit in the pipeline already knows about all such functions in POSIX and those directly used by the perl core.

jzy-chitong56 commented 7 months ago

> One option that is starting to appeal to me is to suppress this warning on initialization. But it would be nice to be warned before actually trying to use such a locale. We can't police XS code's use, but we could keep a hash of all such locales that have been warned about (typically empty), and when a function is about to be used that is problematic warn once and add to the hash.

The warning should be shown at program startup or installation, and users who know what they are doing should be able to turn it off.

But here Perl is run from PowerShell or cmd (a .bat file), and the message shows up in the compilation log. I am not sure whether this can cause Perl to crash.

tonycoz commented 5 months ago

> It would be a lot of work to support such locales, and they are being replaced by UTF-8 in the wild.

I doubt they will be replaced by default on Windows, since it would break backward compatibility.

Could you please describe why the crashes occur? Preferably with a specific case.

I'm trying to understand why we're producing this message; I don't see anything specific in the commit that added the message, nor in the commit it references.

khwilliamson commented 5 months ago

> Do we need to support other multi-byte locales beyond treating them as a sequence of bytes?

If we just treated them as a sequence of bytes, then things like \w wouldn't work, nor uc(), etc. Perl was designed as one byte means one char, and UTF-8 support was later shoe-horned in by using, in part, the UTF-8 flag on scalars. It would be a big deal to add additional options like that. To support single-byte locales, we use things like isalnum() and toupper(). If we were to support two-byte locales, we'd have to change to use iswalnum() and towupper(), and either convert on the fly or change many of our U8 data types to U16.
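
To make the byte-versus-character problem concrete, here is a small Python sketch (not perl code). It relies on the fact that a CP936 trail byte can fall in the ASCII letter range, so a byte-at-a-time upper-casing pass, which is what a single-byte toupper() loop amounts to, rewrites half of a two-byte character. The helper name is hypothetical, and Python's "cp936" codec stands in for the locale.

```python
# Sketch: why byte-at-a-time case mapping is unsafe for a two-byte encoding.

def first_char_with_ascii_letter_trail_byte():
    """Find a CJK character whose second CP936 byte is an ASCII lowercase letter."""
    for cp in range(0x4E00, 0x9FA6):
        try:
            encoded = chr(cp).encode("cp936")
        except UnicodeEncodeError:
            continue  # not representable in CP936; skip it
        if len(encoded) == 2 and ord("a") <= encoded[1] <= ord("z"):
            return chr(cp), encoded
    raise LookupError("no such character found")

char, encoded = first_char_with_ascii_letter_trail_byte()

# bytes.upper() maps ASCII letters one byte at a time, ignoring character boundaries.
mangled = encoded.upper()

print("original bytes:", encoded.hex(), "decodes to", char)
print("after bytewise upper:", mangled.hex(),
      "decodes to", mangled.decode("cp936", errors="replace"))
```

Upper-casing the decoded character instead leaves it unchanged, which is what a wide-character or decode-first approach would give.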

Looking through open tickets, I found these (some of which may be duplicates)

https://github.com/Perl/perl5/issues/16362 https://github.com/Perl/perl5/issues/10258 https://github.com/Perl/perl5/issues/11133 https://github.com/Perl/perl5/issues/13668

But I saw nothing involving segfaults. One can imagine with a shift state locale that things would go awry quickly; less so in CP 936, which is two-byte. From the commit message below, I did run into them, but didn't keep an example.

The message is output in S_new_ctype() in locale.c. It is output when the locale is more than a single byte and isn't UTF-8. A relevant commit is


 Author: Karl Williamson <khw@cpan.org>
 Date:   Mon Apr 12 05:16:56 2021 -0600
 Commit:     Karl Williamson <khw@cpan.org>
 CommitDate: Wed Aug 31 08:37:01 2022 -0600

    Add locale unsupported test

     Perl only supports multi-byte locales that are UTF-8.  It turns out that
     the others are worse than I thought, and if someone switches to one, the
     program can crash.

     This commit generates a default-on diagnostic when switching into such a
     locale.

     The check has been done in various releases for some time, but this
     elevates its severity.

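
The condition described above (more than a single byte per character, and not UTF-8) can be sketched as follows. This is Python for illustration, not the code in locale.c; the function name, the codeset normalization, and the treatment of Windows code page 65001 as UTF-8 are assumptions made for the example. In C, the per-character byte count would typically come from MB_CUR_MAX.

```python
def is_unsupported_locale(max_bytes_per_char: int, codeset: str) -> bool:
    """True when the locale is multi-byte but not UTF-8."""
    normalized = codeset.replace("-", "").replace("_", "").lower()
    is_utf8 = normalized in ("utf8", "65001")  # assumption: 65001 is the UTF-8 code page
    return max_bytes_per_char > 1 and not is_utf8

# Illustrative inputs only; widths are example maximum bytes per character.
for name, width, codeset in [
    ("en_US.ISO-8859-1", 1, "ISO-8859-1"),
    ("en_US.UTF-8", 4, "UTF-8"),
    ("Chinese (Simplified)_China.936", 2, "936"),
]:
    verdict = "unsupported" if is_unsupported_locale(width, codeset) else "supported"
    print(f"{name}: {verdict}")
```
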
jzy-chitong56 commented 5 months ago

@khwilliamson Hi, once the issue is resolved, could you please update Strawberry Perl and ActiveState Perl? Thank you.

And wishing you a Merry Christmas.

khwilliamson commented 5 months ago

As an experiment, I created a branch with that warning suppressed. To my surprise my Linux box did not immediately segfault. I then created a smoke branch, and you can see the results here: url

I looked in more detail at FreeBSD; it segfaulted in that locale, doing collation.

jkeenan commented 5 months ago

> @khwilliamson Hi, once the issue is resolved, could you please update Strawberry Perl and ActiveState Perl? Thank you.
>
> And wishing you a Merry Christmas.

Neither Strawberry Perl nor Active State Perl is maintained by the Perl 5 Porters. Those distributions, which contain considerable code apart from what is maintained here in the Perl core distribution, are updated periodically by their downstream maintainers. In the case of Strawberry Perl, that will probably get a new production release sometime after the release of perl-5.40.0 scheduled for May 2024.

khwilliamson commented 5 months ago

On 12/13/23 04:25, James E Keenan wrote:

> > @khwilliamson Hi, once the issue is resolved, could you please update Strawberry Perl and ActiveState Perl? Thank you.
> >
> > And wishing you a Merry Christmas.
>
> Neither Strawberry Perl nor Active State Perl is maintained by the Perl 5 Porters. Those distributions, which contain considerable code apart from what is maintained here in the Perl core distribution, are updated periodically by their downstream maintainers. In the case of Strawberry Perl, that will probably get a new production release sometime after the release of perl-5.40.0 scheduled for May 2024.

I'm hoping to get something to address this issue in a 5.38 dot release. Would they likely issue something based on that?

shawnlaffan commented 5 months ago

> I'm hoping to get something to address this issue in a 5.38 dot release. Would they likely issue something based on that?

We have built Strawberry Perl 5.38.2 and can build new versions as they are released.

anemochore commented 2 months ago

I'm using Korean Windows 11, and Perl v5.38.2 shows the warning regardless of the active code page (both 949 and 65001). In contrast, v5.34.0 never shows the warning, even with 949. I think the warning message (and whatever behavior it warns about) should go away when using code page 65001. Am I right? Below are two outputs.

v5.38.2

C:\Strawberry\perl\bin>chcp
Active code page: 65001

C:\Strawberry\perl\bin>perl -v
Locale 'Korean_Korea.949' is unsupported, and may crash the interpreter.

This is perl 5, version 38, subversion 2 (v5.38.2) built for MSWin32-x64-multi-thread

v5.34.0

C:\Program Files\Git\usr\bin>chcp
활성 코드 페이지: 949

C:\Program Files\Git\usr\bin>perl -v

This is perl 5, version 34, subversion 0 (v5.34.0) built for x86_64-msys-thread-multi

khwilliamson commented 2 months ago

This is on my list of must-fixes for 5.40.

MasterInQuestion commented 3 weeks ago

Recommendation: disable this check on Windows, as Windows typically defaults to some randomly defined charset, and Windows Perl before (without the warning) didn't seem to exhibit any real issue.

Such a warning on Windows tends to be just noise.

khwilliamson commented 3 weeks ago

#22160 is intended to fix this. I don't have access to a Windows box that I can reproduce this on, so I would appreciate it if someone were to try it out. @jzy-chitong56 @anemochore

shawnlaffan commented 3 weeks ago

Link to PR: https://github.com/Perl/perl5/pull/22160

tonycoz commented 3 weeks ago

> I don't have access to a Windows box that I can reproduce this on

It should be reproducible on any Windows box, though you will need to set the system language to a probably unfamiliar language to do so.

Settings | Time and Language | Language

You may need to add a language at this point (under preferred languages).

Same page, "Administrative language settings", "Administrative" tab, "Change system locale", select a locale (I used Japanese), and reboot.

Once you've finished testing, go through the same sequence to switch back (this is the fun part).

jzy-chitong56 commented 3 weeks ago

> #22160 is intended to fix this. I don't have access to a Windows box that I can reproduce this on, so I would appreciate it if someone were to try it out. @jzy-chitong56 @anemochore

As I said, I can only use the Strawberry version, which has not been updated. I really don't know how to test it. I'm not a real programmer; I just rely on this environment package when compiling files.

I'm sorry I can't help, but I still hope the fix reaches the Strawberry version as soon as possible. Thank you.

khwilliamson commented 2 weeks ago

Fixed via #22160