Open afresh1 opened 1 year ago
I suspect that isdigit itself should not be used here. I'll quote https://github.com/Perl/perl5/pull/20649#issuecomment-1364754327 here:
My question is why is the code using isdigit() in the first place? To quote perllocale
By default, Perl itself (outside the POSIX module) ignores the current locale. The "use locale" pragma tells Perl to use the current locale for some operations.
People are not generally expecting locale-dependent behaviour unless it is explicitly asked for, or documented to always be there, as just a few POSIX module operations do.
Also note the contradiction in the OpenBSD man page that this ticket links to
The complete list of decimal digits is 0 and 1–9, in any locale
and later
On systems supporting non-ASCII single-byte character encodings, different c arguments may correspond to the digits, and the results of isdigit() may depend on the LC_CTYPE locale
ISO 8859-11 contains the Thai characters for 0-9 at positions F0-F9. (Unicode U+0E50 - U+0E59). This is the only single-byte locale I know of that has a second set of digits in it.
Perl furnishes the macros isDIGIT
for when not in use locale
and isDIGIT_LC
for when in it. I have been considering writing a set of macros that XS code can call to DTRT without having to worry about being in use locale
scope or not. I haven't come up with a good name paradigm. Suggestions welcome.
Also note that OpenBSD is probably not the best place to find accurate locale-related documentation. This commit message gives a decent explanation, in my opinion:
commit be552ced7556db8582622a745067e1b266e8ca91 Author: Karl Williamson khw@cpan.org Date: Wed Aug 24 20:47:55 2022 -0600
run/locale.t: skip illegal locale test when all are legal
OpenBSD believes that locales are so inherently insecure that it only internally has two locales: C and C.UTF-8, which it knows how to treat securely.
However, to preserve at least the illusion of portability, the locale setting operations do not fail whatever garbage is passed as the locale name. (There may be unacceptable strings, but any reasonable locale name and a lot of unreasonable ones are accepted.) If the input name looks like it was meant to have "UTF-8" in its name, the locale is set to C.UTF-8. Otherwise, it is set to C. Early releases had bugs in switching between the two.
Whether isdigit()
is the correct hammer for this nail is a separate question. As discussed over in the related Bzip
PR:
As it happens, upstream already has a ticket for this issue -- see https://sourceware.org/bugzilla/show_bug.cgi?id=28283. It was created in August 2021. I'll add a comment to the ticket.
Also - found a good reference for making the change -- https://wiki.sei.cmu.edu/confluence/display/c/STR37-C.+Arguments+to+character-handling+functions+must+be+representable+as+an+unsigned+char
It seems fairly clear that using isdigit()
without the cast is wrong everywhere, not just in OpenBSD.
Theo noticed that these isdigit calls were not obviously correct. He suggested reporting this upstream. The only portable safe way is to always cast to (unsigned char).
http://man.openbsd.org/isdigit#CAVEATS
Originally filed as https://github.com/Perl/perl5/pull/20649