Dual-Life / Time-Piece

Object Oriented time objects
Other
15 stars 33 forks source link

Cast isdigit argument as unsigned char #63

Open afresh1 opened 1 year ago

afresh1 commented 1 year ago

Theo noticed that these isdigit calls were not obviously correct. He suggested reporting this upstream. The only portable safe way is to always cast to (unsigned char).

http://man.openbsd.org/isdigit#CAVEATS

The argument c must be EOF or representable as an unsigned char; otherwise, the result is undefined.

Originally filed as https://github.com/Perl/perl5/pull/20649

khwilliamson commented 1 year ago

I suspect that isdigit itself should not be used here. I'll quote https://github.com/Perl/perl5/pull/20649#issuecomment-1364754327 here:

My question is why is the code using isdigit() in the first place? To quote perllocale

By default, Perl itself (outside the POSIX module) ignores the current locale. The "use locale" pragma tells Perl to use the current locale for some operations.

People are not generally expecting locale-dependent behaviour unless it is explicitly asked for, or documented to always be there, as just a few POSIX module operations do.

Also note the contradiction in the OpenBSD man page that this ticket links to

The complete list of decimal digits is 0 and 1–9, in any locale

and later

On systems supporting non-ASCII single-byte character encodings, different c arguments may correspond to the digits, and the results of isdigit() may depend on the LC_CTYPE locale

ISO 8859-11 contains the Thai characters for 0-9 at positions F0-F9. (Unicode U+0E50 - U+0E59). This is the only single-byte locale I know of that has a second set of digits in it.

Perl furnishes the macros isDIGIT for when not in use locale and isDIGIT_LC for when in it. I have been considering writing a set of macros that XS code can call to DTRT without having to worry about being in use locale scope or not. I haven't come up with a good name paradigm. Suggestions welcome.

Also note that OpenBSD is probably not the best place to find accurate locale-related documentation. This commit message gives a decent explanation, in my opinion:

commit be552ced7556db8582622a745067e1b266e8ca91 Author: Karl Williamson khw@cpan.org Date: Wed Aug 24 20:47:55 2022 -0600

 run/locale.t: skip illegal locale test when all are legal
afresh1 commented 1 year ago

Whether isdigit() is the correct hammer for this nail is a separate question. As discussed over in the related Bzip PR:

As it happens, upstream already has a ticket for this issue -- see https://sourceware.org/bugzilla/show_bug.cgi?id=28283. It was created in August 2021. I'll add a comment to the ticket.

Also - found a good reference for making the change -- https://wiki.sei.cmu.edu/confluence/display/c/STR37-C.+Arguments+to+character-handling+functions+must+be+representable+as+an+unsigned+char

It seems fairly clear that using isdigit() without the cast is wrong everywhere, not just in OpenBSD.