`isuppercase`/`islowercase` fail on invalid characters

JuliaLang / julia


`isuppercase`/`islowercase` fail on invalid characters #54343

Open Seelengrab opened 5 months ago

Seelengrab commented 5 months ago

MWE:

julia> isuppercase('\xf0\x8e\x80\x80')
ERROR: Base.InvalidCharError{Char}('\xf0\x8e\x80\x80')
Stacktrace:
 [1] throw_invalid_char(c::Char)
   @ Base ./char.jl:86
 [2] UInt32
   @ ./char.jl:133 [inlined]
 [3] isuppercase(c::Char)
   @ Base.Unicode ./strings/unicode.jl:403
 [4] top-level scope
   @ REPL[12]:1

julia> Base.ismalformed('\xf0\x8e\x80\x80')
false

Either throwing here is intentional, or we can safely return `false`, as is already done for malformed characters. Does utf8proc handle invalid/malformed chars on its own? The docs aren't clear about this.

stevengj commented 5 months ago

I think we should clearly be returning `false` here, as we do for malformed characters.

Malformed chars can never get passed to utf8proc in the first place: if there is no way to convert them to a `UInt32` codepoint, you can't pass them to the utf8proc API.

On invalid codepoints, `utf8proc_isupper(codepoint)` should return `false`.

stevengj commented 5 months ago

Isn't this a bug in `ismalformed`? If it can't be converted to a codepoint, isn't it malformed?

Or should we have another predicate for this case, where it fails because it is an overlong encoding (`Base.is_overlong_enc` returns `true` inside `UInt32(c)`)?
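To make the overlong case concrete, here is a small sketch that decodes the four bytes from the MWE by hand using ordinary 4-byte UTF-8 payload extraction. The variable names are illustrative only, and `Base.ismalformed` / `Base.isoverlong` are unexported internals, so treat this as an exploration aid rather than a statement about how `Base` decodes chars.

```julia
# Bytes of the character from the MWE.
bytes = UInt8[0xf0, 0x8e, 0x80, 0x80]

# 4-byte UTF-8 layout: 3 payload bits in the lead byte,
# 6 payload bits in each continuation byte.
cp = (UInt32(bytes[1] & 0x07) << 18) |
     (UInt32(bytes[2] & 0x3f) << 12) |
     (UInt32(bytes[3] & 0x3f) << 6)  |
      UInt32(bytes[4] & 0x3f)

# Number of bytes the shortest (canonical) encoding of that code point needs.
shortest = ncodeunits(string(Char(cp)))

c = '\xf0\x8e\x80\x80'
malformed = Base.ismalformed(c)  # structurally well-formed UTF-8
overlong  = Base.isoverlong(c)   # but longer than the canonical encoding
```

The payload bits decode to U+E000, which needs only three bytes in its canonical form, so the four-byte form is structurally well-formed (not malformed) yet overlong.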

stevengj commented 5 months ago

Maybe the check at https://github.com/JuliaLang/julia/blob/dbf0bab59ddc28f1c240fa618bf0e23194954bbe/base/strings/unicode.jl#L414-L415 should just be calling `isvalid(c)` instead of `ismalformed(c)`?

Or better yet, just `(ismalformed(c) | isoverlong(c))`, since utf8proc checks for the other cases.

Or better yet, shouldn't we have a predicate

hascodepoint(c::AbstractChar) = !(ismalformed(c) | isoverlong(c))

to check whether one can call `codepoint(c)` (== `UInt32(c)`)?
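As a rough sketch of how such a predicate could be used from user code, the following defines it outside `Base` and wraps the case predicates so they return `false` instead of throwing. `hascodepoint` is only the name proposed in this comment, and `guarded_isuppercase` / `guarded_islowercase` are hypothetical helpers, none of which exist in `Base`; `Base.ismalformed` and `Base.isoverlong` are unexported internals.

```julia
# Proposed predicate from the comment above; not a Base function.
hascodepoint(c::AbstractChar) = !(Base.ismalformed(c) | Base.isoverlong(c))

# Hypothetical wrappers: short-circuit to false when no code point
# can be extracted, so the underlying predicate is never reached.
guarded_isuppercase(c::AbstractChar) = hascodepoint(c) && isuppercase(c)
guarded_islowercase(c::AbstractChar) = hascodepoint(c) && islowercase(c)

overlong = '\xf0\x8e\x80\x80'    # the character from the MWE

r_has   = hascodepoint(overlong)
r_upper = guarded_isuppercase(overlong)
r_lower = guarded_islowercase(overlong)
r_A     = guarded_isuppercase('A')
r_a     = guarded_islowercase('a')
```

Because `&&` short-circuits, the wrappers never call `isuppercase` or `islowercase` on the overlong character, so they do not depend on whether a given Julia version throws there.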

stevengj commented 2 months ago

As discussed in #54393, the conclusion is that `codepoint(c)` should succeed whenever `!ismalformed(c)`, including for overlong encodings.