Open Seelengrab opened 5 months ago
I think we should clearly be returning false here, similar to malformed characters.
Malformed chars can never get passed to utf8proc in the first place — if there is no way to convert them to a UInt32
codepoint, you can't pass them to the utf8proc API.
On invalid codepoints, utf8proc_isupper(codepoint) should return false.
Isn't this a bug in ismalformed? If it can't be converted to a codepoint, isn't it malformed?
Or should we have another predicate in this case, where it's failing because it is an overlong encoding (Base.is_overlong_enc is returning true in UInt32(c))?
Should it just be calling isvalid(c) instead of ismalformed(c)?
Or better yet just (ismalformed(c) | isoverlong(c)), since utf8proc checks for the other cases.
Or better yet, shouldn't we have a predicate
hascodepoint(c::AbstractChar) = !(ismalformed(c) | isoverlong(c))
to check whether one can call codepoint(c) (== UInt32(c))?
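For illustration, here is a minimal sketch of how the proposed predicate would classify a few characters. hascodepoint is the hypothetical helper suggested above and is not part of Base; the sample characters and variable names are assumptions chosen for the example. It only calls Base.ismalformed and Base.isoverlong, and deliberately avoids calling codepoint on the overlong character, since that behavior is exactly what is under discussion here.

```julia
# Hypothetical helper following the proposal in this thread; not part of Base.
# Base.ismalformed and Base.isoverlong are unexported, hence the Base. prefix.
hascodepoint(c::AbstractChar) = !(Base.ismalformed(c) | Base.isoverlong(c))

# Sample characters (assumptions for this example):
ascii_char     = 'A'            # ordinary valid character
malformed_char = "\xff"[1]      # a lone 0xff byte is not valid UTF-8
overlong_char  = "\xc0\x80"[1]  # two-byte overlong encoding of U+0000

for c in (ascii_char, malformed_char, overlong_char)
    println(repr(c),
            "  ismalformed=", Base.ismalformed(c),
            "  isoverlong=", Base.isoverlong(c),
            "  hascodepoint=", hascodepoint(c))
end
```

With this definition, a caller could guard the utf8proc call with hascodepoint(c) and return false otherwise, which would cover both the malformed and the overlong case.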
As discussed in #54393, the conclusion is that codepoint(c) should succeed whenever !ismalformed(c), including for overlong encodings.
MWE:
Either this is a requirement, or we can safely return false here, as is done for malformed characters. Does utf8proc handle invalid/malformed chars on its own? The docs aren't clear about this.