JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Unicode case functions don't handle special conventions correctly #19516

Open helgee opened 7 years ago

helgee commented 7 years ago

As previously discussed in #19469

The lowercase and uppercase functions, and the not-yet-merged titlecase function, do not correctly handle the special casing conventions outlined in UTR#21.

Examples

julia> lowercase("OΔΥΣΣΕΥΣ")
"oδυσσευσ" # wrong, uses the non-final sigma
"oδυσσευς" # would be correct, uses the final sigma
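
A minimal sketch of a workaround for the sigma case, assuming a simplified version of the final-sigma rule: a lowercase σ becomes ς when it directly follows a letter and is not directly followed by one. The helper name `lowercase_final_sigma` is hypothetical and not part of Base or utf8proc, and the full Unicode rule also skips over case-ignorable characters, which this sketch does not attempt.

```julia
# Hypothetical helper, not part of Base: lowercases a string and then applies
# a simplified final-sigma rule. Assumption: "final" means preceded by a letter
# and not followed by a letter; case-ignorable characters are not handled.
function lowercase_final_sigma(s::AbstractString)
    chars = collect(lowercase(s))          # Vector{Char}, safe to index by position
    n = length(chars)
    for i in 1:n
        if chars[i] == 'σ'
            prev_is_letter = i > 1 && isletter(chars[i - 1])
            next_is_letter = i < n && isletter(chars[i + 1])
            if prev_is_letter && !next_is_letter
                chars[i] = 'ς'             # use the word-final form
            end
        end
    end
    return join(chars)
end
```

Working on a collected `Vector{Char}` avoids byte-index arithmetic on the UTF-8 string, at the cost of an extra allocation.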

EDIT (2021/03/19): This example has become obsolete due to a 2017 change in German orthography.

julia> uppercase("Spaß")
"SPAß" # wrong
"SPASS" # would have been correct until 2017

stevengj commented 7 years ago

utf8proc implements case-folding, but I don't think it has the info for UTR21? Might require a patch to utf8proc?

stevengj commented 7 years ago

See also JuliaLang/utf8proc#54

helgee commented 3 years ago

The second example works nowadays because German orthography was changed in 2017 to include ẞ, which is an uppercase ß.

julia> versioninfo()
Julia Version 1.5.4
Commit 69fcb5745b (2021-03-11 19:13 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, icelake-client)
Environment:
  JULIA_PKG_DEVDIR = /Users/helge/projects/julia

julia> uppercase("spaß")
"SPAẞ" # correct

stevengj commented 2 months ago

See also:

julia> using Unicode

julia> Unicode.normalize("Spaß", casefold=true)
"spass"
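
As a rough illustration of how the case-folding call above might be used, here is a sketch of a caseless comparison built on `Unicode.normalize` with `casefold=true`. The helper name `caseless_equal` is hypothetical; the only library call is the one shown in the comment above.

```julia
using Unicode

# Hypothetical helper: compares two strings after case folding, so that
# spellings differing only in case conventions (such as ß vs. SS) compare equal.
function caseless_equal(a::AbstractString, b::AbstractString)
    return Unicode.normalize(a, casefold=true) == Unicode.normalize(b, casefold=true)
end

caseless_equal("Spaß", "SPASS")  # both sides fold to "spass"
```

Case folding is meant for comparison rather than display, so it sidesteps the question of which uppercase form is orthographically correct.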