Some non-ascii unicode chars are not case-folded correctly.

basvandijk / case-insensitive

Case insensitive string comparison


Some non-ascii unicode chars are not case-folded correctly. #31

Open fisx opened 3 years ago

fisx commented 3 years ago

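{-# LANGUAGE OverloadedStrings #-}
-- OverloadedStrings is needed so the literal compared against the CI value typechecks.

import qualified Data.CaseInsensitive as CI
import qualified Data.Char as Char

main :: IO ()
main = do
  -- Lowercasing U+13B2 (5042) with base yields U+AB82 (43906).
  print ((Char.toLower <$> ("\5042" :: String)) == "\43906")
  -- The same comparison through CI does not hold.
  print (CI.foldCase (CI.mk ("\5042" :: String)) == "\43906")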

{-
*Main> :main
True
False
-}

Thanks to QuickCheck! :)

fisx commented 3 years ago

Oh, interesting: there is Data.Text.toLower, and Data.Text.toCaseFold. But neither is compatible with CI:

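{-# LANGUAGE OverloadedStrings #-}
-- OverloadedStrings is needed for the Text literals and the CI comparison.

import qualified Data.CaseInsensitive as CI
import qualified Data.Text as Text
import Prelude

main :: IO ()
main = do
  -- Text.toCaseFold maps U+13B2 (5042) to U+AB82 (43906).
  print (Text.toCaseFold "\5042" == "\43906")
  -- The same comparison through CI does not hold.
  print (CI.foldCase (CI.mk ("\5042" :: String)) == "\43906")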

{-
*Main> :main
True
False
-}

pcapriotti commented 3 years ago

I think this is actually a bug in Text.toCaseFold: under Unicode case folding, Cherokee lowercase letters (e.g. U+AB82) fold to their uppercase counterparts (e.g. U+13B2). text implements this incorrectly, because the fallback case of foldMapping in https://github.com/haskell/text/blob/master/src/Data/Text/Internal/Fusion/CaseMapping.hs converts every character without an explicit folding entry to lowercase. So we get the strange (and incorrect!) behaviour that U+13B2 and U+AB82 map to each other when folding. See https://github.com/haskell/text/issues/277.
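
To make the mechanism described above concrete, here is a minimal, self-contained sketch. It is a toy model, not the code from text: explicitFolds, toyLower, foldWith, buggyFold and fixedFold are hypothetical names, and the tables cover only two Cherokee letter pairs. It assumes that a character with no explicit folding entry should map to itself, and contrasts that with a lowercasing fallback.

```haskell
module Main (main) where

import Data.Maybe (fromMaybe)

-- Toy excerpt of an explicit case-folding table (hypothetical, two entries only):
-- lowercase Cherokee letters fold to their uppercase counterparts.
explicitFolds :: [(Char, Char)]
explicitFolds = [('\xAB82', '\x13B2'), ('\xAB8A', '\x13BA')]

-- Toy stand-in for lowercasing, restricted to the same two letters.
toyLower :: Char -> Char
toyLower c = fromMaybe c (lookup c [('\x13B2', '\xAB82'), ('\x13BA', '\xAB8A')])

-- Fold one character: use the explicit table, otherwise apply the fallback.
foldWith :: (Char -> Char) -> Char -> Char
foldWith fallback c = fromMaybe (fallback c) (lookup c explicitFolds)

-- Fallback lowercases: models the behaviour described in this thread.
buggyFold :: Char -> Char
buggyFold = foldWith toyLower

-- Fallback is the identity: an assumed fix, unlisted characters stay put.
fixedFold :: Char -> Char
fixedFold = foldWith id

-- A fold is idempotent on a sample if folding twice equals folding once.
isIdempotentOn :: (Char -> Char) -> String -> Bool
isIdempotentOn f = all (\c -> f (f c) == f c)

sample :: String
sample = ['\x13B2', '\xAB82', '\x13BA', '\xAB8A']

main :: IO ()
main = do
  print (isIdempotentOn buggyFold sample) -- False: the two forms swap forever
  print (isIdempotentOn fixedFold sample) -- True: everything settles on uppercase
```

With the lowercasing fallback, U+13B2 goes to U+AB82 and the explicit entry sends it straight back, so repeated folding never reaches a fixed point. With the identity fallback, both forms settle on U+13B2 after one step.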

fisx commented 3 years ago

It gets weirder:

*Wire.API.User.RichInfo Scim Data.CaseInsensitive> foldCase ("Ꮊ" :: String)
"\43914"
*Wire.API.User.RichInfo Scim Data.CaseInsensitive> foldCase it
"\5050"
*Wire.API.User.RichInfo Scim Data.CaseInsensitive> foldCase it
"\43914"
[...]