basvandijk / case-insensitive

Case insensitive string comparison

CI ByteString is slow #23

Open winterland1989 opened 8 years ago

winterland1989 commented 8 years ago

Constructing a CI ByteString allocates pinned memory, but the ByteString is usually short, so this not only adds overhead but also contributes to heap fragmentation. I think we can do better here. Any ideas?

basvandijk commented 8 years ago

Since we have to construct a new ByteString to foldCase the original, we can't avoid asking for pinned memory.

What we could do is add an instance FoldCase ShortByteString. Care to write a PR?
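
A minimal sketch of what such an instance might look like. To keep it compilable on its own, it uses a hypothetical stand-in class `FoldCaseSketch` rather than the library's `FoldCase`, and the helper `toLowerAscii` only folds ASCII letters, which is a simplifying assumption and not necessarily what the library does. Going through `unpack`/`pack` is chosen for portability across bytestring versions, not for speed.

```haskell
import qualified Data.ByteString.Short as Short
import Data.Word (Word8)

-- Hypothetical stand-in for the library's FoldCase class,
-- so that this sketch compiles without the package itself.
class FoldCaseSketch s where
  foldCaseSketch :: s -> s

-- ASCII-only lowering (assumption made for illustration).
toLowerAscii :: Word8 -> Word8
toLowerAscii w
  | 65 <= w && w <= 90 = w + 32   -- 'A'..'Z' -> 'a'..'z'
  | otherwise          = w

-- ShortByteString lives in unpinned memory, so folding it
-- does not request a pinned buffer.
instance FoldCaseSketch Short.ShortByteString where
  foldCaseSketch = Short.pack . map toLowerAscii . Short.unpack
```

A production instance would presumably write into the byte array directly instead of building an intermediate list.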

winterland1989 commented 8 years ago

OK, I'll send one. Please reopen to track this.

BTW, what's the purpose of this rewrite rule?

{-# RULES "foldCase/ByteString" foldCase = foldCaseBS #-}

basvandijk commented 8 years ago

For some reason that RULE made the benchmark faster.

winterland1989 commented 8 years ago

What if we implemented CI using a type family? Then we could keep the original ByteString slice and do a more efficient copy into the FoldedCase ByteString. I think this is the best option, but it has some compatibility issues. What do you think?

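{-# LANGUAGE TypeFamilies #-}

import qualified Data.ByteString       as B
import qualified Data.ByteString.Lazy  as BL
import qualified Data.ByteString.Short as Short
import qualified Data.Text             as T
import qualified Data.Text.Lazy        as TL

-- Maps each string-like type to the type used to store its folded form.
type family FoldedCase a where
    FoldedCase B.ByteString = Short.ShortByteString
    FoldedCase BL.ByteString = [Short.ShortByteString]
    FoldedCase T.Text = T.Text
    FoldedCase TL.Text = TL.Text

data CI s = CI { original   :: !s -- ^ Retrieve the original string-like value.
               , foldedCase :: !(FoldedCase s) -- ^ Retrieve the case folded string-like value.
                                               --   (Also see 'foldCase').
               }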

Another reason I propose this solution is that the documentation for ShortByteString says: "It is suitable for use as an internal representation for code that needs to keep many short strings in memory, but it should not be used as an interchange type."
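
A rough sketch of how the type-family idea above could be wired up for strict ByteString only. The class `FoldCaseTo`, the functions `mk` and `ciEq`, and the ASCII-only `toLowerAscii` are hypothetical names introduced for illustration; none of them come from the library.

```haskell
{-# LANGUAGE TypeFamilies, FlexibleContexts #-}

import qualified Data.ByteString       as B
import qualified Data.ByteString.Short as Short
import Data.Word (Word8)

type family FoldedCase a where
  FoldedCase B.ByteString = Short.ShortByteString

data CI s = CI
  { original   :: !s
  , foldedCase :: !(FoldedCase s)
  }

-- Hypothetical class: fold a value into its FoldedCase representation.
class FoldCaseTo s where
  foldCaseTo :: s -> FoldedCase s

-- ASCII-only lowering (assumption made for illustration).
toLowerAscii :: Word8 -> Word8
toLowerAscii w
  | 65 <= w && w <= 90 = w + 32
  | otherwise          = w

-- The folded copy is built as a ShortByteString, so no second
-- pinned ByteString is allocated; the original slice is kept as-is.
instance FoldCaseTo B.ByteString where
  foldCaseTo = Short.pack . map toLowerAscii . B.unpack

mk :: FoldCaseTo s => s -> CI s
mk s = CI s (foldCaseTo s)

-- Comparison looks only at the folded representation.
ciEq :: Eq (FoldedCase s) => CI s -> CI s -> Bool
ciEq a b = foldedCase a == foldedCase b
```

One source of the compatibility concern is visible here: `foldedCase` no longer returns the same type as `original`, so existing callers that expect `foldedCase :: CI s -> s` would need to change.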

winterland1989 commented 8 years ago

Another approach is to provide a Data.CaseInsensitive.ByteString module which exports a specialized CIByteString type that uses ShortByteString internally. So, counting the ShortByteString instance, we have three options here.
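
A minimal sketch of what such a specialized type might look like. The field names, `mkCI`, and the ASCII-only `toLowerAscii` are assumptions made for illustration, not an existing API.

```haskell
import qualified Data.ByteString       as B
import qualified Data.ByteString.Short as Short
import Data.Word (Word8)

-- Hypothetical specialized type: the original stays a ByteString,
-- the folded copy is stored unpinned as a ShortByteString.
data CIByteString = CIByteString
  { ciOriginal :: !B.ByteString
  , ciFolded   :: !Short.ShortByteString
  }

-- ASCII-only lowering (assumption made for illustration).
toLowerAscii :: Word8 -> Word8
toLowerAscii w
  | 65 <= w && w <= 90 = w + 32
  | otherwise          = w

mkCI :: B.ByteString -> CIByteString
mkCI bs = CIByteString bs (Short.pack (map toLowerAscii (B.unpack bs)))

-- Equality and ordering ignore the original and use the folded bytes.
instance Eq CIByteString where
  a == b = ciFolded a == ciFolded b

instance Ord CIByteString where
  compare a b = compare (ciFolded a) (ciFolded b)
```

Compared with the type-family variant, this leaves the existing `CI` type untouched, at the cost of a second type that users have to opt into.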