chrra / iCalendar

iCalendar data types, parser, and printer.
BSD 3-Clause "New" or "Revised" License

Fails on nordic UTF8 characters #38

Open fegu opened 4 years ago

fegu commented 4 years ago

When trying to parse this UTF-8 line with the default decoding (also UTF-8):

ATTENDEE;ROLE=REQ-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=TRUE;CN=Øyvind:mailto:oyvind@somedomain.no

it fails with

Left (line 24, column 67):\nunexpected \"\152\"\nexpecting \"\r\", \"\n\", ',', ';' or ':'"

The Nordic Ø is correctly UTF-8-encoded here as \195\152, and the parser chokes on \152.

My quick fix for now, since we don't actually use the names for anything in our application, is to search/replace the bytestring before parsing.
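One way such a pre-parse replacement could look is sketched below. This is not necessarily what the commenter did: `asciiOnly` is a hypothetical helper, not part of iCalendar, and it assumes the non-ASCII text (such as attendee names) is not needed afterwards, since the replacement is lossy.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Hypothetical helper (not part of iCalendar): replace every non-ASCII byte
-- with '?' (0x3F) so a byte-wise parser never sees a UTF-8 continuation byte.
-- Lossy: a two-byte character such as 'Ø' becomes "??".
asciiOnly :: BL.ByteString -> BL.ByteString
asciiOnly = BL.map scrub
  where
    scrub :: Word8 -> Word8
    scrub b
      | b >= 0x80 = 0x3F
      | otherwise = b

main :: IO ()
main = do
  -- "CN=Øy" with 'Ø' written out as its UTF-8 bytes 0xC3 0x98.
  let raw = BL.pack [0x43, 0x4E, 0x3D, 0xC3, 0x98, 0x79]
  print (BL.unpack (asciiOnly raw))
```

The scrubbed bytestring would then be passed to the parser in place of the original input.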

MasseR commented 1 year ago

I'm pretty sure this is because the parser consumes one Word8 at a time (takeWhile1 isSafe). The isSafe function is defined in terms of Data.Char.isControl, and it so happens that, for example, 'Ä' is encoded as 0xc3 0x84, and 0x84 is considered a control character.

https://github.com/chrra/iCalendar/blob/master/Text/ICalendar/Parser/Content.hs

This can be reproduced with this small snippet:

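import qualified Data.ByteString.Lazy as B
import           Data.Char            (isControl)
import qualified Text.Parsec          as P

-- Parse the file byte by byte and pair each byte (viewed as a Char) with isControl.
main :: IO ()
main = do
  contents <- B.readFile "test"
  let Right x = P.parse (map (\c -> (c,isControl c)) <$> P.many P.anyChar) "test" contents
  mapM_ print x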

With the contents of test being AÄAaäa.

> :main
('A',False)
('\195',False)
('\132',True)
('A',False)
('a',False)
('\195',False)
('\164',False)
('a',False)
('\n',True)
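
The same byte-level behaviour can be checked without a parser or an input file. The following is a minimal sketch using only base; `byteAsChar` and `classify` are names made up for illustration. It views each UTF-8 byte as a Char, the way a byte-wise parser does, and asks `isControl` about it. The byte sequences for 'Ø' and 'Ä' are written out by hand.

```haskell
import Data.Char (chr, isControl)
import Data.Word (Word8)

-- UTF-8 encodings written out by hand: 'Ø' is U+00D8, 'Ä' is U+00C4.
oSlashBytes, aUmlautBytes :: [Word8]
oSlashBytes  = [0xC3, 0x98]
aUmlautBytes = [0xC3, 0x84]

-- View a single byte as a Char, as a byte-wise parser effectively does.
byteAsChar :: Word8 -> Char
byteAsChar = chr . fromIntegral

-- Pair each byte with the isControl verdict for its Char view.
classify :: [Word8] -> [(Word8, Bool)]
classify = map (\b -> (b, isControl (byteAsChar b)))

main :: IO ()
main = do
  print (classify oSlashBytes)
  print (classify aUmlautBytes)
  -- The decoded characters themselves are not control characters.
  print (isControl '\xD8', isControl '\xC4')
```

The second byte of each sequence falls in the range U+0080 to U+009F when viewed as a Char, which `isControl` reports as control characters. The second byte of 'ä' (0xA4, shown as '\164' in the output above) lies outside that range, which is why the lowercase letter passes.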

wkoiking commented 1 year ago

Indeed, I was able to mitigate this issue by changing the definition of TextParser in Calendar/Text/ICalendar/Parser/Common.hs

from

type TextParser = P.Parsec ByteString DecodingFunctions

to

import qualified Text.Parsec.Text.Lazy  as TP
type TextParser = TP.Parser

and fixing all the resulting type errors (mostly just changing ByteString to Text).

I do not know why a ByteString parser was used originally instead of a Text parser, so this change might cause unexpected bugs elsewhere, but at least this issue is resolved.
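
To illustrate why parsing Text avoids the problem, here is a minimal sketch, assuming the input is valid UTF-8 and is decoded before any per-character check runs. `classifyDecoded` is a hypothetical name for this example and is not taken from the iCalendar source.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Char (isControl)
import qualified Data.Text.Lazy as TL
import Data.Text.Lazy.Encoding (decodeUtf8')

-- Decode first, then inspect characters. After decoding, 'Ø' is a single
-- Char (U+00D8) rather than the two bytes 0xC3 0x98.
classifyDecoded :: BL.ByteString -> Either String [(Char, Bool)]
classifyDecoded bytes =
  case decodeUtf8' bytes of
    Left err  -> Left ("invalid UTF-8: " ++ show err)
    Right txt -> Right [(c, isControl c) | c <- TL.unpack txt]

main :: IO ()
main = do
  -- "CN=Øy" with 'Ø' written out as its UTF-8 bytes.
  let sample = BL.pack [0x43, 0x4E, 0x3D, 0xC3, 0x98, 0x79]
  either putStrLn (mapM_ print) (classifyDecoded sample)
```

With the decoded input, no element of the result is flagged as a control character, so a check built on isControl no longer rejects the name.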