diagrams / diagrams-haddock

Preprocessor for including inline diagrams in Haddock documentation

diagrams-haddock does not properly roundtrip non-ASCII UTF-8 #8

Closed: byorgey closed this issue 11 years ago

byorgey commented 11 years ago

Here's a minimal test case:

archimedes :: src/diagrams/haddock » cat > TestUTF8.hs
-- á
-- <<foo#diagram=foo&width=100>>
-- > foo = circle 1
archimedes :: src/diagrams/haddock » head -n 1 TestUTF8.hs | hexdump -C
00000000  2d 2d 20 c3 a1 0a                                 |-- ...|
00000006
archimedes :: src/diagrams/haddock » diagrams-haddock TestUTF8.hs
archimedes :: src/diagrams/haddock » head -n 1 TestUTF8.hs | hexdump -C
00000000  2d 2d 20 e1 0a                                    |-- ..|
00000005

Note how the two bytes c3 a1 (the UTF-8 encoding of á) turn into the single byte e1 (the ISO-8859-1 encoding of á). And it completely barfs on something outside of ISO-8859-1: the whole character is reduced to a single junk byte. E.g., let's try a snowman:

archimedes :: src/diagrams/haddock » cat > TestUTF8.hs
-- ☃
-- <<foo#diagram=foo&width=100>>
-- > foo = circle 1
archimedes :: src/diagrams/haddock » head -n 1 TestUTF8.hs | hexdump -C
00000000  2d 2d 20 e2 98 83 0a                              |-- ....|
00000007
archimedes :: src/diagrams/haddock » diagrams-haddock TestUTF8.hs
archimedes :: src/diagrams/haddock » head -n 1 TestUTF8.hs | hexdump -C
00000000  2d 2d 20 03 0a                                    |-- ..|
00000005

byorgey commented 11 years ago

Progress: the bug goes away when replacing System.IO.Cautious.writeFile with Prelude.writeFile. So it seems the cautious-file package is somehow the culprit.

byorgey commented 11 years ago

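Oh, naughty naughty cautious-file! It uses Data.ByteString.Lazy.Char8.pack to turn a String into a ByteString before writing it out to the temporary file. =( This is very much the Wrong Thing (tm) to do: Char8.pack truncates every Char to its low 8 bits, which is exactly what the test cases show. á (U+00E1) comes out as e1 and ☃ (U+2603) comes out as 03.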
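To make the failure concrete, here is a small sketch that reproduces the bytes from the hexdumps without going through cautious-file at all. It compares Data.ByteString.Lazy.Char8.pack (which keeps only the low 8 bits of each Char) against a UTF-8 encoding done via lazy Text. The helper names truncatedBytes, utf8Bytes, and hex are made up for this illustration.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE
import Data.Word (Word8)
import Text.Printf (printf)

-- Char8.pack keeps only the low 8 bits of each Char (hypothetical helper name).
truncatedBytes :: String -> [Word8]
truncatedBytes = BL.unpack . BLC.pack

-- A proper UTF-8 encoding of the same String, via lazy Text.
utf8Bytes :: String -> [Word8]
utf8Bytes = BL.unpack . TLE.encodeUtf8 . TL.pack

-- Render bytes roughly the way hexdump does, e.g. "c3 a1".
hex :: [Word8] -> String
hex = unwords . map (printf "%02x")

main :: IO ()
main = do
  let aAcute  = "\xe1"    -- á, U+00E1
      snowman = "\x2603"  -- ☃, U+2603
  putStrLn (hex (truncatedBytes aAcute))   -- e1
  putStrLn (hex (utf8Bytes aAcute))        -- c3 a1
  putStrLn (hex (truncatedBytes snowman))  -- 03
  putStrLn (hex (utf8Bytes snowman))       -- e2 98 83
```

The first and third lines match the corrupted output in the test cases; the second and fourth match the original file contents.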

So it seems we have a few options:

(1) Stop using cautious-file and just hope everything goes OK using writeFile directly.
(2) Do the proper encoding ourselves (maybe using the encoding package? I am not sure what the best/canonical package is for doing this kind of thing), and call System.IO.Cautious.writeFileL (which takes a lazy ByteString).
(3) File a bug against cautious-file and wait for a new release.
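Option (2) might look roughly like the sketch below. It is only an illustration, not what diagrams-haddock actually does: writeFileUtf8With is a hypothetical helper, the encoding is done with Data.Text.Lazy.Encoding rather than the encoding package, and Data.ByteString.Lazy.writeFile stands in for System.IO.Cautious.writeFileL so the example depends only on bytestring and text. The output file name is also made up.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Hypothetical helper: encode a String as UTF-8 ourselves, then hand the
-- bytes to any writer that accepts a lazy ByteString.
writeFileUtf8With
  :: (FilePath -> BL.ByteString -> IO ())  -- e.g. System.IO.Cautious.writeFileL
  -> FilePath
  -> String
  -> IO ()
writeFileUtf8With writer path = writer path . TLE.encodeUtf8 . TL.pack

main :: IO ()
main = do
  -- BL.writeFile is a stand-in for the cautious writer (assumption).
  let path = "TestUTF8-sketch.txt"
  writeFileUtf8With BL.writeFile path "-- \x2603\n"
  bytes <- BL.readFile path
  -- The snowman survives as e2 98 83 (226, 152, 131 in decimal).
  print (BL.unpack bytes)  -- [45,45,32,226,152,131,10]
```

Taking the writer as a parameter keeps the encoding step independent of which file-writing function ends up being used.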

In any case I will definitely report a cautious-file bug.

byorgey commented 11 years ago

For the record, it seems the modern way to encode a String into UTF-8 is via Text: Data.Text.Encoding.encodeUtf8 . Data.Text.pack.
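As a quick check of that composition, the sketch below encodes a String through strict Text and decodes it back again. Note that Data.Text.Encoding.encodeUtf8 yields a strict ByteString, so something like Data.ByteString.Lazy.fromStrict would be needed before passing it to a function that wants a lazy one. The names encodeString and decodeString are invented for this example.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- The composition named above: String -> strict UTF-8 ByteString.
encodeString :: String -> BS.ByteString
encodeString = TE.encodeUtf8 . T.pack

-- Inverse direction, used only to check the round trip. decodeUtf8 throws
-- on invalid UTF-8, which cannot occur for bytes produced by encodeUtf8.
decodeString :: BS.ByteString -> String
decodeString = T.unpack . TE.decodeUtf8

main :: IO ()
main = do
  let original = "-- \xe1 \x2603"          -- "-- á ☃"
      strict   = encodeString original
      lazy     = BL.fromStrict strict      -- lazy form, as writeFileL takes
  print (BS.length strict)                 -- 9 (3 + 2 + 1 + 3 bytes)
  print (BL.length lazy)                   -- 9
  print (decodeString strict == original)  -- True
```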