anselmorenato / threepress

Automatically exported from code.google.com/p/threepress

Detect ISO-8859-1 encoding in files and re-encode #145

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Although an ePub is required to contain only UTF-8 or UTF-16, it's of course
possible to sneak in ISO-8859-1.  Bytes above 0x7F in that encoding are
usually not valid UTF-8, and in the live/staging environment the content is
blindly added to the database and then truncated at the first invalid
character.  The user isn't notified.

This is especially bad when it happens in the NCX or OPF file, as they
become invalid XML once they go into the database, but aren't invalid when
they come out of the ePub archive, so the initial sanity checks on upload pass.

The best outcome is probably for Bookworm to always manage to convert the
file properly before saving, although I'm not sure yet how best to do that
as this particular truncation problem doesn't happen in my local
environment (instead I get a DjangoUnicodeEncode exception immediately on
upload).
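
A minimal sketch of the "convert before saving" idea, not Bookworm's actual code. The function name `to_utf8` is hypothetical, the sketch is written for Python 3, and it assumes that any input which is neither BOM-marked UTF-16 nor valid UTF-8 is ISO-8859-1. That assumption can mis-decode other legacy encodings such as Windows-1252.

```python
import codecs


def to_utf8(raw: bytes) -> bytes:
    """Return `raw` re-encoded as UTF-8 (hypothetical helper for illustration)."""
    # UTF-16 content is expected to start with a byte-order mark;
    # decoding with "utf-16" consumes the BOM and picks the byte order.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16").encode("utf-8")
    try:
        # Strict decode: raises on any byte sequence that is not valid UTF-8.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Assumption: the only other encoding we expect is ISO-8859-1.
        # Every byte maps to a code point there, so this decode cannot fail.
        return raw.decode("iso-8859-1").encode("utf-8")
```

Running the upload through something like this before the database write would avoid storing bytes the database rejects. For the NCX and OPF files, an XML declaration that names ISO-8859-1 would also have to be updated after re-encoding, otherwise the declaration and the actual bytes disagree.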

Original issue reported on code.google.com by liza31337@gmail.com on 14 May 2009 at 11:15

GoogleCodeExporter commented 9 years ago
Filed a related issue against epubcheck, since it passes such ePubs:
http://code.google.com/p/epubcheck/issues/detail?id=34

Original comment by liza31337@gmail.com on 14 May 2009 at 11:24

GoogleCodeExporter commented 9 years ago

Original comment by liza31337@gmail.com on 19 May 2009 at 2:57

GoogleCodeExporter commented 9 years ago
These books are invalid.

Original comment by liza31337@gmail.com on 13 Nov 2009 at 5:33