aparrish / gutenberg-dammit

I wanted all of plaintext Project Gutenberg in an easy-to-use format, so I made this
211 stars, 14 forks

at least one file's utf-8 encoding is wrong, presumably more? #7

Open mlc opened 6 years ago

mlc commented 6 years ago

Hi, thanks for this excellent work!

I suspect it's not an isolated incident, but don't presently have anything beyond a single anecdote:

Anyway that's the data I have for now…

aparrish commented 6 years ago

oof, thanks for the data point, I'll look into it when I get a sec.

mlc commented 6 years ago

Found another one: Τα Γεωργικά is fine on the Gutenberg website, but double-utf8-encoded in gutenberg-dammit (if you recode it from utf-8 to "latin-1" you end up with valid-looking utf8 Greek text).
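
For anyone hitting the same thing, the round trip described above can be sketched in a few lines of Python. This is only an illustration: `fix_double_utf8` is a hypothetical helper name, not something gutenberg-dammit provides, and it assumes the damage is exactly one extra UTF-8 pass over bytes that were misread as Latin-1.

```python
def fix_double_utf8(text: str) -> str:
    """Try to undo one round of double UTF-8 encoding.

    Hypothetical helper, not part of gutenberg-dammit. Assumes the text was
    produced by reading UTF-8 bytes as Latin-1 and encoding them to UTF-8 again.
    """
    try:
        # Map each character back to the single byte it came from,
        # then decode those bytes as the UTF-8 they originally were.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1 (already proper
        # Unicode) or the bytes are not valid UTF-8: leave it untouched.
        return text


# Simulate the damage on a known-good title, then repair it.
original = "Τα Γεωργικά"
mangled = original.encode("utf-8").decode("latin-1")
repaired = fix_double_utf8(mangled)
```

Text that is already correct should pass through unchanged, since correctly decoded Greek cannot be encoded as Latin-1 and accented Latin-1 text such as "café" does not form valid UTF-8. A file that was mis-decoded as Windows-1252 rather than Latin-1 would need a different codec in the first step.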

aparrish commented 6 years ago

hi—just to verify, are you using the latest version (002)? I do know for sure that the original version had messed up encodings, which is why I did a second release.

mlc commented 6 years ago

Yup, I downloaded a fresh copy of the archive just now, and manually inspected the relevant files, in order to triple-check that these two problems still exist.

The bot I wrote using your corpus has posted 267 times as of now, so with two misencoded files found, that's a rate of about 0.7% — certainly not bad at all.

Thanks again!

aparrish commented 6 years ago

great, thank you for checking! I'll fix in the next release.