IgnoredAmbience / yahoo-group-archiver

Scrapes and archives a Yahoo group's email archives, photo galleries and file contents using the non-public API
MIT License
93 stars 46 forks

Non-Latin character set content #90

Open eyalroz opened 4 years ago

eyalroz commented 4 years ago

I've (successfully) used the archiver script to archive my list. However, a lot of the content it downloads ends up "doubly charset-encoded": it is cp1255 content whose encoded bytes were then encoded again as UTF-8. That makes it anywhere from non-trivial to difficult to extract the actual text from those files.

Saw this in .json files in the about and calendar folders.
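
For illustration, here is a minimal sketch of how such doubly-encoded text might be repaired after the fact. It assumes the cp1255 bytes were misread as Latin-1 before being written out as UTF-8; the actual intermediate charset is not confirmed anywhere in this thread (cp1252 is another plausible candidate). The names `undo_double_encoding` and `repair` are hypothetical helpers, not part of the archiver.

```python
import json

def undo_double_encoding(text, original="cp1255", misread_as="latin-1"):
    """Reverse one round of mojibake.

    Assumption: the text was really `original` but was decoded as
    `misread_as`. Strings that do not fit this pattern are returned unchanged.
    """
    try:
        return text.encode(misread_as).decode(original)
    except UnicodeError:
        # Covers both UnicodeEncodeError and UnicodeDecodeError.
        return text

def repair(value):
    """Walk a parsed JSON value and repair every string value in it."""
    if isinstance(value, str):
        return undo_double_encoding(value)
    if isinstance(value, list):
        return [repair(v) for v in value]
    if isinstance(value, dict):
        return {k: repair(v) for k, v in value.items()}
    return value

# Simulate the reported problem: cp1255 bytes misread as Latin-1,
# then stored in a UTF-8 JSON file.
hebrew = "שלום"
garbled = hebrew.encode("cp1255").decode("latin-1")
raw = json.dumps({"title": garbled, "count": 3}, ensure_ascii=False).encode("utf-8")

fixed = repair(json.loads(raw.decode("utf-8")))
```

One caveat: a string that is legitimately Latin-1 (for example one containing "é") would also be converted, so a repair like this is only safe on files known to hold cp1255 content.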

Oh, and - thank you for this wonderful script!

IgnoredAmbience commented 4 years ago

Thanks for the report. I think I'm going to have to look more closely at the encoding information (if any) returned by the Yahoo servers.
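
As a rough sketch of what checking that encoding information could look like, the snippet below reads a charset out of a `Content-Type` header value and uses it to decode the response body. It is an assumption that the Yahoo servers declare a charset this way at all; the helper names `charset_from_content_type` and `decode_body` are hypothetical and do not come from the archiver's code.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

def decode_body(body, content_type, fallback="utf-8"):
    """Decode raw response bytes using the declared charset, else `fallback`."""
    charset = charset_from_content_type(content_type) or fallback
    return body.decode(charset)

# Example with a made-up header value; real responses may differ.
declared = charset_from_content_type("application/json; charset=windows-1255")
text = decode_body("שלום".encode("cp1255"), "text/plain; charset=windows-1255")
```

If the declared charset is one Python does not recognise, `bytes.decode` raises `LookupError`, so real code would probably want to handle that case and fall back as well.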