Ackater / writing.com-archival

Utility for downloading Interactives from Writing.com
https://ackater.github.io/writing.com-archival

Handle writing.com's encoding #12

Open Ackater opened 5 years ago

Ackater commented 5 years ago

Titles in chapters and in outlines seem to be encoded in Latin-1, while the actual text content is correctly encoded in Unicode.

Ackater commented 5 years ago

This has left me more confused than not:

Using https://www.writing.com/main/interact/item_id/1924673-Acquiring-Powers-2/action/outline as a test

The chapter title at https://www.writing.com/main/interact/item_id/1924673-Acquiring-Powers-2/map/114331111 seems to be encoded in Windows-1252, because … does not exist in Latin-1. This page also shows the choice at the top being encoded in UTF-8.

On the other hand, the chapter title at https://www.writing.com/main/interact/item_id/1924673-Acquiring-Powers-2/map/1133322211222121221211411222211121131111111112222112222213111212231112114231121141133111212322211112211232231241212122211112232 is in UTF-8 and perfectly fine.

Should every piece be decoded as Latin-1 by default, then attempted as UTF-8, then Windows-1252, then Latin-1?!?
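One possible ordering, as a rough sketch rather than what the tool actually does: try strict UTF-8 first, since it rejects most non-UTF-8 byte sequences, then Windows-1252, and only then fall back to Latin-1, which accepts any byte. The function name `decode_with_fallback` is made up for illustration, and this assumes the raw bytes of each title are available before any decoding has happened.

```python
def decode_with_fallback(raw: bytes) -> str:
    """Decode raw bytes by trying stricter encodings first (hypothetical helper)."""
    # UTF-8 and cp1252 (Windows-1252) both raise on bytes they cannot map,
    # so a failure tells us to move on to the next candidate.
    for encoding in ("utf-8", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every possible byte to a code point, so this cannot fail.
    return raw.decode("latin-1")


# 0x85 is the ellipsis in Windows-1252 but is not valid UTF-8 on its own.
print(decode_with_fallback(b"Wait\x85"))  # Wait…
```

Putting Latin-1 first would never work as a test, because it never raises. The ordering is still a guess: short Windows-1252 strings that happen to be valid UTF-8 would be decoded as UTF-8.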

Ackater commented 4 years ago

Found another fun one: search pages will cut off a multi-byte UTF-8 character in the middle of the character when truncating text for an ellipsis.
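A possible way to cope with the cut-off character, sketched under the assumption that the truncated text arrives as raw bytes and that only the trailing character is damaged: Python's incremental UTF-8 decoder holds back an incomplete trailing sequence instead of raising when called with `final=False`. The helper name `decode_truncated_utf8` is hypothetical.

```python
import codecs


def decode_truncated_utf8(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping an incomplete character at the very end."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    # With final=False the decoder buffers a trailing partial sequence
    # rather than treating it as an error; we simply never flush it.
    return decoder.decode(raw, final=False)


# The ellipsis is three bytes in UTF-8 (E2 80 A6); here the last byte is missing.
print(decode_truncated_utf8(b"Wait\xe2\x80"))  # Wait
```

Invalid bytes in the middle of the input would still raise `UnicodeDecodeError`, so this only covers the truncation case.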