WebMemex / freeze-dry

Snapshots a web page to get it as a static, self-contained HTML document.
https://freezedry.webmemex.org
The Unlicense

Handle encoding of subresources #46

Open Treora opened 5 years ago


Freeze-dry messes up if a stylesheet or framed document is encoded in UTF-16, UTF-32, or possibly other encodings. We use FileReader.readAsText to decode these resources, which by default assumes UTF-8 encoding. This assumption is adequate most of the time, but when it isn’t, the resource is effectively unreadable.

I do not know enough about the standards, but I suppose the decoder should look at the HTTP Content-Type header, the file’s byte order mark (BOM), and in-document declarations (@charset in CSS, <meta charset=…> in HTML).
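
To make the idea concrete, here is a rough sketch of such a detection chain: BOM first, then the Content-Type charset, then an in-document declaration, falling back to UTF-8. The order is my reading of how browsers prioritise these sources, not something verified against the specs. All function names are hypothetical and not part of freeze-dry. The regexes are much simpler than the real CSS and HTML sniffing algorithms. It also assumes the resource has been read as raw bytes (e.g. via readAsArrayBuffer) rather than as text.

```typescript
// Sketch only: hypothetical helpers, not freeze-dry's actual API.

/** Return the encoding implied by a byte order mark, or null if there is none. */
function sniffBom(bytes: Uint8Array): string | null {
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return "utf-8";
  }
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) return "utf-16be";
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) return "utf-16le";
  return null; // UTF-32 is not handled here; TextDecoder does not support it.
}

/** Extract the charset parameter from a Content-Type header value, if present. */
function charsetFromContentType(contentType: string | null): string | null {
  if (!contentType) return null;
  const match = /;\s*charset\s*=\s*"?([^\s;"]+)/i.exec(contentType);
  return match ? match[1].toLowerCase() : null;
}

/** Look for @charset (CSS) or <meta ... charset=...> (HTML) in the first 1024 bytes. */
function charsetFromDeclaration(bytes: Uint8Array): string | null {
  // Map each byte to one character; the declarations themselves are plain ASCII.
  let head = "";
  const limit = Math.min(bytes.length, 1024);
  for (let i = 0; i < limit; i++) head += String.fromCharCode(bytes[i]);

  const css = /^@charset "([^"]+)";/.exec(head);
  if (css) return css[1].toLowerCase();
  const html = /<meta[^>]*charset\s*=\s*["']?([\w-]+)/i.exec(head);
  return html ? html[1].toLowerCase() : null;
}

/** Pick an encoding label: BOM, then header, then declaration, then UTF-8. */
function detectEncoding(bytes: Uint8Array, contentType: string | null): string {
  return (
    sniffBom(bytes) ||
    charsetFromContentType(contentType) ||
    charsetFromDeclaration(bytes) ||
    "utf-8"
  );
}

/** Decode the bytes with the detected encoding, falling back to UTF-8. */
function decodeResource(bytes: Uint8Array, contentType: string | null): string {
  try {
    return new TextDecoder(detectEncoding(bytes, contentType)).decode(bytes);
  } catch (e) {
    // TextDecoder throws a RangeError for labels it does not recognise.
    return new TextDecoder("utf-8").decode(bytes);
  }
}
```

The bytes and header could come from a fetch response (`new Uint8Array(await response.arrayBuffer())` and `response.headers.get("Content-Type")`). One known gap: a UTF-16 resource with neither a BOM nor a charset header will not be detected, because the declaration scan assumes an ASCII-compatible encoding.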

This detection-and-decoding issue seems so generic that it should not have to burden this repo, but I have not yet discovered the right tool. Some options I thought of:

Tips welcome.

Note that this issue is similar to issue #29, but that one concerns the DOM that the browser has already decoded for us; this issue is about subresources we fetch.