WebMemex / freeze-dry

Snapshots a web page to get it as a static, self-contained HTML document.
https://freezedry.webmemex.org
The Unlicense

Handle charset encoding declaration #29

Closed Treora closed 5 years ago

Treora commented 6 years ago

The document may have a <meta charset="..."> tag in its <head>, but that declaration becomes obsolete because we work with the parsed document and later stringify it again. I suppose we could/should delete it from the DOM when capturing the page.

Conversely, we may want to add the appropriate <meta charset="..."> tag to the snapshot; but this seems a task for the application invoking freeze-dry, as we do not know in which encoding the application will store the string.

We could thus..

Treora commented 5 years ago

Resolved in commit cefd79c, which adds an encoding declaration as requested by the user (the second option above), while presumptively defaulting to utf-8. My reasoning, as put in the commit message:

Since we return a string, how the user will encode that string should
ideally not matter to us. However, as HTML has the remarkable approach
of declaring the encoding somewhere inside the string, the user would
need to parse part of the DOM again to insert the declaration at the
right spot. If the user already knows how it will encode the string
afterward, I suppose we can help by inserting the declaration already.

In any case, we should remove any encoding declarations that the page
originally had, because the file is always reencoded.

Regarding the default action, an intuitive behaviour would be to not add
any meta tag. But because utf-8 is the most widespread and officially
recommended encoding for web documents, and also because many javascript
APIs use it as the default (or only) encoding (e.g. the Blob
constructor), it feels like a helpful default.
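
The behaviour described in the commit message could be sketched roughly as below. This is only an illustration, not the code from the commit: the function name `setCharsetDeclaration` and its `charset` parameter are hypothetical, and it uses regular expressions on a string purely to stay runnable without a DOM, whereas freeze-dry works on the parsed document. The regular expressions are deliberately simple and would mishandle unusual markup.

```javascript
// Hypothetical helper for illustration; not freeze-dry's actual API.
// Removes any encoding declaration the page had, then optionally adds a new one.
function setCharsetDeclaration(html, charset = 'utf-8') {
  // Matches both <meta charset="..."> and
  // <meta http-equiv="Content-Type" content="text/html; charset=...">.
  const withoutOld = html.replace(/<meta\b[^>]*\bcharset\s*=[^>]*>\s*/gi, '');

  // Passing null or an empty string means: add no declaration at all.
  if (!charset) return withoutOld;

  // Insert the new declaration directly after the opening <head> tag,
  // so that it appears as early in the document as possible.
  const declaration = `<meta charset="${charset}">`;
  return withoutOld.replace(/<head\b[^>]*>/i, (headTag) => headTag + declaration);
}

const page =
  '<html><head><meta charset="iso-8859-1"><title>Example</title></head><body>Hi</body></html>';
console.log(setCharsetDeclaration(page));
// <html><head><meta charset="utf-8"><title>Example</title></head><body>Hi</body></html>
```

An application that stores the string in another encoding would pass that encoding's name instead of relying on the default, or pass `null` to get a snapshot without any declaration.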

I suppose that snapshots have so often worked fine so far simply because many web pages have a utf-8 declaration which we did not remove, while applications (at least the WebMemex browser extension) also use utf-8 encoding.
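
As a small check of the claim that common JavaScript APIs encode strings as utf-8, the following sketch can be run in a browser console or in Node.js 18+ (assumed here because it relies on the global `Blob` and `TextEncoder`). The `snapshot` string is made up for the example.

```javascript
// A made-up snapshot string containing one non-ASCII character (é, U+00E9).
const snapshot =
  '<html><head><meta charset="utf-8"></head><body>caf\u00e9</body></html>';

// TextEncoder only supports UTF-8: é becomes the two bytes c3 a9.
const bytes = new TextEncoder().encode('\u00e9');
console.log(Array.from(bytes, (b) => b.toString(16)).join(' ')); // c3 a9

// Blob also encodes string parts as UTF-8, so the blob is one byte
// longer than the string has characters.
const blob = new Blob([snapshot], { type: 'text/html' });
console.log(blob.size - snapshot.length); // 1
```

A snapshot stored this way is utf-8 on disk regardless of what the original page declared, which is why a leftover utf-8 declaration happened to be correct.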