WebMemex / freeze-dry

Snapshots a web page to get it as a static, self-contained HTML document.
https://freezedry.webmemex.org
The Unlicense

Fix charset encoding of framed documents #51

Open Treora opened 4 years ago

Treora commented 4 years ago

Like issue #29, but for subdocuments inside frames. As remarked in this TODO in the code:

        get blob() { return new Blob([this.string], { type: 'text/html' }) },
        get string() {
            // TODO Add <meta charset> if absent? Or html-encode characters as needed?
            return documentOuterHTML(clonedDoc)
        },

The same applies to crawl-subresources for frames whose inner document we cannot access directly.

It seems new Blob() always UTF-8-encodes the strings it is given (per MDN). I suppose we should either add <meta charset="utf-8"> to the DOM before running documentOuterHTML, or change the blob’s MIME type to text/html;charset=utf-8. The latter is something we could not do for the top-level document; might that be ‘cleaner’?
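
A rough sketch of the two options, to make the comparison concrete. The helper names ensureUtf8MetaCharset and htmlBlobUtf8 are hypothetical and not part of freeze-dry. Replacing an existing charset declaration instead of only adding one when absent is my assumption about what would be wanted.

```javascript
// Option 1 (hypothetical helper): declare UTF-8 inside the document itself.
// `doc` is assumed to be the cloned Document that documentOuterHTML serialises.
function ensureUtf8MetaCharset(doc) {
  // Assumption: any existing declaration may conflict with the UTF-8 bytes
  // the Blob will contain, so remove it before adding our own.
  doc
    .querySelectorAll('meta[charset], meta[http-equiv="content-type" i]')
    .forEach(el => el.remove())

  const meta = doc.createElement('meta')
  meta.setAttribute('charset', 'utf-8')
  // Put it first in <head> so the parser sees it as early as possible.
  if (doc.head) doc.head.prepend(meta)
}

// Option 2 (hypothetical helper): declare UTF-8 in the blob's MIME type.
function htmlBlobUtf8(htmlString) {
  // Strings passed to the Blob constructor are encoded as UTF-8,
  // so the charset parameter just states what the bytes already are.
  return new Blob([htmlString], { type: 'text/html;charset=utf-8' })
}

// Usage of option 2; 'é' takes two bytes in UTF-8, hence a size of 9.
const exampleBlob = htmlBlobUtf8('<p>é</p>')
console.log(exampleBlob.type, exampleBlob.size)
```

Option 2 leaves the DOM untouched, but the charset only survives for as long as the MIME type travels with the content (for example in a blob or data URL). Option 1 also works if the HTML is later saved and served without that type.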

Problem observed in the wild.