WebMemex / freeze-dry

Snapshots a web page to get it as a static, self-contained HTML document.
https://freezedry.webmemex.org
The Unlicense
271 stars, 18 forks

Inline iframe contents #3

Closed Treora closed 6 years ago

Treora commented 7 years ago

Freeze-dry could be run recursively on iframes. Iframe contents can probably be put as a string in the srcdoc attribute.
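As a rough illustration of the srcdoc idea, the sketch below embeds an already-snapshotted frame as an attribute value. It is not freeze-dry's code: the helper names `escapeAttribute` and `inlineAsSrcdoc` are hypothetical, and it assumes the frame's HTML is available as a string and that the attribute is double-quoted, so only `&` and `"` need escaping.

```javascript
// Hypothetical helper: escape a string for use inside a double-quoted HTML attribute.
function escapeAttribute(value) {
  // Escape ampersands first, so the entities added for quotes are not escaped again.
  return value.replace(/&/g, '&amp;').replace(/"/g, '&quot;');
}

// Hypothetical helper: wrap a frame's serialised HTML in an iframe's srcdoc attribute.
function inlineAsSrcdoc(frameHtml) {
  return `<iframe srcdoc="${escapeAttribute(frameHtml)}"></iframe>`;
}

console.log(inlineAsSrcdoc('<p class="a">Tom & Jerry</p>'));
```

Running this recursively would mean each nesting level escapes the already-escaped content of the level below it, which the browser undoes one level at a time when parsing.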

Although deprecated, it would be nice to still support <frame>s too; they don't support srcdoc though, so we should try putting contents as a data URL in the src attribute.
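For the <frame> case, a data URL could be built roughly as follows. This is only a sketch under assumptions: the helper names `htmlToDataUrl` and `inlineAsFrameSrc` are made up for illustration, the frame's HTML is assumed to be available as a string, and whether browsers render such a URL inside a <frame> would still need to be tried.

```javascript
// Hypothetical helper: turn an HTML string into a percent-encoded data URL.
function htmlToDataUrl(html) {
  // encodeURIComponent also encodes '"' and '&', so the result is safe
  // to place inside a double-quoted attribute without further escaping.
  return 'data:text/html;charset=utf-8,' + encodeURIComponent(html);
}

// Hypothetical helper: emit a <frame> element whose src carries the content itself.
function inlineAsFrameSrc(frameHtml) {
  return `<frame src="${htmlToDataUrl(frameHtml)}">`;
}

console.log(inlineAsFrameSrc('<p>hi</p>'));
```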

erikrose commented 6 years ago

I'm moderately interested in having this happen. Talking with Ian Bicking, who has a similar project, he mentioned that there are WebExtension permission issues around getting hold of iframe contents (the CSS in particular), but that using a frame script might help. I mention this to save time for anybody who pursues this.

Treora commented 6 years ago

My initial experiments also revealed some complications. In a browser extension, we could (and might have to) resort to running a content script in every frame to grab that frame's contents, then pass the pieces around and assemble them into a single html string. I am not sure about the best architecture for this.
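One possible shape for the assembly step is sketched below. Everything in it is an assumption made for illustration: each frame's capture is modelled as `{ html, frames }`, where `html` contains a made-up placeholder such as `{{frame:0}}` wherever a child frame's content belongs, and `frames` holds the child captures in the same order. How the pieces would actually be passed between content scripts is left open.

```javascript
// Hypothetical helper: escape a string for a double-quoted HTML attribute.
function escapeAttribute(value) {
  return value.replace(/&/g, '&amp;').replace(/"/g, '&quot;');
}

// Hypothetical assembly step: recursively substitute each child frame's
// assembled HTML (escaped for srcdoc) into its parent's placeholder.
function assemble(capture) {
  return capture.html.replace(/\{\{frame:(\d+)\}\}/g, (match, index) => {
    const child = capture.frames[Number(index)];
    // Leave the placeholder untouched if that frame's capture is missing.
    return child ? escapeAttribute(assemble(child)) : match;
  });
}

const inner = { html: '<p title="x">hi</p>', frames: [] };
const outer = { html: '<iframe srcdoc="{{frame:0}}"></iframe>', frames: [inner] };
console.log(assemble(outer));
```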

If possible, however, I would love to make this work without requiring a WebExtension, so that, for example, a web page could create a snapshot of itself (assuming it can access its frames' content).

Regarding @ianb's work: I am still planning to scrutinise his approach to making static HTML (https://github.com/ianb/pagearchive/blob/01f832583380309ec167c77f9af61e6f0af8f6aa/extension/make-static-html.js). Whereas freeze-dry currently clones the whole DOM and then modifies it, his approach is to pick only the tags and attributes that are listed explicitly. It may be worth comparing the pros and cons of these approaches in a separate issue.

erikrose commented 6 years ago

I think one of the things he was proudest of is his spidering of CSS @import statements. But that's just from my fuzzy recollection.

Treora commented 6 years ago

Implemented in 0.2.0. :)

Browser extensions and other callers still need a way to customise how the iframe contents are captured, in case direct access to .contentDocument is prohibited (by the same-origin policy). In any case, it now falls back to refetching the iframe's source HTML and using that instead of the currently rendered iframe content. (edit: see #24)
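The fallback described here could be pictured roughly like this. It is a sketch only, not the code in 0.2.0: `describeFrameCapture` is a hypothetical name, and it merely decides which source to use, leaving the actual refetching to the caller. It relies on the fact that `contentDocument` is null when the frame's document is not accessible.

```javascript
// Hypothetical helper: decide whether a frame's rendered DOM can be used,
// or whether its source HTML has to be refetched from its URL.
function describeFrameCapture(iframe) {
  let doc = null;
  try {
    // For an inaccessible (e.g. cross-origin) frame, contentDocument is null.
    doc = iframe.contentDocument;
  } catch (error) {
    doc = null; // Treat any access error the same as "no access".
  }
  if (doc && doc.documentElement) {
    return { kind: 'rendered', html: doc.documentElement.outerHTML };
  }
  return { kind: 'refetch', url: iframe.src };
}

// Plain objects stand in for iframe elements here.
const blocked = { contentDocument: null, src: 'https://example.org/frame.html' };
console.log(describeFrameCapture(blocked));
```

Note that a refetched document reflects what the server returns now, not what was rendered in the frame, so any changes made by scripts would be missing.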

erikrose commented 6 years ago

Hooray! This is a great step forward, and I'll soon update FathomFox's version of freeze-dry to enjoy it. :-D

erikrose commented 6 years ago

I don't seem to actually get inlined iframe contents in 0.2.0. Instead, the src attribute still points to an external server, like this:

<iframe style="border: 0px none; vertical-align: bottom;" src="https://tpc.googlesyndication.com/safeframe/1-0-29/html/container.html" id="google_ads_iframe_/74268401/UG_ATF_728_0" title="3rd party ad content" name="" scrolling="no" marginwidth="0" marginheight="0" data-is-safeframe="true" sandbox="allow-forms allow-pointer-lock allow-popups allow-popups-to-escape-sandbox allow-same-origin allow-scripts allow-top-navigation-by-user-activation" width="728" height="90" frameborder="0">

(The source page for the above was https://tabs.ultimate-guitar.com/tab/johann_sebastian_bach/sleepers_awake_tabs_877865.) Do you have any ideas?