'Freeze-dry' web page archiving

Treora commented 7 years ago

Background

People should be able to keep web pages they have visited. The usual practice of just keeping the URL of the page (e.g. when bookmarking) is problematic. Pages are served in a single remote place, and have no versioning information, so one has to rely on that single authority to keep a document available. Browsers have always had a 'save page as..' option, but it has had too little love. There are different ways to approach web page archival (e.g. recording all transactions), but a simple one that can be achieved in a browser extension is to save the rendered DOM as carefully as possible.

Scope

This issue is about making a script that takes a 'live' web page, i.e. a page that is currently displayed in a tab in a browser, and converts it to a single, static html file without external dependencies ('freeze-drying' it, though I'm open to other name suggestions). Opening the page in a normal browser should not trigger any connections to the outside world, while displaying the page as accurately as possible. Some things it should take care of:

Images are to be inlined as data: URIs.
External stylesheets can be collected and nested in <style> elements.
Scripts, including event handlers in element attributes, have to be removed. (in case somebody figures out a way to allow simple&safe scripts, that would be great, but it seems far out of scope)
Other embeds, objects, and some types of links may have to be removed or rewritten.
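
To illustrate the script-removal item in the list above, here is a minimal sketch. The names `isEventHandlerAttribute` and `removeScripts` are made up for this example, and the rule that every attribute starting with "on" is a handler is an assumption that deliberately errs towards removing too much. It does not cover javascript: URLs in links, which would need separate handling.

```javascript
// Returns true for attribute names such as "onclick" or "onload".
// Assumption: any attribute name starting with "on" counts as a handler,
// so this may remove more than strictly necessary.
function isEventHandlerAttribute(name) {
  return /^on[a-z]+$/i.test(name);
}

// Removes <script> elements and inline event handlers from a document (or
// from any object offering the same querySelectorAll interface).
// It mutates its argument, so pass a clone if the live page must keep working.
function removeScripts(doc) {
  for (const script of Array.from(doc.querySelectorAll('script'))) {
    script.remove();
  }
  for (const element of Array.from(doc.querySelectorAll('*'))) {
    // Copy the attribute list first, because it changes while we remove items.
    for (const attribute of Array.from(element.attributes)) {
      if (isEventHandlerAttribute(attribute.name)) {
        element.removeAttribute(attribute.name);
      }
    }
  }
  return doc;
}
```

In a page this could be called as `removeScripts(document.cloneNode(true))`, leaving the displayed document untouched.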

In the process, metadata should be added to each inlined or rewritten element to keep a reference to the origins. For example, the original src attribute of an img tag would be moved to another attribute, perhaps using RDFa and a standard vocabulary: <img src="data:image/png,....." rel="dc:source" resource="http://example.org/original_location.png" />. The exact choice of what&how to store could be decided/improved later (at least we should also register the date of retrieval). The document as a whole should in the same manner get some appropriate metadata to inform about its origin.
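
A rough sketch of what inlining one image with such metadata could look like. The function names `toDataUri` and `inlineImage` are hypothetical, the attribute choice simply follows the `<img>` example above, and recording the retrieval date is left out because its format is still undecided.

```javascript
// Encode raw bytes (a Uint8Array or an array of numbers) as a base64 data: URI.
// btoa is available in browsers and in recent versions of Node.js.
function toDataUri(bytes, mimeType) {
  let binary = '';
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  return `data:${mimeType};base64,${btoa(binary)}`;
}

// Replace an image's src with inlined data, keeping the original location in
// RDFa-style attributes. `originalUrl` should be the absolute URL that the
// bytes were fetched from.
function inlineImage(img, bytes, mimeType, originalUrl) {
  img.setAttribute('src', toDataUri(bytes, mimeType));
  img.setAttribute('rel', 'dc:source');
  img.setAttribute('resource', originalUrl);
  // A leftover srcset could still trigger network requests, so drop it.
  img.removeAttribute('srcset');
  return img;
}
```

In a page, the bytes could be obtained by fetching the image URL again and reading the response as an ArrayBuffer, subject to the usual cross-origin restrictions.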

The script could be created as a separate module, in a separate repo, to make it usable in other contexts. Assuming it would not require any APIs specific to browser extensions, it could be run on other platforms or be added to pages so that they can archive themselves. I suppose the module could provide a single function freezeDry(dom=window.document, options={}) that returns the new html file (as a string/Blob/DOM object?).
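
To make the proposed signature a bit more concrete, here is a minimal sketch that returns a string. It is only one possible shape: `options.steps` is an invented option, and the function is kept synchronous for brevity, whereas anything that fetches images or stylesheets would have to be asynchronous.

```javascript
// Hypothetical shape of the module's single entry point.
// `options.steps` is an invented option: a list of functions that each
// receive the cloned root element and modify it in place.
function freezeDry(doc = window.document, options = {}) {
  const { steps = [] } = options;
  // Work on a deep copy so the page the user is looking at stays intact.
  const root = doc.documentElement.cloneNode(true);
  for (const step of steps) {
    step(root);
  }
  // Serialise to a string. The doctype is not part of outerHTML and is
  // left out in this sketch.
  return root.outerHTML;
}
```

Because the default for `doc` is only evaluated when no argument is passed, the same function could also be given a document parsed elsewhere, outside a browser tab.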

Prior art

There are at least two notable browser extensions that perform a similar archiving procedure, so some tricks may be borrowed from them: SingleFile (see especially docprocessor.js), and Scrapbook (I found no repo of it online, so download and unzip the addon to get its source code, then look at chrome/content/scrapbook/saver.js). If anyone knows of more noteworthy or (ideally) reusable code, let me know.

blackforestboi commented 7 years ago

Could it be that SingleFile is not open source? At least they share no license.

I also found the MHTML format: https://en.wikipedia.org/wiki/MHTML It is supported by all browsers.

Mozilla also has a project/addon:

https://addons.mozilla.org/en-US/firefox/addon/mozilla-archive-format/

Treora commented 7 years ago

Could it be that SingleFile is not open source? At least they share no license.

It is GPLv3 licensed.

I forgot about MHTML altogether; I like how it allows bundling multiple files while giving space for their original URLs and other metadata. It seems badly supported however, and without support it cannot be interpreted as a normal html file (it is formatted as an email!), so using data: URIs plus some RDFa seems a better approach to me.

blackforestboi commented 7 years ago

It is GPLv3 licensed.

Oh :) not where I expected that file.

It seems badly supported however, and without support it cannot be interpreted as a normal html file (it is formatted as an email!), so using data: URIs plus some RDFa seems a better approach to me.

What a pity. Thanks for explaining.

mangalutsav commented 7 years ago

Can't we use PDFs? They are supported by all browsers and are easy to read and edit.

ikreymer commented 7 years ago

Hi @Treora, glad to see this project taking off! We chatted not too long ago about possible collaboration.

I just wanted to throw in a few suggestions/thoughts about formats. In the web archiving world and with the Webrecorder project, the WARC format is pretty standard. There is also the HAR format (http://www.softwareishard.com/blog/har-12-spec/), which is supported by at least Chrome and Firefox. You may be able to access these HAR export tools from browser extension APIs, e.g.: https://developer.chrome.com/extensions/devtools_network

These formats are designed for storing the transactions (and there is now a tool for converting HAR->WARC). In Webrecorder, we also have a "Static Snapshot" option which tries to save the current state of the DOM and put it back into a WARC file. This turned out to be more tricky than it seemed at first, especially due to <iframe> elements, which can be nested arbitrarily deep. I think currently we are saving them to a separate entry, but you might be able to do it through clever use of nested srcdoc= attributes.
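
For illustration, nested srcdoc inlining could be sketched roughly as below. This is not how Webrecorder does it; `inlineFrames` is a hypothetical name, only same-origin frames are handled (cross-origin frames expose no contentDocument), and the sketch modifies the documents it is given, which on a live page would make the frames reload from their new srcdoc.

```javascript
// Recursively copy the content of same-origin iframes into srcdoc attributes.
// Inner frames are handled first, so that the serialised outer frame already
// contains the srcdoc of the frames nested inside it.
function inlineFrames(doc) {
  for (const iframe of Array.from(doc.querySelectorAll('iframe'))) {
    const innerDoc = iframe.contentDocument; // null for cross-origin frames
    if (!innerDoc) {
      continue; // left untouched; would need removal or another strategy
    }
    inlineFrames(innerDoc);
    // The DOM escapes attribute values when serialising, so the nested
    // markup needs no manual escaping here.
    iframe.setAttribute('srcdoc', innerDoc.documentElement.outerHTML);
    iframe.removeAttribute('src');
  }
}
```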

Since Webrecorder already has the images and external links saved through the transactional recording, we need not worry about those, but we do remove all the <script> elements.

Of course, this approach will likely break video or other objects, or tags that refer to local storage, eg. blob: unless there's a way to rewrite these into an external format. This will result in lower fidelity than a user may expect when viewing the live page.

I guess my question would be: are you trying to produce a single HTML file, or do you want to use an existing format that contains HTML? If the latter, then I think it really makes sense to work with an existing format to maintain interoperability.

Treora commented 7 years ago

@ikreymer Hi Ilya, thanks for popping in and thinking along!

Saving and replaying transactions like you do with WARC (or HAR) is indeed the right way to archive content as truthfully as possible. However, for a few reasons my idea is to now take a different approach, and save the rendered DOM instead.

It seems impossible to register transactions from a browser extension; the API in the link you shared appears to be usable only when the dev-tools window is open (also, it is not available in Firefox). Through the normal webRequest API, one cannot obtain the response bodies, only the headers. A browser extension that gets close by working around the limitations is WARCreate, which (if I understood it right) does tricks like reading stylesheet URLs from the DOM and fetching every URL again in the background.
The transactions give you the sources, so for scripted pages you would have to rerun every script to obtain the same rendered document again (while hoping the scripts are deterministic, and future browsers are compatible). I would like to have a robust copy of the rendered page, which may also be stored more efficiently.
Following from the previous point, 'compiling' the sources to a static rendered html file makes it easier to analyse and process it, refer reliably to parts of its content (for creating links/annotations), and also makes it possible to edit the document; note that the goal here may go beyond just archiving the original page for viewing purposes, more towards really owning your copy of the document.
Whether to make it a single file, by inlining images and stylesheets etcetera, goes one step further and is perhaps a somewhat separate question. This could be omitted, but having everything in a single file greatly reduces the complexity of managing documents and their dependencies, keeping them all together when sharing them, updating them when editing, etcetera. It thereby also enables people to just take the file and use normal file managing and sharing tools. Viewers need not be aware of archiving formats; by adding the right metadata (e.g. original urls of inlined images), archive-aware parsers could however reconstruct information about the original sources: progressive enhancement.
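
As a small illustration of the progressive-enhancement idea in the last point: an archive-aware tool could read the metadata back out of a frozen document. The sketch below assumes that inlined elements carry rel="dc:source" and a resource attribute, as proposed earlier in this thread; `listOriginalSources` is a hypothetical name.

```javascript
// Collect the original URLs recorded on inlined elements, assuming they were
// annotated with rel="dc:source" and a resource attribute.
function listOriginalSources(doc) {
  // ~= matches when "dc:source" is one of the space-separated rel values.
  const annotated = doc.querySelectorAll('[rel~="dc:source"][resource]');
  return Array.from(annotated, element => element.getAttribute('resource'));
}
```

A normal viewer would simply ignore these attributes and display the inlined data.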

So, to answer your question, I think I am trying to produce a single HTML file, and one reason is that it is an existing format that maintains interoperability: however, not interoperability with archiving tools, but with file managers, document viewers and editors. It would be nice to have both, but I do not see a pragmatic way to do this now. I'd be glad to hear your thoughts. A future idea could be to also create WARCs, for genuine archiving, but that would be a separate endeavour.

I forgot that webrecorder also supports making a static snapshot. Is processing the DOM (e.g. removing the scripts) done on the Python side? (or at least it seems not to happen in the page itself) Also thanks for the iframe srcdoc tip, I had not thought much about inlining iframes.

Treora commented 7 years ago

First basic implementation added in #100.

Many improvements are still to be made. I intend to spin the code in src/freeze-dry off into its own repo and module at some point, to develop and publish it independently. (edit: it now has a repo and is published as an independent module)

WebMemex / webmemex-extension