bdillahu opened this issue 7 years ago
Local archiving is definitely something interesting to look into. wget wouldn't work because, these days, a lot of websites load data using AJAX, and an increasing number are pretty much entirely dynamic.
An approach using headless browsers (see HeadlessBrowsers), where we wait a set time for the page to execute its JavaScript and then save the resulting DOM tree, could be useful.
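For concreteness, here is a minimal sketch of that idea using Puppeteer; the library choice, the function name, and the fixed wait are my own assumptions, not something this project already uses:

```js
const fs = require('fs');
const puppeteer = require('puppeteer');

// Load a page headlessly, give its JavaScript a fixed amount of time to run,
// then save the resulting DOM tree as HTML.
async function snapshotPage(url, outFile, waitMs = 5000) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  await new Promise(resolve => setTimeout(resolve, waitMs)); // the "set time" wait
  const html = await page.content(); // serialized DOM after scripts have run
  fs.writeFileSync(outFile, html);
  await browser.close();
}

snapshotPage('https://example.com', 'example.html');
```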
Another approach I was looking into was this =>
I think the third approach would be the best one, with the highest accuracy. (PS: infinite-scroll pages wouldn't work with the headless browser approach.)
If anyone has more ideas for this, please do share! :). I've already done some work on this that hasn't made it to GitHub yet (in a Chrome extension repo).
This is on the roadmap to come pretty soon.
We may want to define what we want the archive to actually store... there is the web page itself (HTML, CSS, etc.), the text of the web page (for full-text searching the archive, etc.), or an image of the web page (to show how it looked, maybe even with ads or whatever sidebar kind of things you wouldn't normally store).
I've used things like elinks to grab the basic HTML (for text searching) and http://cutycapt.sourceforge.net/ to grab an image. Actually worked fairly well.
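For reference, that combination is easy to script; the wrapper below is just a sketch (the flags are standard for those tools, but nothing here is part of this project):

```js
const fs = require('fs');
const { execFile } = require('child_process');

// Plain-text dump of the page via elinks (useful for full-text search).
execFile('elinks', ['-dump', 'https://example.com'], (err, stdout) => {
  if (!err) fs.writeFileSync('page.txt', stdout);
});

// Rendered image of the page via CutyCapt.
execFile('CutyCapt', ['--url=https://example.com', '--out=page.png'], (err) => {
  if (err) console.error('CutyCapt failed:', err);
});
```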
Your content injection sounds neat, but you're past my skill level with that one :-)
Another thought is where you store it... storing the text in the database for searching makes sense, but storing the actual HTML/whatever in an external file/archive (as an option, maybe?) would make sense to me.
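To make that split concrete, it could look something like this; every name here (the table, the columns, and the helper itself) is hypothetical:

```js
const fs = require('fs');
const path = require('path');

// Keep the searchable text in the database row; push the raw HTML out to a file
// and only store a pointer to it.
function archiveBookmark(db, bookmarkId, pageText, pageHtml, archiveDir) {
  const htmlPath = path.join(archiveDir, `${bookmarkId}.html`);
  fs.writeFileSync(htmlPath, pageHtml);
  db.run(
    'UPDATE bookmarks SET page_text = ?, archive_path = ? WHERE id = ?',
    [pageText, htmlPath, bookmarkId]
  );
}
```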
Just some thoughts :-)
Hey @bdillahu! Apologies for the delayed response.
What we want to store:
a) We might want to store the entire visible text of the page. This can help us power full-text search, and it would be better than server-side crawling.
b) For HTML/CSS, I think a better alternative would be to take a DOM snapshot. See this:
```js
// Clone the DOM, drop script/noscript nodes, and serialize the rest to HTML.
var page_content = $("html").clone().find("script,noscript").remove().end().html();

// Rewrite all relative URLs as absolute ones so the snapshot works standalone.
page_content = replace_all_rel_by_abs(page_content);

// Minify the snapshot (html-minifier options).
page_content = minify(page_content, {
    removeComments: true,
    removeCommentsFromCDATA: true,
    removeCDATASectionsFromCDATA: true,
    collapseWhitespace: true,
    conservativeCollapse: true,
    collapseBooleanAttributes: true,
    removeAttributeQuotes: true,
    removeRedundantAttributes: true,
    minifyURLs: false,
    removeEmptyElements: true,
    removeOptionalTags: true
});
```
This code produces an almost perfect snapshot of the DOM. It is run as a content script by the extension.

Storing images might pose a problem: either we compress them, or they will end up taking a huge amount of space. Plus, without some sort of CV pipeline that makes the images searchable, there isn't a lot of benefit to archiving them for search. That said, an image of the page would let us offer a more visual representation of the bookmarks, and I'm in full support of that :).
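Going back to the content-script point: the hand-off to the rest of the extension could look roughly like this (the message shape is made up, but chrome.runtime.sendMessage is the standard channel for it):

```js
// (a) visible text for full-text search, (b) the minified DOM snapshot from above.
var visible_text = document.body.innerText;
chrome.runtime.sendMessage({
  type: 'page_snapshot',
  url: window.location.href,
  text: visible_text,
  html: page_content
});
```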
On your point about cutycapt: it does work, but for dynamic pages we need to go through the extension and get the DOM snapshot. Right now we use PhantomJS for this, which is a mid-way approach.
On the storing-the-files part: would exporting from the database into HTML perhaps make sense for you?
Obviously I agree on the image of the page for a bookmark... that sounds good :-)
Not sure I concur on images not being useful even if they aren't searchable. Part of my desire is that I have bookmarked a page and I'm archiving it... often without the pictures, it isn't going to be a lot of use (obviously depends on the page). Disk is cheap :-)
Exporting to HTML/MAFF/whatever would work... doesn't necessarily have to be stored in that format, but I think some way of "getting it out" and using it elsewhere would be important.
The average size of a webpage is 2MB. About 60% of that is taken up by images. So we're saving 1.2MB per page. ref
The images don't add up to much, though, given that disk is cheap. And that would indeed make for a more complete bookmark. But this needs to be a granular setting then.
I'd suggest opening a separate feature request for the exporting ability.
Having another archive provider that is a local "service" which somehow (wget maybe) saves a web page archive locally would be a great addition, I think.
Not sure of the best format... MAFF is good, although I've had issues finding non-Firefox support for it.
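If it helps, a local provider like that could simply shell out to wget for static pages (dynamic pages would still need the extension/headless route discussed above); this is only a sketch:

```js
const { execFile } = require('child_process');

function archiveLocally(url, targetDir) {
  execFile('wget', [
    '--page-requisites',              // also fetch the images/CSS/JS the page needs
    '--convert-links',                // rewrite links so the copy works offline
    '--adjust-extension',             // save files with sensible extensions
    '--directory-prefix=' + targetDir,
    url
  ], (err) => {
    if (err) console.error('wget failed:', err);
  });
}
```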