Own-Data-Privateer / hoardy-web

Passively capture, archive, and hoard your web browsing history, including the contents of the pages you visit, for later offline viewing, mirroring, and/or indexing. Your own personal private Wayback Machine that can also archive HTTP POST requests and responses, as well as most other HTTP-level data.
https://oxij.org/software/hoardy-web/
GNU General Public License v3.0

What is Hoardy-Web?

Hoardy-Web is a suite of tools that helps you passively capture, archive, and hoard your web browsing history. Not just the URLs, but also the contents and the requisite resources (images, media, CSS, fonts, etc) of the pages you visit. Not just the last 3 months, but from the moment you start using it.

Practically speaking, you install the Hoardy-Web browser extension/add-on into your web browser and just browse the web normally while Hoardy-Web passively, in the background, captures and archives the web pages you visit for later offline viewing, mirroring, and/or indexing. Hoardy-Web has a lot of configuration options to help you tweak what should or should not be archived, and it has a very low memory footprint, keeping your browsing experience snappy even on ancient hardware (unless you explicitly configure it otherwise, e.g. to minimize writes to disk instead).

If you just want to start saving your browsing history, you can start using the Hoardy-Web extension independently of other tools that are being developed in this repository. But to display, search, extract useful values from, organize, manipulate, and run scripts over your archived data, you will eventually need to install and use at least the accompanying hoardy-web CLI tool.

To learn more:

Hoardy-Web was previously known as "Personal Private Passive Web Archive" aka "pwebarc".

If you are reading this on GitHub, be aware that this repository is a mirror of a repository on the author's web site. In the author's humble opinion, the rendering of the documentation pages there is superior to what can be seen on GitHub (it's implemented via pandoc there).

Screenshots

Screenshot of Firefox's viewport with extension's popup shown.

Screenshot of Chromium's viewport with extension's popup shown.

See there for more screenshots.

Why does Hoardy-Web exist?

For a relatively non-technical user

So, you wake up remembering something interesting you saw a long time ago. Knowing you won't find it in your normal browsing history, which only contains the URLs and titles of the pages you visited in the last 3 months, you try looking it up on Google. You fail. Eventually, you remember the website you saw it on, or maybe you re-discover the link in question in an old message to/from a friend, or maybe a tool like recoll or Promnesia helps you. You open the link… and discover it's offline, gone, or a parked domain. Not a problem! Have no fear! You go to the Wayback Machine and look it up there… and discover they only archived an ancient version of it and the thing you wanted is missing.

Or, say, you read a cool fanfiction on AO3 years ago, you even wrote down the URL, you go back to it wanting to experience it again… and discover the author made it private... and Wayback Machine saved only the very first chapter.

"If it is on the Internet, it is on Internet forever!" they said. They lied!

Things vanish from the Internet all the time. The Wayback Machine is awesome, but

Meanwhile, Hoardy-Web solves all of the above out-of-the-box (though, the full-text search is currently being done by other tools running on top of it).

For a user with accessibility or comfort requirements

Say, there is a web page that cannot be easily reached via curl/wget (because it is behind a paywall or a complex authentication method that is hard to reproduce outside of a browser), but, for accessibility or simple reading-comfort reasons, each time you visit that page you want to automatically feed its source to a script that strips and/or modifies its HTML markup in a website-specific way and then feeds it into a TTS engine, a Braille display, or a book-reader app.

With most modern web browsers you can do TTS either out-of-the-box or by installing an add-on (though, be aware of privacy issues when using most of these), but tools that can do website-specific accessibility without also being website-specific UI apps are very few.

Meanwhile, Hoardy-Web with some scripts can do it.

For a technical user

Say, there's a web page/app you use (like a banking app), but it lacks some features you want, and in your browser's Network Monitor you can see it uses JSON RPC or some such to fetch its data, and you want those JSONs for yourself (e.g., to compute statistics and supplement the app output with them), but the app in question has no public API and scraping it with a script is non-trivial (e.g., the site does complicated JavaScript+multifactor-based auth, tries to detect you are actually using a browser, and bans you immediately if not).

Or, maybe, you want to parse those behind-auth pages with a script, save the results to a database, and then do interesting things with them (e.g., track price changes, manually classify, annotate, and merge pages representing the same product by different sellers, do complex queries, like sorting by price/unit or price/weight, limit results by geographical locations extracted from text labels, etc).
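The "parse pages into a database, then run interesting queries" idea above can be sketched in a few lines of Python. The schema and the data below are purely hypothetical; in practice you would fill the table from pages parsed out of your own archives:

```python
import sqlite3

# Hypothetical schema: product offers parsed from archived pages.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE product (
        name   TEXT,
        seller TEXT,
        price  REAL,  -- total price
        units  REAL   -- e.g. grams, litres, item count
    )
""")
db.executemany(
    "INSERT INTO product VALUES (?, ?, ?, ?)",
    [
        ("rice 1kg",  "seller-a",  3.0, 1000),
        ("rice 5kg",  "seller-b", 12.5, 5000),
        ("rice 500g", "seller-c",  2.0,  500),
    ],
)

# One of the queries mentioned above: sort offers by price per unit.
for name, seller, ppu in db.execute(
    "SELECT name, seller, price / units FROM product ORDER BY 3"
):
    print(f"{name} ({seller}): {ppu:.4f} per unit")
```

The same pattern extends to the other queries mentioned above (price tracking over time, grouping the same product across sellers, etc).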

Or, say, you want to fetch a bunch of pages belonging to two recommendation lists on AO3 or GoodReads, get all outgoing links for each fetched page, union sets for the pages belonging to the same recommendation list, and then intersect the results of the two lists to get a shorter list of things you might want to read with higher probability.
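The set logic of that last example is trivial once the pages are archived locally. A minimal sketch in Python, where `outgoing_links` is a hypothetical callable returning the set of outgoing URLs of one archived page (with Hoardy-Web you would extract these from pages already in your archive instead of re-fetching them):

```python
def union_of_links(pages, outgoing_links):
    """Union of outgoing links over all pages of one recommendation list."""
    result = set()
    for page in pages:
        result |= outgoing_links(page)
    return result

def shortlist(list_a, list_b, outgoing_links):
    """Links recommended by both lists."""
    return (union_of_links(list_a, outgoing_links)
            & union_of_links(list_b, outgoing_links))

# Tiny stand-in data, just to demonstrate the set logic:
fake_links = {
    "a1": {"x", "y"}, "a2": {"z"},
    "b1": {"y", "z"}, "b2": {"w"},
}
print(sorted(shortlist(["a1", "a2"], ["b1", "b2"], fake_links.__getitem__)))
# → ['y', 'z']
```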

Or, more generally, say, you want to tag web pages referenced from a certain set of other web pages with some tag in your indexing software, and update it automatically each time you visit any of the source pages.

Or, say, you want to combine a full-text indexing engine, your browsing and derived web link graph data, your states/ratings/notes from org-mode, messages from your friends, and other archives, so that you could do arbitrarily complex queries over it all, like "show me all GoodReads pages for all books not marked as DONE or CANCELED in my org-mode files, ever mentioned by any of my friends, ordered by undirected-graph Pagerank algorithm biased with my own book ratings (so that books sharing GoodReads lists with the books I finished and liked will get higher scores)". So, basically, you want a private personalized Bayesian recommendation system.

"Everything will have a RESTful API!" they said. They lied! A lot of useful stuff never got RESTful APIs; those RESTful APIs that do exist are frequently buggy, and you'll probably have to scrape data from HTML anyway.

"Semantic Web will allow arbitrarily complex queries spanning multiple data sources!" they said. Well, 25 years later ("RDF Model and Syntax Specification" was published in 1999), there has been almost no progress: the most commonly used subset of RDF does what indexing systems in the 1970s did, but less efficiently and with a worse UI.

Meanwhile, Hoardy-Web provides some of the tools to help you build your own little local data paradise.

Highlights

The Hoardy-Web browser extension runs under desktop versions of both Firefox- and Chromium-based browsers as well as under Firefox-for-Android-based browsers.

Hoardy-Web's main workflow is to passively collect and archive HTTP requests and responses (and, if you ask, also DOM snapshots, i.e. the contents of the page after all JavaScript was run) directly from your browser as you browse the web.

Therefore, Hoardy-Web allows you to

all the while

Hoardy-Web can archive collected data

In other words, Hoardy-Web is your own personal private Wayback Machine which passively archives everything you see and, unlike the original Wayback Machine, also archives HTTP POST requests and responses, and most other HTTP-level data.

Also, unless configured otherwise, Hoardy-Web will dump and archive collected data immediately, to both prevent data loss and to free the used RAM as soon as possible, keeping your browsing experience snappy even on ancient hardware.

Compared to most of its alternatives, Hoardy-Web DOES NOT:

Technically, Hoardy-Web is most similar to

Or, to summarize it another way, you can view Hoardy-Web as an alternative to mitmproxy which leaves the SSL/TLS layer alone and hooks into the target application's runtime instead.

In fact, an unpublished and now irrelevant ancestor project of Hoardy-Web was a tool to generate website mirrors from mitmproxy stream captures. (By the way, if you want that, hoardy-web CLI tool can do that for you. It can take mitmproxy dumps as inputs.) But then I got annoyed by all the sites that don't work under mitmproxy, did some research into the alternatives, decided there were none I wanted to use, and so I made my own.

Parts and pieces

Required

Optional, but convenient

Optional, but almost always required at some point

Optional, but useful

Technical Philosophy

Hoardy-Web is designed to

To conform to the above design principles

Hoardy-Web expects you to treat your older pre-WRR archives you want to convert to WRR similarly:

This way, if hoardy-web has some unexpected bug, or hoardy-web import adds some new feature, you could always re-import them later without losing anything.

Supported use cases

For a relatively non-technical user

Currently, Hoardy-Web has two main use cases for regular users, in both of which you first capture some data using the add-on and then you either

For a more technical user

Alternatively, you can programmatically access that data by asking the hoardy-web CLI tool to dump WRR files into JSONs or verbose CBORs for you, or you can just parse WRR files yourself with readily-available libraries.
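For instance, a script consuming such a JSON dump might look like the following sketch. The field names here ("url", "status") are purely illustrative, not the tool's actual schema; consult the hoardy-web documentation for the real output format:

```python
import json

# Illustrative stand-in for a line-delimited JSON dump of captures.
sample_dump = """
{"url": "https://example.org/a", "status": 200}
{"url": "https://example.org/b", "status": 404}
"""

records = [json.loads(line) for line in sample_dump.strip().splitlines()]

# Keep only successfully fetched URLs.
ok = [r["url"] for r in records if r["status"] == 200]
print(ok)  # → ['https://example.org/a']
```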

Since the whole of hoardy-web (the project) adheres to the philosophy described above, the simultaneous use of Hoardy-Web (the extension) and hoardy-web (the tool) helps immensely when developing scrapers for uncooperative websites: you just visit them via your web browser as normal, then, possibly years later, use the hoardy-web tool to organize your archives and conveniently programmatically feed the archived data into your scraper without the need to re-fetch anything.

Given how simple the WRR file format is, you can modify any HTTP library to generate WRR files, thus allowing you to use the hoardy-web tool with data captured by other software, and use data produced by the Hoardy-Web extension as inputs to your own tools.

Which is why, personally, I patch some of the commonly available FLOSS website scrapers to dump the data they fetch as WRR files, so that in the future I can write my own better scrapers and indexers and immediately test them on a huge database of already-collected inputs.

Also, as far as I'm aware, hoardy-web can do more useful things with WRR archives than any other tool can do with any other file format for HTTP dumps, with the sole exception of WARC.

What does it do, exactly? I have questions.

Does the author eat what he cooks?

Yes. As of October 2024, I have been archiving all of my web traffic with Hoardy-Web, without any interruptions, since October 2023. Before that, my preferred tool was mitmproxy.

After adding each new feature to the hoardy-web tool, as a rule, I feed at least the last 5 years of my web browsing into it (at the moment, most of it converted from other formats to .wrr, obviously) to see if everything works as expected.

Quickstart

Install Hoardy-Web browser extension/add-on

... check it actually works

Now load any web page in your browser. The extension will report if everything works okay, or tell you where the problem is if something is broken.

... and you are done

Assuming the extension reported success: Congratulations! You are now collecting and archiving all your web browsing traffic originating from that browser. Repeat extension installation for all browsers/browser profiles as needed.

Technically speaking, if you just want to collect everything and don't have time to figure out how to use the rest of this suite of tools right this moment, you can stop here and figure out how to use the rest of this suite later.

When I started using mitmproxy to sporadically collect my HTTP traffic in 2017, it took about 6 months before I had to refer back to previously archived data for the first time. So, I recommend you start collecting immediately and be lazy about the rest. Also, while doing that, I learned a lot about the nefarious things some of the websites I visit do in the background; now you are going to learn the same.

In practice, though, you will probably want to install at least the hoardy-web-sas simple archiving server (see below for instructions) and switch Hoardy-Web to Submit dumps via 'HTTP' mode pretty soon, because it is very easy to accidentally lose data using other archival methods and, assuming you have Python installed on your computer, it is also the most convenient archival method there is.

Or, alternatively, you can use the combination of archiving by saving of data to browser's local storage (the default) followed by manual export into WRR bundles as described below in the section on using Hoardy-Web together with Tor Browser.

Or, alternatively, you can switch to Export dumps via 'saveAs' mode by default and simply accept the resulting slightly more annoying UI (on Firefox, it can be fixed with a small about:config change) and the fact that you can now lose some data if your disk ever runs out of space or if you accidentally mis-click a button in your browser's Downloads UI.

Recommended next steps

Next, you should read the extension's Help page. It has lots of useful details about how it works and about the quirks of different browsers. If you open it by clicking the Help button in the extension's UI, then hovering over or clicking on links in there will highlight the relevant settings.

See "Setup recommendations" section for best practices for configuring your system and browsers to be used with Hoardy-Web.

Installing the Pythonic parts

Pre-installation

Installation

Alternatively, on a system with Nix package manager

Setup recommendations

In general

Using Hoardy-Web with Tor Browser

When using Hoardy-Web with Tor Browser, you probably want to configure everything in such a way that all of the machinery of Hoardy-Web is completely invisible to web pages running under your Tor Browser, to prevent fingerprinting.

Mostly convenient, paranoid

So, in the mostly convenient yet sufficiently paranoid setup, you would only ever use the Hoardy-Web extension configured to archive captured data to the browser's local storage (which is the default) and then export your dumps manually at the end of a browsing session; see the re-archival instructions.

Yes, this is slightly annoying, but this is the only absolutely safe way to export data out of Hoardy-Web without using submission via HTTP, and you don't need to do this at the end of each and every browsing session.

Simpler, but slightly unsafe

You can also simply switch to using Export dumps via 'saveAs' by default instead.

I expect this to work fine for 99.99% of the users 99.99% of the time, but, technically speaking, this is unsafe. Also, by default, browser's UI will be slightly annoying, since Hoardy-Web will be generating new "Downloads" all the time, but that issue can be fixed with a small about:config change.

Most convenient, less paranoid

In theory, running ./hoardy_web_sas.py listening on a loopback IP address should prevent any web pages from accessing it, since browsers disallow such cross-origin requests, thus making the normal Submit dumps via 'HTTP' mode setup quite viable. However, Tor Browser is configured to proxy everything via the Tor network by default, so you need to configure it to exclude requests to ./hoardy_web_sas.py from being proxied.

A slightly more paranoid than normal way to do this is:

Why? When using Tor Browser, you probably don't want to use 127.0.0.1 and 127.0.1.1, as those are the normal loopback IP addresses used by most things, and you probably don't want to allow any JavaScript code running in Tor Browser to (potentially, if there are any bugs) access those. Yes, if there are any bugs in the cross-domain check code, with this setup JavaScript could discover you are using Hoardy-Web (and then, in the worst case, DoS your system by flooding your disk with garbage dumps), but it won't be able to touch the rest of your stuff listening on your other loopback addresses.

So, while this setup is not super-secure if your Tor Browser allows web pages to run arbitrary JavaScript (in which case, let's be honest, no setup is secure), with JavaScript always disabled, to me, it looks like a completely reasonable thing to do.

Best of both

In theory, you can have the benefits of both invisibility of archival to local storage and convenience, guarantees, and error reporting of archival to an archiving server at the same time:

In practice, doing this manually all the time is prone to errors. Automating this away is on the TODO list.

Then, you can improve on this setup even more by running both the Tor Browser and ./hoardy_web_sas.py in separate containers/VMs.

Alternatives

"Cons" and "Pros" are in comparison to the main workflow of Hoardy-Web. Most similar and easier to use projects first, harder to use and less similar projects later.

archiveweb.page and replayweb.page

Tools most similar to Hoardy-Web in their implementation, though not in their philosophy and intended use.

Pros:

Cons:

Differences in design:

Same issues:

DownloadNet

A self-hosted web app and web crawler written in Node.js most similar to Hoardy-Web in its intended use.

DownloadNet does its web crawling by spawning a Chromium browser instance and attaching to it via its debug protocol, which is a bit weird, but it does work, and, with the exception of Hoardy-Web, it is the only other tool I know of that can archive everything passively as you browse, since you can just browse in that debugged Chromium window and it will archive the data it fetches.

Pros:

Cons:

Same issues:

SingleFile and WebScrapBook

Browser add-ons that capture whole web pages by taking their DOM snapshots and saving all requisite resources the captured page references.

Pros:

Cons:

Differences in design:

WorldBrain Memex

A browser extension that implements an alternative mechanism to browser bookmarks. Saving a web page into Memex saves a DOM snapshot of the tab in question into an in-browser database. Memex then implements a full-text search engine for the saved snapshots and PDFs.

Pros:

Cons:

Differences in design:

But you could just enable request logging in your browser's Network Monitor and manually save your data as HAR archives from time to time.

Cons:

And then you still need something like this suite to look into the generated archives.

mitmproxy

A Man-in-the-middle SSL proxy.

Pros:

Cons:

Though, the latter issue can be solved via this project's hoardy-web tool as it can take mitmproxy dumps as inputs.

But you could set up SSL key dumping and then use Wireshark to capture your web traffic.

Pros:

Cons:

And then you still need something like this suite to look into the generated archives.

ArchiveBox

A web crawler and self-hosted web app into which you can feed the URLs for them to be archived.

Pros:

Cons:

Still, probably the best of the self-hosted web-app-server kind of tools for this ATM.

reminiscence

A system similar to ArchiveBox, but with a built-in tagging system; it archives pages as raw HTML plus a whole-page PNG rendering/screenshot, which is a bit weird, but has the advantage of needing no replay machinery at all for re-viewing simple web pages (a plain image viewer suffices), though those huge whole-page "screenshot" images take a lot of disk space to store.

Pros and Cons are almost identical to those of ArchiveBox above, except it has fewer third-party tools around it, so less stuff can be automated easily.

wget -mpk and curl

Pros:

Cons:

wpull

wget -mpk done right.

Pros:

Cons:

grab-site

A simple web crawler built on top of wpull, presented to you by ArchiveTeam, a group associated with the Internet Archive, which appears to be the source of archives for most of the interesting pages I find there.

Pros:

Cons:

monolith and obelisk

Stand-alone tools doing the same thing SingleFile add-on does: generate single-file HTMLs with bundled resources viewable directly in the browser.

Pros:

Cons:

single-file-cli

Stand-alone tool based on SingleFile, using a headless browser to capture pages.

A more robust solution to do what monolith and obelisk do, if you don't mind nodejs and the need to run a headless browser.

heritrix

The crawler behind the Internet Archive.

It's a self-hosted web app into which you can feed the URLs for them to be archived, so to make it archive all of your web browsing:

Pros:

Cons:

Archivy

A self-hosted wiki that archives pages you link to in background.

Others

The ArchiveBox wiki has a long list of related things.

If you like this, you might also like

Perkeep

It's an awesome personal private archival system adhering to the same philosophy as Hoardy-Web, but it's basically an abstraction replacing your file system with a content-addressed store that can be rendered into different "views", including a POSIXy file system.

It can do very little to help you actually archive a web page, but you can start dumping new Hoardy-Web .wrr files with compression disabled, decompress your existing .wrr files, and then feed them all into Perkeep to be stored and automatically replicated to your backup copies forever. (Perkeep already has better compression than what Hoardy-Web currently does, and it provides a FUSE FS interface for transparent operation, so compressing things twice would be rather counterproductive.)

Meta

Changelog?

See CHANGELOG.md.

TODO?

See the bottom of CHANGELOG.md.

License

GPLv3+, some small library parts are MIT.

Contributions

Contributions are accepted both via GitHub issues and PRs, and via pure email. In the latter case I expect to see patches formatted with git-format-patch.

If you want to perform a major change and you want it to be accepted upstream here, you should probably write me an email or open an issue on GitHub first. In the cover letter, describe what you want to change and why. I might also have a bunch of code doing most of what you want in my stash of unpublished patches already.