WebMemex / webmemex-extension

📇 Your digital memory, as a browser extension
https://webmemex.org

Deduplicate pages #22

Closed Treora closed 7 years ago

Treora commented 7 years ago

When the same page is visited multiple times, it is silly to create a new copy in the database for every visit. A non-trivial question, however, is when to consider two pages to be the same:

  1. The easy case: if they have the same url and the same content (compare ETags or hashes), they are the same.
  2. The easy opposite case: if they have different urls and different content, they are not the same.
  3. The difficult case: if they have the same url, but the content has changed; we may want to consider them the same if they are very similar (e.g. only ads changed), but different for something like a news site's front page, whose content genuinely changes.
  4. Also, the case nobody cares about: if they have different urls but the same content. Though it may not happen often enough to waste much space, there is an interesting aspect: spotting the sameness would allow e.g. reattaching the same notes. Probably something for later though.

Determining sameness can be done very crudely at first, and improved later to better detect the almost-the-same situations in case 3 above (and maybe case 4).
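
A crude first pass could classify these cases purely from the url plus a hash of the extracted text, along the lines of the sketch below (the function and field names are illustrative, not existing code):

```js
const crypto = require('crypto')

// Hash the extracted text so "same content" can be checked cheaply
// (an ETag, when available, could serve the same purpose).
function contentHash(text) {
    return crypto.createHash('sha256').update(text || '').digest('hex')
}

// Crude classification into the four cases above. Case 3 is the one that
// a smarter, similarity-based check would later refine.
function compareVisits(oldPage, newPage) {
    const sameUrl = oldPage.url === newPage.url
    const sameContent = contentHash(oldPage.text) === contentHash(newPage.text)
    if (sameUrl && sameContent) return 'same'        // case 1
    if (!sameUrl && !sameContent) return 'different' // case 2
    if (sameUrl) return 'same-url-changed-content'   // case 3
    return 'same-content-different-url'              // case 4
}
```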

blackforestboi commented 7 years ago

I can imagine that detecting content "sameness" can be quite difficult if the page contains dynamic content.

How can we circumvent this? With news articles, I have the impression that readability does quite a decent job of cutting out the clutter, but not so much with other types of HTML content.

I don't know if that idea already came up, but how about storing the text extracted by readability as priority 1 text, and the diff between body.content.innerText and readability_text as priority 2 text? AFAIK pouchDB quick search lets you boost certain fields.
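
For what it's worth, pouchdb-quick-search does support per-field boosting, so the two-tier text idea could look roughly like this (the document fields `readableText` and `extraText` are made up for illustration):

```js
const PouchDB = require('pouchdb')
PouchDB.plugin(require('pouchdb-quick-search'))

const db = new PouchDB('pages')

// Search both text tiers, weighting the readability output higher.
function searchPages(query) {
    return db.search({
        query,
        fields: {
            readableText: 5,  // "priority 1": text extracted by readability
            extraText: 1,     // "priority 2": the leftover innerText diff
        },
        include_docs: true,
    })
}
```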

I also liked the idea you had about creating content type (or even domain)-specific parser libraries.

Another idea, which I believe came from @bwbroersma, was to just store the diffs from each visit as plain text, which means they are at least searchable. We could just append them to the text field.

bwbroersma commented 7 years ago

Storing multiple visits of the same URL would be nice for detecting this kind of change (there is a Medium post about it). Just storing (compressed) diffs, even of the complete HTML text, should be quite minimal; if that is not doable, do a (compressed) diff on the readability DOM (preferred over just the innerText), so you store edits in a meaningful way. Using a preset deflate dictionary per language would result in a minimal-size diff, but remember this only becomes a problem when size becomes an issue, and a fix can be applied retroactively with an update.
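
As a rough illustration, such a stored diff could be produced with an off-the-shelf text-diff library and deflated with a preset dictionary; this sketch uses the diff (jsdiff) and pako packages and an invented mini-dictionary, so treat it as an assumption rather than a description of existing code:

```js
const Diff = require('diff')  // jsdiff
const pako = require('pako')

// A preset dictionary of common fragments (per language / per markup style)
// makes small diffs compress even better. This one is just a placeholder.
const dictionary = Buffer.from('<div><p></p></div> the and of to in a is')

// Store only a compressed patch from the previous version to the new one.
function compressedDiff(oldText, newText) {
    const patch = Diff.createPatch('page', oldText, newText)
    return pako.deflate(patch, { dictionary })
}

// Reconstruct the new version from the old text plus the stored diff.
function applyCompressedDiff(oldText, compressed) {
    const patch = pako.inflate(compressed, { dictionary, to: 'string' })
    return Diff.applyPatch(oldText, patch)
}
```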

blackforestboi commented 7 years ago

@Treora

What's the plan?

Treora commented 7 years ago

The plan

The plan is to create a flexible algorithm and data model that gives the application freedom to choose which pages to archive verbatim, which to forget the contents of, and which to pretend never existed, while references between the page objects can be used to express their relations (e.g. sameAs, updateTo). So some page objects may refer to another one and specify a diff to apply to it, some may just refer to another one and neglect to specify the differences, and again others may just say "there was a page at this url with this title but I forgot about the rest of the contents" (e.g. those pages that were imported from browser history). The behaviour may be url-dependent, can change over time, and changes can be applied retrospectively later on, for example to compress or forget things after some months.
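
To make that freedom concrete, the stored page objects could take shapes roughly like the following (the _id scheme and field names are illustrative, not the actual schema):

```js
// A fully archived page: contents stored verbatim.
const fullPage = {
    _id: 'page/2017-01-10',
    url: 'https://example.com/article',
    title: 'Some article',
    text: 'full extracted content...',
}

// A revisit that turned out to be (nearly) the same: no content of its own,
// just a reference expressing the relation, optionally with a diff to apply.
const dedupedPage = {
    _id: 'page/2017-01-11',
    url: 'https://example.com/article',
    sameAs: 'page/2017-01-10',
    // diff: '...patch against the referenced page...',
}

// A page imported from browser history: "I forgot about the rest of the contents".
const stubPage = {
    _id: 'page/2016-12-01',
    url: 'https://example.com/old-post',
    title: 'An old post',
}
```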

The algorithm

Save as new page

I am now thinking to, in principle, create a new page object in the database every time a url is (re)visited, and then analyse and store the page contents. So at first it always creates a duplicate, because it is unknown whether that url still serves the same document as before.

Compare with previous visits

Directly after the content analysis, or possibly at any later time, this new page object can be compared with any previous visit(s) of that same url, to determine whether the page has remained exactly/partly/hardly the same. These algorithms can be improved later, and could use simple heuristics to start with.
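
One simple starting heuristic could be a word-overlap (Jaccard) similarity on the extracted text, thresholded into exactly/partly/hardly the same; the thresholds below are arbitrary and just for illustration:

```js
// Jaccard similarity between the word sets of two extracted texts.
function textSimilarity(textA, textB) {
    const tokens = text => new Set(text.toLowerCase().split(/\W+/).filter(Boolean))
    const a = tokens(textA)
    const b = tokens(textB)
    const intersection = [...a].filter(word => b.has(word)).length
    const union = new Set([...a, ...b]).size
    return union === 0 ? 1 : intersection / union
}

// Map the similarity onto coarse sameness levels.
function samenessLevel(oldPage, newPage) {
    const similarity = textSimilarity(oldPage.text || '', newPage.text || '')
    if (similarity > 0.95) return 'exactly'
    if (similarity > 0.6) return 'partly'
    return 'hardly'
}
```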

Deduplicate if desired

Depending on the computed level of sameness and the application's preferences, any of several deduplication actions can then be taken:

  1. reverting (deleting) either the older or newer page object and updating any pointers to it to point to the remaining one
  2. forgetting the contents of the old (or possibly the new) page object but still keeping it in the database, instead adding a field saying "look at that page object instead, it is the same" (or "same enough", or "possibly different, but who cares")
  3. the same as action 2 but also adding the diff between them (i.e. "look at that page object but apply this diff")
  4. or no action, thus storing both separately without deduplication.

Currently the code always performs the last one, without doing any comparison. A similarly easy and perhaps more common approach would be to hardcode the first one for every visit (i.e. also skipping the comparison and always overwriting the old version). But I think the sweet spot is often in the middle.

Note that any of the four actions can be taken for any level of sameness. Some combinations (e.g. action 1 or 2 when a page was significantly changed) will result in a loss of knowledge, but the application may find that acceptable.

Implementation

I intend to first implement the easy parts. Storing e.g. compressed diffs is a nice idea, and so is creating an advanced perceived-sameness detection, but they can wait. As a beginning, it may be okay to always take action 2, i.e. replacing an old page's contents with a pointer to the newly created page object, so only the most recent versions are stored. As an exception, action 4 (plain duplication) could be applied to pages that have been pinned/bookmarked manually or that match some other rules we may wish to specify (e.g. pages having an empty url path). Also, we could pragmatically decide to reuse an existing page object (let's call that action 0) if a url is revisited the same day.
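
A minimal sketch of "always take action 2" against a PouchDB database could look like this (function and field names are hypothetical, not the extension's actual code):

```js
// Action 2: replace an old page's contents with a pointer to the newer page object.
async function deduplicateOldPage(db, oldPageId, newPageId) {
    const oldPage = await db.get(oldPageId)
    return db.put({
        _id: oldPage._id,
        _rev: oldPage._rev,
        url: oldPage.url,
        title: oldPage.title,
        seeInstead: newPageId,  // "look at that page object instead, it is the same (enough)"
        // note: not copying the text/content fields forgets the old contents
    })
}
```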


This is more or less the idea as it has developed in my mind and that I have started implementing. Feedback is welcome. Maybe some Benjamin (@bwbroersma, @bigbluehat) already sees holes in the plan or knows of existing work to build on?

bohrium272 commented 7 years ago

The above algorithm seems pretty good. What @oliversauter and I had discussed earlier was to save articles keyed by hashes such as md5. That way each article gets a unique entry in the database. To handle multiple visits to the same article, an array of visit times is stored. I really like the diff tree implementation of the page content. However, the database will have to be changed to create a link between the visit times and the corresponding keywords.
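
A minimal sketch of that hash-keyed approach, assuming the extracted article text is hashed and each revisit only appends a timestamp (all names here are illustrative):

```js
const crypto = require('crypto')

// Key each article by a hash of its content; revisits append to visitTimes.
async function recordVisit(db, url, text) {
    const hash = crypto.createHash('md5').update(text).digest('hex')
    const _id = `article/${hash}`
    try {
        const doc = await db.get(_id)
        doc.visitTimes.push(Date.now())
        return db.put(doc)
    } catch (err) {
        if (err.status !== 404) throw err
        return db.put({ _id, url, text, visitTimes: [Date.now()] })
    }
}
```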

blackforestboi commented 7 years ago

To handle multiple visits to the same article, an array of visit times is stored.

I think this is the way you intend it in your current model, right @Treora?

Treora commented 7 years ago

Each visit is a separate object in the database, with a pointer to the page that was visited. I am not certain about using hashes as identifiers; it only makes sense if the data is not changing, while in the current model the page object can be updated with e.g. improved content analysis. Addressing immutable objects with hashes is nice for content-addressable storage, and I am still racking my brain over whether and how to make use of systems like IPFS & IPNS for storing and/or publishing one's documents. This may have to wait a while though.
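
For clarity, a visit as described here would be its own small document pointing at a page object, roughly like this (field names illustrative):

```js
// Each visit is a separate object; the page it points to stays mutable
// (e.g. updated with improved content analysis), so it is not hash-addressed.
const visit = {
    _id: 'visit/2017-01-10T12:00:00Z',
    page: 'page/2017-01-10',    // pointer to the visited page object
    visitStart: 1484049600000,  // timestamp of the visit
}
```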

Treora commented 7 years ago

Basic implementation provided in #60, closing this now. Improvements can be added as new issues.