WebMemex / webmemex-extension

📇 Your digital memory, as a browser extension
https://webmemex.org

Framework for page deduplication #60

Closed: Treora closed this 7 years ago

Treora commented 7 years ago

As discussed in issue #22.

A lot of code to do relatively little, but it provides the basic framework and a simple first implementation for checking whether two pages that were presumably retrieved from the same URL are the same document, completely different pages, or something in between (the more recent one presumably being an update of the other). It is a complex workaround for the lack of versioning in URLs.

Ideally we could tell directly whether a page is the same as last time or not, for example by checking the ETag HTTP header, if present. The code needed for such checks still has to be implemented, but the required boilerplate is included here (tryReidentifyPage in src/page-storage/store-page.js).
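For illustration, here is a minimal sketch of what such an ETag check could look like. The PR says this code is still to be implemented, so everything below is an assumption: the HEAD request, the etag field on page docs, and the return convention are all hypothetical, not the actual boilerplate in store-page.js.

// Hypothetical sketch of ETag-based re-identification; the real
// tryReidentifyPage in src/page-storage/store-page.js is not yet implemented.
// Assumes page docs store an `etag` field captured at the previous visit,
// and that the server permits a cross-origin HEAD request.
async function tryReidentifyPage({url, candidatePage}) {
    if (!candidatePage || !candidatePage.etag) {
        return undefined // nothing known to compare against
    }
    try {
        // A HEAD request retrieves only the headers, not the body.
        const response = await fetch(url, {method: 'HEAD'})
        const etag = response.headers.get('ETag')
        if (etag && etag === candidatePage.etag) {
            // The server reports the same entity: reuse the stored page
            // doc instead of creating a duplicate.
            return candidatePage
        }
    } catch (err) {
        // Network error or blocked request: treat the page as unidentified.
    }
    return undefined // fall through to the normal analyse-and-compare path
}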

If a page was not reidentified beforehand, a new page doc is created in the database, and the usual page analysis is run on it. Afterwards the analysed page contents are compared against the candidate page (= the previous page we got from this URL), currently by a simple text comparison of the body.innerText and title of the two pages (see src/page-storage/sameness.js). A level of 'sameness' is determined (rather subjectively), and this level is then used to decide what to do with the two pages (in src/page-storage/deduplication.js); see the sketch below. Currently there are two possible actions: either keep both pages as-is, or forget the analysed contents of the older page and add a seeInstead field that redirects any reader to the new page, much like an HTTP redirect. Which actions are taken for which sameness levels can be worked out in more detail later.
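To make that flow concrete, here is a rough sketch of the compare-then-decide logic. The sameness level names, the doc fields, and the PouchDB-style db.put call are illustrative assumptions, not a copy of sameness.js or deduplication.js.

// Simplified sketch of the compare-then-decide flow described above.
// The real code lives in src/page-storage/sameness.js and
// src/page-storage/deduplication.js; the names below are assumptions.
const Sameness = {DIFFERENT: 0, SIMILAR: 1, IDENTICAL: 2}

function determineSameness(oldPage, newPage) {
    const sameTitle = oldPage.title === newPage.title
    const sameText = oldPage.bodyText === newPage.bodyText // ≈ body.innerText
    if (sameTitle && sameText) return Sameness.IDENTICAL
    if (sameTitle || sameText) return Sameness.SIMILAR
    return Sameness.DIFFERENT
}

async function deduplicate(db, oldPage, newPage) {
    const sameness = determineSameness(oldPage, newPage)
    if (sameness >= Sameness.SIMILAR) {
        // Forget the older page's analysed contents and leave a pointer
        // to the newer version, much like an HTTP redirect.
        await db.put({
            ...oldPage,
            bodyText: undefined, // dropped on JSON serialisation
            seeInstead: {_id: newPage._id},
        })
    }
    // Otherwise: keep both docs as-is.
}

One design point this makes visible: the seeInstead pointer keeps the older doc around as a stable identifier while delegating its content to the newer version, roughly analogous to an HTTP 301.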

If anybody would like to share their thoughts on the sanity of this direction, I would be glad to hear them. There are some important design choices embedded in here. Regard this code as a first iteration though; I hope the approach will be fleshed out and evolve further over time.

obsidianart commented 7 years ago

It does indeed seem like a lot of code to do one thing. It might be worth it for future development, but as of today my impression is that 50% of it is currently unused and you just end up either replacing a page or not. What is the final picture of this approach? What do you want to give the user?

Treora commented 7 years ago

What is the final picture of this approach? What do you want to give the user?

I want to bring versioning to the web, and provide users with a model of webpages that matches the way people think about them. The view that one URL identifies one document is simply wrong. The current approach, where each time you dereference a URL you get a completely unrelated document, swings too far to the other side. We need something in between.

Today's newspaper front page is a different document than yesterday's, though they share the same URL. Today's view of a specific article may still be the same article, however, even though the advertisements around it changed. And tomorrow the article may have received a small correction, making it a revision of the same document.

obsidianart commented 7 years ago

OK, so it's an approach similar to the Internet Archive's. Good idea.