I can imagine that detecting content "sameness" can be quite difficult if the page contains dynamic content.
How can we work around this? With news articles, I have the impression that Readability does a quite decent job of cutting everything else out, but not so much with other types of HTML content.
I don't know if that idea has come up already, but how about storing the text extracted by Readability as priority-1 text, and the diff between `body.content.innerText` and `readability_text` as priority-2 text? AFAIK pouchDB quick search lets you boost certain fields.
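For illustration, here is a minimal sketch of how that boosting could look with pouchdb-quick-search, which accepts an object of field weights instead of a plain field list. The field names `readabilityText` and `remainderText` are just placeholders, not anything that exists in the code yet:

```js
// Sketch only: assumes each page doc stores the Readability extract and the
// leftover page text in two hypothetical fields, readabilityText and remainderText.
const PouchDB = require('pouchdb');
PouchDB.plugin(require('pouchdb-quick-search'));

const db = new PouchDB('pages');

function searchPages(query) {
  // The object form of `fields` boosts the Readability extract (priority 1)
  // over the remaining page text (priority 2).
  return db.search({
    query,
    fields: { readabilityText: 3, remainderText: 1 },
    include_docs: true,
  });
}
```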
I also liked the idea you had about creating content-type-specific (or even domain-specific) parser libraries.
Another idea, which I believe came from @bwbroersma, was to just store the diffs from each visit as pure text, which means they are at least searchable. We could simply append them to the `text` field.
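As a rough sketch of that approach, assuming a `page.text` field and using the jsdiff package (only the lines added since the previous visit get appended, so they become searchable):

```js
// Sketch only: `page.text` and how visits reach this function are assumptions.
const Diff = require('diff');

function appendVisitDiff(page, newText) {
  // Diff the previously stored text against the newly extracted text, line by line.
  const parts = Diff.diffLines(page.text, newText);

  // Keep only the lines added in this visit and append them as plain text.
  const addedText = parts
    .filter(part => part.added)
    .map(part => part.value)
    .join('');

  if (addedText) {
    page.text += '\n' + addedText;
  }
  return page;
}
```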
Storing multiple visits of the same URL would be nice for detecting this kind of change (Medium post about it). Just storing (compressed) diffs, even of the complete HTML text, should take very little space; if that is not doable, do a (compressed) diff on the Readability DOM (preferred over just the innerText), so that you store edits in a meaningful way. Using a preset deflate dictionary per language would result in a minimal-size diff, but remember this only becomes a problem when size becomes an issue, and a fix can be applied retroactively with an update.
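A minimal sketch of what storing such a compressed diff could look like, again using jsdiff for the diff itself; Node's zlib does accept a preset `dictionary` option, but the dictionary contents below are a made-up example, not a tuned preset:

```js
// Sketch only: the per-language dictionary is a toy example.
const zlib = require('zlib');
const Diff = require('diff');

// Hypothetical preset dictionary with strings that are common in English pages.
const englishDictionary = Buffer.from('<p></p><div></div> the and of to in that is for ');

function compressDiff(previousText, currentText) {
  // Unified diff between the two versions of the page text.
  const patch = Diff.createPatch('page', previousText, currentText);
  return zlib.deflateSync(Buffer.from(patch), { dictionary: englishDictionary });
}

function decompressDiff(deflatedPatch) {
  // The same dictionary must be passed when inflating.
  return zlib.inflateSync(deflatedPatch, { dictionary: englishDictionary }).toString();
}
```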
@Treora
What's the plan?
The plan is to create a flexible algorithm and data model that give the application the freedom to choose which pages to archive verbatim, which to forget the contents of, and which to pretend never existed, while references between the page objects can be used to express their relations (e.g. sameAs, updateTo). So some page objects may refer to another one and specify a diff to apply to it, some may just refer to another one and neglect to specify the differences, and others again may just say "there was a page at this url with this title, but I forgot about the rest of the contents" (e.g. the pages that were imported from browser history). The behaviour may be url-dependent, can change over time, and changes can be applied retrospectively later on, for example to compress or forget things after some months.
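To make that a bit more concrete, here is a rough illustration of what such page objects might look like; the field names are only placeholders, not an actual schema:

```js
// Illustrative shapes only; none of these field names are final.
const archivedPage = {
  _id: 'page/2017-03-01/abc',
  url: 'https://example.com/article',
  title: 'Some article',
  content: { fullText: '...' },            // archived verbatim
};

const revisitSameContent = {
  _id: 'page/2017-03-08/def',
  url: 'https://example.com/article',
  sameAs: { _id: 'page/2017-03-01/abc' },  // content lives in the referenced object
};

const revisitChangedContent = {
  _id: 'page/2017-04-02/ghi',
  url: 'https://example.com/article',
  updateTo: { _id: 'page/2017-03-01/abc' },
  contentDiff: '...',                      // diff to apply to the referenced page
};

const forgottenPage = {
  _id: 'page/2016-12-24/jkl',
  url: 'https://example.com/old-news',
  title: 'Imported from browser history',  // contents intentionally not stored
};
```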
I am now thinking to, in principle, create a new page object in the database every time a url is (re)visited, and then analyse and store the page contents. So it always creates a duplicate at first, because it is unknown whether that url still serves the same document as before.
Directly after the content analysis, or possibly at any later time, this new page object can be compared with any previous visit(s) of that same url, to determine whether the page has remained exactly/partly/hardly the same. These algorithms can be improved later, and could use simple heuristics to start with.
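As a starting point, such a heuristic could be as crude as the following sketch; the word-set comparison and the threshold are arbitrary choices, just something to iterate on:

```js
// Deliberately crude sameness heuristic; thresholds are arbitrary starting values.
function compareSameness(oldText, newText) {
  if (oldText === newText) return 'exactly-the-same';

  const words = text => new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
  const oldWords = words(oldText);
  const newWords = words(newText);

  let shared = 0;
  for (const word of oldWords) {
    if (newWords.has(word)) shared++;
  }
  const union = oldWords.size + newWords.size - shared;
  const jaccard = union ? shared / union : 1;

  if (jaccard > 0.9) return 'mostly-the-same';
  return 'mostly-different';
}
```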
Depending on the computed level of sameness and the application's preferences, any of several deduplication actions can then be taken:
Currently the code always performs the last one, without doing any comparison. A similarly easy, and perhaps more common, approach would be to hardcode the first one for every visit (i.e. also skipping the comparison and always overwriting the old version). But I think the sweet spot is often somewhere in the middle.
Note that any of the four actions can be taken for any level of sameness. Some combinations (e.g. action 1 or 2 when a page was significantly changed) will result in a loss of knowledge, but the application may find that acceptable.
I intend to first implement the easy parts. Storing e.g. compressed diffs is a nice idea, and so is creating an advanced perceived-sameness detection, but they can wait. As a beginning, it may be okay to always take action 2, i.e. replacing an old page's contents with a pointer to a newly created page object, so that only the most recent versions are stored. As an exception, action 4 (plain duplication) could be applied to pages that have been pinned/bookmarked manually or that match some other rules we may wish to specify (e.g. pages having an empty url path). Also, we could pragmatically decide to reuse an existing page object (let's call that action 0) if a url is revisited on the same day.
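Spelled out as code, that starting policy could look roughly like this; the helper arguments and the root-path check are just one possible interpretation of the rules above:

```js
// Sketch of the starting policy described above; action numbers follow the text:
// 0 = reuse existing page object, 2 = keep only the newest content, 4 = plain duplicate.
function chooseDeduplicationAction({ url, isPinnedOrBookmarked, lastVisitTime, now }) {
  const sameDay =
    new Date(lastVisitTime).toDateString() === new Date(now).toDateString();

  if (sameDay) {
    return 0; // revisited on the same day: reuse the existing page object
  }
  if (isPinnedOrBookmarked || new URL(url).pathname === '/') {
    return 4; // pinned/bookmarked or empty url path: keep a plain duplicate
  }
  return 2; // default: replace the old page's contents with a pointer to the new object
}
```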
This is more or less the idea as it has developed in my mind and that I have started implementing. Feedback is welcome. Maybe one of the Benjamins (@bwbroersma, @bigbluehat) already sees holes in the plan or knows of existing work to build on?
The above algorithm seems pretty good. What @oliversauter and I had discussed earlier was saving articles using hashes (MD5 etc.), so that each article gets a unique entry in the database. To handle multiple visits to the same article, an array of visit times is stored. I really like the diff-tree implementation of the page content. However, the database will have to be changed to create a link between the visit times and the corresponding keywords.
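A minimal sketch of that hash-based variant, assuming a PouchDB-style database where the content hash serves as the document key and revisits only push a timestamp (field names are illustrative):

```js
// Sketch only: uses the content hash as _id and appends visit times on revisits.
const crypto = require('crypto');

async function storeVisit(db, url, extractedText, visitTime) {
  const contentHash = crypto.createHash('md5').update(extractedText).digest('hex');
  try {
    const existing = await db.get(contentHash);
    existing.visits.push(visitTime); // same content seen again: just record the visit
    return db.put(existing);
  } catch (err) {
    if (err.status !== 404) throw err;
    return db.put({
      _id: contentHash, // first time this content is seen
      url,
      text: extractedText,
      visits: [visitTime],
    });
  }
}
```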
> To handle multiple visits to the same article, an array of visit times is stored.
I think this is the way you intend it in your current model, right @Treora?
Each visit is a separate object in the database, with a pointer to the page that was visited. I am not certain about using hashes as identifiers; it only makes sense if the data is not changing, while in the current model the page object can be updated with e.g. improved content analysis. Addressing immutable objects by hash is nice for content-addressable storage, and I am still racking my brain over whether and how to make use of systems like IPFS & IPNS for storing and/or publishing one's documents. This may have to wait a while though.
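For clarity, the visits-point-to-pages model from the first sentence above, sketched with placeholder field names:

```js
// Illustrative only: a visit is its own document that references a page document.
const page = {
  _id: 'page/abc123',
  url: 'https://example.com/article',
  content: { fullText: '...' },
};

const visit = {
  _id: 'visit/1488963600000',
  page: { _id: 'page/abc123' }, // pointer to the page that was visited
  visitStart: 1488963600000,    // timestamp of the visit
};
```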
Basic implementation provided in #60, closing this now. Improvements can be added as new issues.
When the same page is visited multiple times, it is silly to create a new copy in the database for every visit. A non-trivial question, however, is when to consider two pages to be the same:
Determining sameness can be done very crudely at first, and improved later to better detect the almost-the-same situations in case 3 above (and maybe case 4).