WebMemex / webmemex-extension

📇 Your digital memory extension, as a browser extension
https://webmemex.org

Simpler data model for page objects #101

Closed Treora closed 7 years ago

Treora commented 7 years ago

On the one hand, I like to store as much as possible of the available information, as it might be useful later. On the other hand, this practice creates complexity, and we end up with old fields lingering there.

Likewise, on the one hand I like to keep clear how each piece of information was obtained; it is nice to know whether information was extracted from the page itself, or inferred otherwise; e.g. to keep the URL we visited separate from the canonical url the page reports (and perhaps whether that came from a og:url meta tag, or from a rel=canonical link tag). On the other hand again, is it worth the complexity?
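(For illustration, extracting the canonical URL would amount to something like the sketch below; the helper name is made up, but the selectors are standard DOM.)

```js
// Illustrative helper (not the actual extension code): prefer an explicit
// rel=canonical link tag, fall back to the og:url meta tag.
function extractCanonicalUrl(doc = document) {
    const link = doc.querySelector('link[rel="canonical"]')
    if (link && link.href) return link.href
    const meta = doc.querySelector('meta[property="og:url"]')
    if (meta && meta.content) return meta.content
    return undefined
}
```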

I feel Readability was not worth the complexity. It is not reliable enough that we could avoid also storing the full body.innerText, which leads to a waste of space. And now that freeze-dry is implemented, we don't need it anymore for showing stored pages.

I then felt like stripping away more of the complexity and keeping just a few data fields, directly in the page object. I now somewhat arbitrarily chose: fullText, title, url, canonicalUrl, author, description, keywords. Data extracted from PDFs ends up in the same place as data from HTML pages (todo: add a type field).

Notice that url is the only one not extracted from the page itself (...should it only be stored in the visit?). Also, we can save space by not storing the fullText (= body.innerText) once we no longer rely on pouchdb-quick-search; the string can then be put into the search index and forgotten, and we can recover it from the frozen page if needed.
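For illustration, a page object under this model would look roughly like the sketch below (values made up, id format only a guess):

```js
// Rough sketch of the simplified, flat page object.
// Only `url` is not extracted from the page content itself.
const examplePage = {
    _id: 'page/1490000000000',          // hypothetical id format
    fullText: 'Lorem ipsum …',          // = body.innerText (or PDF text)
    title: 'Example page',
    url: 'https://example.com/article', // the URL we actually visited
    canonicalUrl: 'https://example.com/article',
    author: 'A. Author',
    description: 'Short description from the meta tags.',
    keywords: ['example', 'sketch'],
}
```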

@poltak: you are quite involved with the data storage model now; I would gladly hear your thoughts.

blackforestboi commented 7 years ago

Also, we can save space by not storing the fullText

Am I correct in my understanding that this only becomes a problem when we want to re-index? With a persistent on-disk index, where we can add new entries without reindexing, this might be less of a problem?

It seems to me that it boils down to how much space we save per page and how much the freeze-dry version takes. Even though we could get the text back from the freeze-dry version, we may not want to store such a version for every visited page if it takes too much storage per page. How much disk space does a freeze-dry version need compared to body.innerText?

...should it only be stored in the visit?

Can you expand a bit on your reasoning behind that option? Would the canonical url still be saved in the page object? In which cases is there a need for distinguishing between both versions of the URL?

Treora commented 7 years ago

this only becomes a problem when we want to re-index?

May be good to clarify, indeed. There is no problem if the text is still available from the stored (frozen) page. But if you don't store that either, then indeed you cannot reindex. So, we have these levels of storage:

  1. Store whole page (and index)
  2. Store only text (and index)
  3. Store no contents, but do index
  4. (forget the thing altogether)

One can start at any of these levels and can always demote items to a lower level on this list to save space. I am currently mostly thinking of using levels 1 and 4 to begin with, in order to keep things simple: you search through your personal web; you don't find things you don't have. However, levels 2 and 3 are definitely useful, at least to save space for documents that can be assumed to be available online/elsewhere, and later perhaps even just to find their context (e.g. linked pages).
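A hedged sketch of what demoting a page down these levels could look like, assuming PouchDB storage with the frozen page kept as an attachment called 'frozen-page.html' and the text in a fullText field (the names here are illustrative, not the actual implementation):

```js
// Illustrative only: demote a stored page one level to reclaim space.
// 1→2: drop the frozen page attachment, keep the extracted text.
// 2→3: drop the text as well, keeping only the index entry.
// 3→4: forget the document altogether.
async function demotePage(db, pageId) {
    const page = await db.get(pageId)
    if (page._attachments && page._attachments['frozen-page.html']) {
        return db.removeAttachment(pageId, 'frozen-page.html', page._rev)
    }
    if (page.fullText !== undefined) {
        delete page.fullText
        return db.put(page)
    }
    return db.remove(page)
}
```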

Then, briefly, the reasoning behind storing the url only in visit objects: it is now stored in both the visit object and the page object; we could deduplicate it so that all fields in the page object would be extracted content. (Only in the hypothetical case that one would aggressively deduplicate pages, such that visits to different urls point at the same page object, might it add value to keep the url field, to know the actual url that the page object data was extracted from.)
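To make that deduplication concrete, a sketch (field names illustrative) of a visit object carrying the URL, with the page object left to hold only extracted content:

```js
// Illustrative: the visit records where (and when) we were;
// the page object then contains only extracted content.
const exampleVisit = {
    _id: 'visit/1490000000000',
    visitStart: 1490000000000,
    url: 'https://example.com/article?ref=feed', // URL as actually visited
    page: { _id: 'page/1490000000000' },         // reference to the page object
}
```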

blackforestboi commented 7 years ago

@Treora Thanks for clarifying

if the text can still be read from the stored (frozen) page

Can it currently be read from the frozen page or not? A bit ambiguous because of the "if". How much space does a frozen page need currently?

Treora commented 7 years ago

Text can of course be read from a frozen page; the if-question is whether there is one. Updated my wording above to make this clearer.

How much space does a frozen page need currently?

Very page-dependent, have not analysed yet, feel free to do so. :)

blackforestboi commented 7 years ago

feel free to do so.

I checked the data field (pastebin) of the html attachment for this page here: https://www.globalcitizen.org/de/content/women-interrupted-men-app/

Checking the byte size of the data string, it says 0.01 MB. The pure HTML of the same page is 0.04 MB, and body.innerText is 0.006 MB.
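(A measurement like this can be done from the console with something along these lines, taking the byte size of the UTF-8 encoded string:)

```js
// Rough console snippet: byte size of a string in MB (UTF-8 encoded).
const sizeInMB = str => (new Blob([str]).size / (1024 * 1024)).toFixed(3) + ' MB'

sizeInMB(document.body.innerText)            // e.g. '0.006 MB' for the page above
sizeInMB(document.documentElement.outerHTML) // the pure HTML
```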

Text can of course be read from a frozen page; the if-question is whether there is one

That causes even more confusion :P What do you mean by "whether there is one"? Can we read it out of the data field inside the attachment, or does it have to be stored explicitly in the attachment (= whether there is one)? If we can read it out of the data field, is that performant when we re-index, given that we'd have to load and interpret the whole data string before getting the text out? For me it takes about 1 second just to display the string in the console and about 3 seconds to load the page when entering via the overview.
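(Concretely, "reading it out of the data field" would mean something like the sketch below, i.e. parsing the stored HTML string and taking its text; whether that is fast enough for re-indexing is exactly the open question.)

```js
// Sketch: recover the visible text from a stored HTML string,
// without attaching it to the live page.
function textFromFrozenHtml(htmlString) {
    const doc = new DOMParser().parseFromString(htmlString, 'text/html')
    return doc.body ? doc.body.textContent : ''
}
```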

Treora commented 7 years ago

@oliversauter: I hope you don't mind me suggesting to discuss your remaining questions elsewhere. The point here was the current data model simplification. I just wrote up some of the reasoning behind it, hoping that maybe @poltak would like to think along(?).

poltak commented 7 years ago

@Treora Yes, I got familiar with this yesterday, however I couldn't really think of any worthwhile discussion to add, apart from "it looks nicer/simpler" (?) 😕 It did prompt me to get familiar with the freeze-dry stuff though, which seems nice, even if only for the sake of not needing Readability (both here for extraction and for viewing) and being able to be more flexible with the direction that takes.

One thing that I noticed and was confused about at first was the reduced number of searchable fields now, although after looking more into some sample page docs and what those fields actually mean, it doesn't seem like it would impact the search feature in any seriously negative way.

So, sorry if I haven't added anything meaningful to this, but it seems nicer just in the way that it flattens out the structure of the page model and cleans it up overall. I see no major problems, nor have any questions about your reasons behind this; I like the change.

Although I agree with just removing those backwards compat lines in revisePageFields; no real point there IMHO

Treora commented 7 years ago

Merged, with a small change: after some thought I decided to put the extracted content in a nested object, content, because it seems cleaner to keep the content that was extracted from the page separate from (meta)data added by the application (e.g. a redirection link after deduplication). A reason against this was to keep an easy move to JSON-LD possible, but it looks like JSON-LD 1.1 introduces the possibility to @nest attributes.
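For illustration, the merged shape is roughly as below (values made up; the seeInstead field is just a hypothetical example of application-added metadata):

```js
// Rough sketch of the page object after the change: extracted content
// is nested under `content`, application-added (meta)data stays at the top.
const examplePage = {
    _id: 'page/1490000000000',                  // hypothetical id
    seeInstead: { _id: 'page/1490000000001' },  // hypothetical: redirection after deduplication
    content: {
        fullText: 'Lorem ipsum …',
        title: 'Example page',
        canonicalUrl: 'https://example.com/article',
        author: 'A. Author',
        description: 'Short description.',
        keywords: ['example'],
    },
}
```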