Title is set incorrectly, too much content is saved for infinite scroll blogs

blackforestboi commented 6 years ago

If reporting a bug:

Can you describe the problem and bug in more detail?

On blogs with infinite scroll, like qz.com, scrolling from one article to the next stores the wrong data: the next article is saved with the title of the previous article, and with all the content of that previous article as well.
How can we replicate the issue?

Option 1:
Go to qz.com
Pick an article
Go to the search overview
The title is "Quartz", not the title of the article

Option 2:

Go to a qz.com article directly
Scroll to the next article (see the URL change)
Wait for indexing
Go to the search overview
See 2 articles with the same title but the correct URLs
Search for a word from the first article
Get both results

Interestingly, the title seems to be indexed correctly only for the second article: if you search for words in the title of the second one, only that one appears in the results. The titles seem to be stored incorrectly in pouch, which is where the information comes from when rendering the results.

Expected behavior (i.e. solution)
Each URL should be displayed with its correct title.
The second article should not also index the words from the first article.
Error stack (from extension crash page)

No error message
Other comments

poltak commented 6 years ago

Regarding the terms content on sites like qz.com: the way those article pages work is that, as the user scrolls, the URL state is updated and the new article is appended to the DOM. Most dynamic sites instead replace much of the current DOM as the URL state changes (the concept of separate pages).

We extract this content from document.body, so as you scroll down the DOM grows, and the input to each subsequent page visit event grows with it. It may require we introduce some deduping-like check back into page visits.
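
To make the deduping idea concrete, here is a minimal sketch. It is not Memex's actual code: `newContentSince` is a hypothetical helper, and it assumes we keep the text extracted at the previous visit event and that infinite-scroll pages only append, so the old text is a prefix of the new text.

```typescript
// Hypothetical helper, not part of Memex: returns only the text that is new
// since the previous visit event on the same growing document.
function newContentSince(previousText: string, currentText: string): string {
    const prev = previousText.trim()
    const curr = currentText.trim()
    // Assumption: infinite-scroll pages append, so the old text is a prefix.
    if (prev.length > 0 && curr.slice(0, prev.length) === prev) {
        return curr.slice(prev.length).trim()
    }
    // Otherwise the DOM was replaced (ordinary navigation): keep everything.
    return curr
}

const firstVisitText = 'Article one body.'
const secondVisitText = 'Article one body. Article two body.'
console.log(newContentSince(firstVisitText, secondVisitText)) // "Article two body."
console.log(newContentSince(firstVisitText, 'Unrelated page.')) // "Unrelated page."
```

A plain prefix check like this would miss cases where the page also changes earlier content (ads, comment counts), so a real check would probably need to be fuzzier.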

blackforestboi commented 6 years ago

It may require we introduce some deduping-like check back into page visits.

Can we somehow detect that growth of the DOM and cut things off? E.g. by detecting the h1 tags or such?
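
Roughly what I have in mind, as a sketch only: it assumes each appended article starts with its own h1, and that the DOM has already been flattened into a list of blocks. `TextBlock`, `ArticleSegment` and `splitByHeading` are made-up names for illustration, not existing Memex code.

```typescript
// Hypothetical shape for a flattened DOM walk: one entry per block element.
interface TextBlock {
    tag: string // lowercase tag name, e.g. "h1" or "p"
    text: string
}

interface ArticleSegment {
    title: string
    body: string
}

// Starts a new segment at every h1, so each article keeps its own title and text.
function splitByHeading(blocks: TextBlock[]): ArticleSegment[] {
    const segments: ArticleSegment[] = []
    for (const block of blocks) {
        if (block.tag === 'h1') {
            segments.push({ title: block.text, body: '' })
        } else if (segments.length > 0) {
            const current = segments[segments.length - 1]
            current.body = current.body ? current.body + ' ' + block.text : block.text
        }
        // Blocks before the first h1 (site header, nav) are dropped.
    }
    return segments
}

const scrolledPage: TextBlock[] = [
    { tag: 'nav', text: 'Quartz' },
    { tag: 'h1', text: 'First article' },
    { tag: 'p', text: 'First body.' },
    { tag: 'h1', text: 'Second article' },
    { tag: 'p', text: 'Second body.' },
]
const articleSegments = splitByHeading(scrolledPage)
// The last segment would be the article the user just scrolled to.
console.log(articleSegments[articleSegments.length - 1])
```

This breaks down on sites that use several h1 tags per article or none at all, so it would only be a heuristic.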

poltak commented 6 years ago

It is certainly possible to observe DOM changes, but it would be a non-trivial task, as would coming up with ways to quantify and diff the DOMs to filter out old data. Deduping the data is probably a more feasible strategy, but it adds a level of complexity to all page visits if we bring something like that back in.

The Time and NBC news sites you linked don't seem to have the same dynamic scrolling behaviour as the Quartz site. Are there specific articles you can link that have it? Maybe I didn't find them.

On 3 Jan 2018, at 17:26, Oliver Sauter notifications@github.com wrote:

It may require we introduce some deduping-like check back into page visits.

Can we somehow detect that growth of the DOM and cut things off? E.g. by detecting the h1 tags or such?

blackforestboi commented 6 years ago

Closing this for now, as parts of it are fixed. Still open:

[ ] Separating the content of infinite scroll pages

WorldBrain / Memex

Title is set incorrectly, too much content is saved for infinite scroll blogs #225