Improve Just Read's selection algorithm

ZachSaucier commented 3 years ago

Just Read's current content selection algorithm (found in the getContainer function) can be summarized as:

1. Get the paragraph tag with the most words on the page. This tends to be a paragraph in the article's content.
2. Count how many words are on the page in total.
3. Crawl up the parent tree of the paragraph (obtained in step 1) until at least 2/5ths of the total words (obtained in step 2) have been reached.
4. Use the container that contains at least 2/5ths of the total words as the article container.
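
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual getContainer code: the `node` helper, `countWords`, `findAll`, and `selectContainer` are hypothetical names, and plain objects stand in for DOM elements so the sketch runs outside a browser.

```javascript
// Hypothetical minimal tree node standing in for a DOM element.
function node(tag, text = "", children = []) {
  const n = { tag, text, children, parent: null };
  for (const child of children) child.parent = n;
  return n;
}

// Count whitespace-separated words in a node's own text plus all descendants.
function countWords(n) {
  const trimmed = n.text.trim();
  const own = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  return n.children.reduce((sum, child) => sum + countWords(child), own);
}

// Collect every node with the given tag, depth first.
function findAll(n, tag, out = []) {
  if (n.tag === tag) out.push(n);
  for (const child of n.children) findAll(child, tag, out);
  return out;
}

// Assumed default threshold of 2/5, taken from the description above.
function selectContainer(root, threshold = 2 / 5) {
  const paragraphs = findAll(root, "p");
  if (paragraphs.length === 0) return root;

  // Step 1: the paragraph with the most words.
  let best = paragraphs[0];
  for (const p of paragraphs) {
    if (countWords(p) > countWords(best)) best = p;
  }

  // Step 2: total words on the page.
  const total = countWords(root);

  // Steps 3 and 4: climb until the container holds enough of the total.
  let container = best;
  while (container.parent && countWords(container) < threshold * total) {
    container = container.parent;
  }
  return container;
}

// Usage: the longest paragraph alone has 6 of 17 words, so the climb stops at its parent.
const article = node("div", "", [
  node("p", "one two three four"),
  node("p", "five six seven eight nine ten"),
]);
const page = node("body", "", [node("nav", "a b c"), article, node("footer", "w x y z")]);
console.log(selectContainer(page) === article); // true
```

The sketch also makes the failure modes listed below easy to see: a long comment can win step 1, and a page with many words outside the article pushes the climb further up the tree than intended.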

I also do a manual removal of some irrelevant items after the content has been selected (which is what hide-segments.css does).

Just Read's selection algorithm works well on the majority of web pages but falls short in a few circumstances. Sometimes it:

Selects the wrong content, which is especially common where there are long comments, a lot of comments, or multiple articles on a page.
Selects only a portion of the content, which is especially common when an article is broken down into sub-sections.
Is overly inclusive, which is especially common when there are a lot of words on a page outside of the article content.

There is definitely room for improvement here, but it's a non-trivial problem due to the poor and highly varied markup that webpages have. Any suggestions for how Just Read's content selection algorithm could be improved?

If you want a list of websites to test with, please try the problematic articles in the issues listed above.

Additionally, try sites with poor markup like:

https://ibhof.blogspot.com/2020/01/the-economics-of-pirate-radio.html

Ahmed-Ali commented 3 years ago

I am working on a side-project that would benefit from a smart content selection algorithm. I tried several existing ones, and tried to mimic Firefox Readability's algorithm (I had to rewrite it in a different language back then). However, my finding so far is that it is not a problem that can be solved by a "one-size-fits-all" solution; at least not without a very well trained machine learning model (which would require a massive amount of data, and I am not really an expert in that field).

The most reasonable approach I landed on is:
1. Start with the best existing algorithm. For the type of websites I am looking at, mostly tech websites, your algorithm seems to work the best. (Firefox's aggressively cuts legit content, and Google's Dom-Distiller similarly falls short on websites like O'Reilly; the fact that it is written in Java also makes it a bit limiting.)
1.a. Maybe use more than one content selection algorithm if some of them perform better than others for different sets of websites.
2. Use manually added configs per website (i.e. have a map between each website and a predefined clean-up configuration).
3. When trying to remove the boilerplate from a website, look up the predefined set of configs first. If one exists, use it. If not, use the content selection algorithm.

The predefined set can be as simple as a JSON object describing which tags to remove, which tag to consider as the main container, etc.
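
As a rough illustration of the "config first, algorithm second" lookup described above, here is a minimal sketch. Everything in it is an assumption made for the example: the `siteConfigs` map, its `container` and `remove` fields, and the `lookupConfig` and `chooseStrategy` helpers are hypothetical and are not taken from Just Read or any other tool.

```javascript
// Hypothetical per-site configs keyed by hostname; field names are illustrative only.
const siteConfigs = {
  "example.com": { container: "article.post", remove: [".ads", ".comments"] },
};

// Return the predefined config for a URL if one exists, otherwise null.
function lookupConfig(url, configs = siteConfigs) {
  const host = new URL(url).hostname.replace(/^www\./, "");
  return Object.prototype.hasOwnProperty.call(configs, host) ? configs[host] : null;
}

// Prefer the predefined config; fall back to an automatic selection function.
function chooseStrategy(url, autoSelect, configs = siteConfigs) {
  const config = lookupConfig(url, configs);
  return config
    ? { source: "config", config }
    : { source: "auto", result: autoSelect() };
}

// Usage: a known site uses its config, an unknown site falls back to auto-selection.
console.log(chooseStrategy("https://www.example.com/post/1", () => "auto result").source);
console.log(chooseStrategy("https://unknown.example.org/", () => "auto result").source);
```

Keying on the hostname keeps the map small, but it assumes one layout per site; a site with several templates would need path patterns as well.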

ZachSaucier commented 3 years ago

Thanks for the feedback, Ahmed!

The difficulty of using different algorithms or combining different ones is that it's very hard to determine whether or not or under which circumstances to combine or switch out the method being used. If you have any insight as to how to do that with success please let me know 🙂

As for manual configs per website, Just Read Premium actually has that functionality built into it. You can specify selectors for the content, header image, title, etc. on a per-site basis.

As for using a boilerplate/standardization, the first approach I took for Just Read's selection algorithm was one that used the "standard" meta information and semantic elements. I quickly found that people abused that or didn't follow the specifications, so I spent time coming up with the approach that Just Read has now. If it could be combined with the current approach that could be helpful, but again, it's hard to combine approaches 🙂

Ahmed-Ali commented 3 years ago

> The difficulty of using different algorithms or combining different ones is that it's very hard to determine whether or not or under which circumstances to combine or switch out the method being used

I was thinking more of giving the user the option to switch between them manually to see what works best for them. With time, one can gather some data (i.e. some counters) to do some automated ranking to choose the default choice. The challenge will be to choose the right UX to avoid confusing the user.

> As for manual configs per website, Just Read Premium actually has that functionality built into it. You can specify selectors for the content, header image, title, etc. on a per-site basis.

That sounds like what I am thinking of, and maybe you can also take it one step further by sharing these configurations after approving them manually.

For example:

A user opens website X and does a manual configuration.
The configuration, alongside some relevant information (mainly the webpage address), is sent to some backend endpoint.
You review the configuration, modify it as needed, and set it as the default configuration for this website.

Next time a user browses the same webpage, they won't need to do the same thing manually, as they are likely to be satisfied with the default configuration.
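
To make the flow above concrete, here is a small sketch of the submit, review, and reuse steps. It is only a sketch under assumptions: `createConfigRegistry` and its `submit`, `approve`, and `defaultFor` methods are hypothetical names, and an in-memory store stands in for the backend endpoint.

```javascript
// In-memory stand-in for the backend; a real service would persist these.
function createConfigRegistry() {
  const pending = [];         // submissions awaiting manual review
  const defaults = new Map(); // hostname -> approved default config

  return {
    // A user's manual config arrives together with the page address.
    submit(pageUrl, config) {
      const submission = {
        id: pending.length,
        host: new URL(pageUrl).hostname,
        config,
        approved: false,
      };
      pending.push(submission);
      return submission.id;
    },

    // The reviewer optionally edits the config, then makes it the site default.
    approve(id, edits = {}) {
      const submission = pending[id];
      if (!submission) throw new Error(`unknown submission ${id}`);
      submission.approved = true;
      defaults.set(submission.host, { ...submission.config, ...edits });
    },

    // Later visits: the approved default, or undefined if none exists yet.
    defaultFor(pageUrl) {
      return defaults.get(new URL(pageUrl).hostname);
    },
  };
}

// Usage: nothing is shared until the submission has been reviewed.
const registry = createConfigRegistry();
const submissionId = registry.submit("https://blog.example.com/post/1", { container: "main" });
console.log(registry.defaultFor("https://blog.example.com/post/2")); // undefined
registry.approve(submissionId, { remove: [".sidebar"] });
console.log(registry.defaultFor("https://blog.example.com/post/2").container); // "main"
```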

ZachSaucier commented 3 years ago

> I was thinking more of giving the user the option to switch between them manually to see what works best for them. With time, one can gather some data (i.e. some counters) to do some automated ranking to choose the default choice. The challenge will be to choose the right UX to avoid confusing the user.

Yeah, I don't think this is a great option. Just Read already has a user selection mode which lets users select the exact content they want to read. Making them switch between auto-selections will likely only bring confusion and be more work for the user.

> maybe you can also take it one step further by sharing these configurations after approving them manually.

Interesting suggestion. At the very least this might make sense for any larger websites that Just Read fails on. A downside that I can think of at this point is that selectors change over time, so it could end up being worse in some cases if that happens. But that probably won't happen often. I'll have to spend some more time thinking about this.

ZachSaucier commented 3 years ago

I came across this article which has a summary of some different methods of selecting article content on the web. Sadly it doesn't have any direct comparisons of the accuracy or performance of the methods. It also doesn't have ready-made implementations which would have been super helpful.

Ahmed-Ali commented 3 years ago

I know of a couple of them. The arc90 one lines up with Firefox's Readability (it is open source, and there are several implementations online). However, I find it too aggressive (or at least the existing implementations are), as they remove images unreasonably from legit content. Boilerpipe is implemented by Google in their Dom Distiller repo, and it is my second favorite so far. My only issue with it is that it is written in Java (which makes it hard to port without a JVM, i.e. it can't work natively on iOS), and on rare occasions it also cuts out legit content (specifically from good content like O'Reilly pages).

The perfect algorithm is yet to be found :)

bradmurray commented 2 years ago

It seems like any page on axios.com is just coming up blank in Firefox.

example: https://www.axios.com/progress-cop26-glasgow-methane-deforestation-8b5ace1f-41e7-4a98-a257-8e560de0e1e4.html

ZachSaucier commented 2 years ago

> It seems like any page on axios.com is just coming up blank in Firefox.
>
> example: https://www.axios.com/progress-cop26-glasgow-methane-deforestation-8b5ace1f-41e7-4a98-a257-8e560de0e1e4.html

Those are some extremely short articles... Hence Just Read's algorithm not recognizing them as articles.

That site does have an interesting meta attribute that I don't remember seeing before: articleBody. I can update Just Read to use that attribute if it exists, which would fix this case.

ZachSaucier commented 2 years ago

FYI Just Read will now check if an articleBody meta element exists after the user's selection but before Just Read's "normal" auto-selection.
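
The resulting priority order can be pictured as a simple fallback chain. The sketch below is an illustration only, not Just Read's code: `pickContent`, `selectionOrder`, and the `userSelection`, `articleBody`, and `autoSelected` fields on the page object are hypothetical stand-ins, since the thread does not show how each source is actually detected.

```javascript
// Try each selection method in order and return the first one that yields content.
function pickContent(page, selectors) {
  for (const { name, select } of selectors) {
    const result = select(page);
    if (result != null) return { method: name, result };
  }
  return null;
}

// Assumed order from the comment above: user selection, then articleBody, then auto-selection.
const selectionOrder = [
  { name: "user-selection", select: (page) => page.userSelection ?? null },
  { name: "articleBody", select: (page) => page.articleBody ?? null },
  { name: "auto-selection", select: (page) => page.autoSelected ?? null },
];

// Usage: a short article with an articleBody is picked before auto-selection runs.
const shortArticle = { articleBody: "A very short article.", autoSelected: "Whole page text" };
console.log(pickContent(shortArticle, selectionOrder).method); // "articleBody"
```

Keeping the order in a list makes it cheap to add or reorder sources later without touching the loop.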

ZachSaucier commented 6 months ago

Just Read's auto-selection should get significantly better in the next version! I am switching to use Mozilla's Readability.js for auto-selection. As such, I'm closing this issue as completed.

Readability.js did a better job of auto-selecting the content on these articles:

Both selection methods did badly on:

https://www.thegospelcoalition.org/article/7-steps-to-conflict-resolution/

ZachSaucier / Just-Read

Improve Just Read's selection algorithm #331