Add content scrape - Githubissues

Areskiko commented 8 months ago

Adds the ability to scrape the entry link for additional content, using the crate developed by News Flash.

Unfortunately there is no sync version of it, and it demands the usage of reqwest, so some additional dependencies have had to be added.

Currently pressing s while focused on an entry with a link will cause the content of the entry to be temporarily altered with the content scraped from the source.

Areskiko commented 8 months ago

If there is a large enough interest, I could look into rewriting the article_scraper crate, to avoid the extra dependencies

ckampfe commented 5 months ago

Thank you for this contribution @Areskiko. I appreciate it. This is definitely interesting. It's one of the weird quirks of RSS/Atom that sites will sometimes include a brief summary or even nothing at all in the content/description of the feed, forcing us to go to the site directly.

It seems like this a feature that people would appreciate!

There are some quirks to this problem/feature that I want to think about a bit as we investigate this further.

The first is that this article-scraper crate makes use of an external scraping configuration that we (russ) do not control. In terms of thinking about the security and overall safety of russ's users, I'm suspicious of this kind of implicit/transitive dependency on community-contributed configuration. I will have to investigate further exactly how this community contributed scraping configuration interacts both with the article-scraper library and how the article-scraper library itself is managed.

Secondly, and you touched on this in your comments above, article-scraper and reqwest require the use of tokio. This is an interesting issue, as I have played with converting russ to use async in the past, but never really gotten around to it in a way I found satisfactory. In fact, russ actually did use tokio at one point, and I ripped it out because it felt unnecessary for what russ was trying to be. I have nothing against tokio and reqwest in principle and in fact think they are high quality, but we do not currently use them and using 1. an async runtime in a threaded application and 2. using two different HTTP clients in the same application feels a bit weird. Again, this is not a dealbreaker necessarily, but there are a few things to evaluate in this regard: are there other "extractor"-style crates that simple can look at a given string of HTML without needing to pull in a network client? Does it make sense to try to rewrite some/all of russ to use tokio and async? I don't know the answer to these questions right now, but I will work to evaluate them.

Feel free to respond if you like, I'm happy to continue discussing this to move it forward, with the understanding that I don't feel comfortable moving forward with this specific work at this moment in time given my above concerns.

Again, thank you very much for the excellent contribution.

ckampfe / russ

Add content scrape #34