[Feature] Some sites block scraping content without javascript.

FreshRSS / FreshRSS

A free, self-hostable news aggregator…

https://freshrss.org

GNU Affero General Public License v3.0


[Feature] Some sites block scraping content without javascript. #6447

Closed by sherlcok314159 3 weeks ago

sherlcok314159 commented 1 month ago

Some sites cannot be scraped without JavaScript. I tried different user agents, such as curl/8.21, and all of them failed.

Site: https://rsshub.app/zhubai/posts/havefun

Alkarex commented 1 month ago

You can try https://github.com/lwthiker/curl-impersonate/, which sometimes helps. Otherwise you will need a more sophisticated system.
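To check by hand whether curl-impersonate gets past the block, one option is to call one of its wrapper scripts from Python. This is only a minimal sketch: the wrapper name `curl_chrome116` is an assumption (the available wrappers depend on the curl-impersonate release you installed), and `build_command` and `fetch` are hypothetical helper names.

```python
import shutil
import subprocess

# Assumption: name of a curl-impersonate wrapper script on your PATH.
# Adjust it to whatever your installation actually provides.
DEFAULT_WRAPPER = "curl_chrome116"


def build_command(url, wrapper=DEFAULT_WRAPPER):
    """Return the argument list that fetches `url` quietly and follows redirects."""
    return [wrapper, "--silent", "--location", url]


def fetch(url, wrapper=DEFAULT_WRAPPER, timeout=30):
    """Run the wrapper script and return the response body as text."""
    if shutil.which(wrapper) is None:
        raise FileNotFoundError(f"{wrapper} was not found on PATH")
    result = subprocess.run(
        build_command(url, wrapper),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Calling `fetch("https://rsshub.app/zhubai/posts/havefun")` and inspecting the start of the returned text should show quickly whether the site answers with feed content or with a block page.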

sherlcok314159 commented 1 month ago

Thanks. But how can I combine this with FreshRSS?

Alkarex commented 1 month ago

A typical way is to use a system such as RSS Bridge, which outputs an RSS feed that FreshRSS can consume. But the first step is to find an approach that works manually.
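To illustrate the "outputs an RSS feed" part: once some approach retrieves the posts, they need to be wrapped in a feed that FreshRSS can subscribe to. The sketch below is not how RSS Bridge does it internally; it is a hypothetical helper, `build_rss`, that uses only the Python standard library to emit a minimal RSS 2.0 document. The item dictionary keys are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from email.utils import format_datetime


def build_rss(title, link, description, items):
    """Build a minimal RSS 2.0 document and return it as UTF-8 bytes.

    `items` is a list of dicts with the keys "title" and "link", plus the
    optional keys "description" (str) and "published" (timezone-aware datetime).
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description

    for entry in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["title"]
        ET.SubElement(item, "link").text = entry["link"]
        # Use the link as a stable identifier so readers can de-duplicate items.
        ET.SubElement(item, "guid").text = entry["link"]
        if "description" in entry:
            ET.SubElement(item, "description").text = entry["description"]
        if "published" in entry:
            # RSS 2.0 expects RFC 822 style dates.
            ET.SubElement(item, "pubDate").text = format_datetime(entry["published"])

    return ET.tostring(rss, encoding="utf-8", xml_declaration=True)
```

The returned bytes can be written to a file served by any web server, and that URL can then be added to FreshRSS like a normal feed.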

squromiv commented 3 weeks ago

> Some sites can not be scraped without javascript

Try the feedless tool. It can help in some cases.

sherlcok314159 commented 3 weeks ago

Thanks for the replies above. My solution is to use a local headless browser driven from Python. It is quite lightweight.
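The post does not say which library was used, so the following is only a sketch of what a headless-browser fetch in Python can look like, assuming Playwright (`pip install playwright` followed by `playwright install chromium`). `render_page` and `save_snapshot` are hypothetical names, and the timeout is an arbitrary default.

```python
from pathlib import Path


def render_page(url, timeout_ms=30_000):
    """Load `url` in headless Chromium and return the HTML after JavaScript has run."""
    # Imported lazily so this module can be imported without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait until network activity settles so script-generated content is present.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()


def save_snapshot(url, path, render=render_page):
    """Render `url`, write the result to `path`, and return the number of characters."""
    html = render(url)
    Path(path).write_text(html, encoding="utf-8")
    return len(html)
```

`save_snapshot` takes the renderer as a parameter so that a different browser driver can be swapped in. The saved page still has to be turned into a feed before FreshRSS can consume it.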