feediron / ttrss_plugin-feediron

Evolution of ttrss_plugin-af_feedmod
https://discourse.tt-rss.org/t/plugin-update-feediron-v1-2-0/2018
MIT License
206 stars, 34 forks

Add pause / wait to load before scrape #111

Closed: JReming85 closed this issue 4 years ago

JReming85 commented 6 years ago

Expected Behavior

I am rewriting certain URLs to go to outline.com/https://website.com

However, outline.com takes a few moments to clean up the page (bypassing paywalls, etc.) and display the result. Is there any way to hold off the scrape until it has finished loading?

Current Behavior

Scrapes the loading page

Steps to Reproduce

URL - https://www.wsj.com/articles/the-nfls-best-players-are-getting-richer-than-ever-1536163544

{
  "type": "xpath",
  "xpath": [
    "div[@class='article-wrapper']"
  ],
  "reformat": [
    {
      "type": "regex",
      "pattern": "\/.+.com\/",
      "replace": "https:\/\/outline.com\/https:\/\/wsj.com"
    }
  ]
}
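To make the effect of the regex rule concrete, here is a small Python sketch of the rewrite applied to the URL above. It is not FeedIron code. It assumes the leading and trailing "/" in the pattern act as PCRE delimiters once the JSON is decoded, leaving `.+.com` as the actual expression; `rewrite_link` is a hypothetical helper name used only for illustration.

```python
import re

# Assumption: after JSON decoding the pattern is "/.+.com/", and the outer
# slashes are regex delimiters, so the expression itself is ".+.com".
PATTERN = r".+.com"
REPLACE = "https://outline.com/https://wsj.com"


def rewrite_link(link: str) -> str:
    """Hypothetical helper: apply the rule's regex replacement to a link."""
    return re.sub(PATTERN, REPLACE, link)


if __name__ == "__main__":
    url = ("https://www.wsj.com/articles/"
           "the-nfls-best-players-are-getting-richer-than-ever-1536163544")
    print(rewrite_link(url))
```

Under that assumption, everything up to and including "wsj.com" is replaced, and the article path is kept after the outline.com prefix.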

dugite-code commented 6 years ago

Currently there is no way to add a delay to the HTML body fetch. I have hacked php-curl into FeedIron in the past by adding it to the function at line 271. That said, I'm not 100% sure you could get the desired result from curl.
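For illustration only, the general shape of a delayed fetch might look like the Python sketch below. It is not FeedIron code and not the php-curl hack mentioned above. `fetch_when_ready`, `fetch`, `is_ready`, and the default attempt and delay values are all hypothetical. The idea is to re-request the page, pausing between tries, until the wanted element shows up in the body.

```python
import time
from typing import Callable, Optional


def fetch_when_ready(
    fetch: Callable[[], str],
    is_ready: Callable[[str], bool],
    attempts: int = 5,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() until is_ready(body) is true or the attempts run out.

    Returns the first body that passes the check, or None if none does.
    """
    for attempt in range(attempts):
        body = fetch()
        if is_ready(body):
            return body
        # Pause before the next try, but not after the final one.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return None


if __name__ == "__main__":
    # Simulated responses: two loading pages, then the real article markup.
    pages = iter([
        "<p>Loading...</p>",
        "<p>Loading...</p>",
        "<div class='article-wrapper'>Article text</div>",
    ])
    body = fetch_when_ready(
        fetch=lambda: next(pages),
        is_ready=lambda html: "article-wrapper" in html,
        delay_seconds=0.1,
    )
    print(body)
```

A loop like this only helps if the server eventually returns the finished page to a plain HTTP request. If the content is filled in by JavaScript in the browser, every retry would still see the loading page, which is the same doubt raised about curl above.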

The other idea I had been working on, but have put on hold for the moment, is the one I mentioned in #38: adding the ability to call phantomjs or selenium. But these are potentially complex and will require significant reworks of the codebase to integrate. I might revisit them when I can break configs in version 2.