RSS-Bridge / rss-bridge

The RSS feed for websites missing it
https://rss-bridge.org/bridge01/
The Unlicense

Adopt WebDriverAbstract as a solution for active (JavaScript) websites #3970

Open hleskien opened 9 months ago

hleskien commented 9 months ago

Hello everyone,

A few weeks ago I ran into the well-known problem that rss-bridge doesn't work for some websites. This is always the case when the website loads content via XMLHttpRequest (XHR) and/or changes it using JavaScript. After some experimentation and thinking about the problem, I came to the conclusion that it can only be solved with a real web browser. Coincidentally, I already had a lot of experience with Selenium and the WebDriver protocol, so I decided to come up with a solution.

I now suggest that the WebDriverAbstract class be imported into the project, temporarily in its own branch. As examples, I have written two bridges that use it.

The following problems still need to be solved:

Kind regards,

Holger

dvikan commented 9 months ago

hello thanks for issue and pr. i can help you make this happen

I didn't commit my composer.lock because the changes look random and I didn't want to invest time (yet) to understand composer.

At the moment, end users do not use composer and/or composer.json/composer.lock. End users simply extract rss-bridge into a folder and everything works out of the box. This has been the standard way of php since the 90s.

The current usage of composer is to fetch developer dependencies such as phpunit and phpcs (linter).

The current solution for third-party dependencies is to commit them to the repo in the vendor folder.

* [ ]  I have added the php-webdriver dependencies to `bootstrap.php`, but they are not complete. They look messy and unsorted because of the dependencies between them. I suggest using the autoloader instead.

We can commit this dependency to vendor folder manually, or we can start to require end users to use composer for installation.

* [ ]  All bridges that depend on WebDriverAbstract would create a high(er) load on a multi-user (open) server. Maybe there should be an option to disable the class globally rather than disabling each bridge that uses it.

Should have a setting to disable this. Or some way to quickly disable these resource heavy bridges.

* [ ]  Because loading web pages through a browser creates a much heavier load, it would make sense to stop scraping as soon as an element is already in a cache (assuming the elements are sorted chronologically). Is there already a cache for elements that have already been scraped? How can I use it? I looked around the code but didn't find any examples of this, or any documentation.

The output of bridges is cached for 1 hour by default. Configurable per bridge with const CACHE_TIMEOUT.

In addition, we have a cache for more fine-grained stuff. It is a field on BridgeAbstract and can be accessed like this:

$this->set('foo', 'bar');
$val = $this->get('foo');

See https://github.com/RSS-Bridge/rss-bridge/blob/master/lib/CacheInterface.php
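
As an illustration only, here is a minimal sketch of how a WebDriver-based bridge might use this to stop scraping early. It assumes only the set()/get() calls shown above and const CACHE_TIMEOUT; the bridge name, the scrapeItemsNewestFirst() helper and the 'last_seen_uri' key are made up for the example.

<?php

class ExampleWebDriverBridge extends BridgeAbstract
{
    const CACHE_TIMEOUT = 3600; // cache the generated feed for one hour

    public function collectData()
    {
        // URI of the newest item produced on the previous run, if any.
        $lastSeen = $this->get('last_seen_uri');

        foreach ($this->scrapeItemsNewestFirst() as $item) {
            if ($item['uri'] === $lastSeen) {
                break; // everything from here on was already scraped last time
            }
            $this->items[] = $item;
        }

        if (isset($this->items[0]['uri'])) {
            $this->set('last_seen_uri', $this->items[0]['uri']);
        }
    }

    private function scrapeItemsNewestFirst(): array
    {
        // Placeholder: the real bridge would drive the browser here.
        return [];
    }
}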

* [ ]  What do you find missing in the documentation for WebDriverAbstract?

Let's postpone docs. Let's make it work first, then write good docs.

Could you please write Linux/Debian non-Docker instructions on how to set up the Selenium server on localhost?

hleskien commented 9 months ago

Could you please write Linux/Debian non-Docker instructions on how to set up the Selenium stuff?

For development? That's easy: just install the Debian package chromium-driver (plus chromium). Since Chromium is the base of Chrome, it is fully compatible.

A full Selenium Grid for production would be more complicated.

dvikan commented 9 months ago

recent commits in master allow installing php-webdriver/webdriver using composer:

composer require --update-no-dev php-webdriver/webdriver

or if you also want dev deps:

composer require php-webdriver/webdriver

dvikan commented 9 months ago

$ sudo apt install chromium-driver
$ chromedriver --port=4444 --verbose
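
Not part of the project, just a hedged smoke test: with chromedriver listening on port 4444 as above, a minimal php-webdriver script to verify the connection could look like this (the autoload path assumes a Composer install, and the URL is only an example).

<?php

require __DIR__ . '/vendor/autoload.php';

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;

// chromedriver speaks the WebDriver protocol itself, so no Selenium server is needed.
$driver = RemoteWebDriver::create('http://localhost:4444', DesiredCapabilities::chrome());
$driver->get('https://example.com');
echo $driver->getTitle(), PHP_EOL;
$driver->quit();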

hleskien commented 9 months ago

At the moment, end users do not use composer and/or composer.json/composer.lock. End users simply extract rss-bridge into a folder and everything works out of the box. This has been the standard way of php since the 90s.

The current usage of composer is to fetch developer dependencies such as phpunit and phpcs (linter).

The current solution for third-party dependencies is to commit them to the repo in the vendor folder.

Ah, I assumed that this was due to performance considerations. I've been using PHP autoloaders since the early 2010s and I think it's a good idea, especially to avoid dependency hell.

I am against checking in libraries, partly because of the duplication, but also because then you are responsible for keeping this code up to date, especially if there are security holes.

We can commit this dependency to vendor folder manually, or we can start to require end users to use composer for installation.

How do end users use rss-bridge? I would think that most download the Docker image, a much smaller proportion build the Docker image themselves, and the even smaller remainder use the repository via a web server. Those remaining users should also be able to install the necessary packages locally and invoke composer manually.

Should have a setting to disable this. Or some way to quickly disable these resource heavy bridges.

At the moment I don't know enough of the source code to write a patch for it.

The output of bridges is cached for 1 hour by default. Configurable per bridge with const CACHE_TIMEOUT.

In addition, we have a cache for more fine-grained stuff. It is a field on BridgeAbstract and can be accessed like this:

Thank you. I will try this for GULPProjekteBridge.

dvikan commented 9 months ago

I think it would be very nice to have a function, e.g. getContentsWithWebDriver(), that opens the page, waits for it to load and then dumps the HTML. Example:

<?php

class BearBlogBridge extends BridgeAbstract
{
    const NAME = 'BearBlog (bearblog.dev)';

    public function collectData()
    {
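        // getContentsWithWebDriver() is the proposed helper: render the page in a real browser and return its DOM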
        $dom = getContentsWithWebDriver('https://herman.bearblog.dev/blog/');
        foreach ($dom->find('.blog-posts li') as $li) {
            $a = $li->find('a', 0);
            $this->items[] = [
                'title' => $a->plaintext,
                'uri' => 'https://herman.bearblog.dev' . $a->href,
            ];
        }
    }
}

This very basic function could solve lots of problems with dynamic DOM.

Is this possible?

dvikan commented 9 months ago

How do end users use rss-bridge? I would think that most download the Docker image, a much smaller proportion build the Docker image themselves, and the even smaller remainder use the repository via a web server. Those remaining users should also be able to install the necessary packages locally and invoke composer manually.

I'm not actually sure, but most probably use the Docker image from Docker Hub.

Should have a setting to disable this. Or some way to quickly disable these resource heavy bridges.

At the moment I don't know enough of the source code to write a patch for it.

Don't worry about it.

hleskien commented 9 months ago

This very basic function could solve lots of problems with dynamic DOM.

Is this possible?

Yes, that would certainly make things easier, but unfortunately, as far as I know, there are no methods for doing this. The DOM is like a fluid in motion. Turning it back into HTML would be possible, but also difficult.

Scraping a webpage using the WebDriver is simply different.

dr0id123 commented 9 months ago

This very basic function could solve lots of problems with dynamic DOM.

Is this possible?

Following this thread with great interest. I'm no programmer, but have you seen this article?

https://www.zenrows.com/blog/selenium-php#parse-the-data-you-want

There is code for scraper.php which does exactly what is being asked here -- it returns the HTML of the page (e.g., $html = $driver->getPageSource();).

hleskien commented 9 months ago

There is code for scraper.php which does exactly what is being asked here -- it returns the HTML of the page (e.g., $html = $driver->getPageSource();).

getPageSource() returns the HTML source loaded into the browser, but not the DOM. The DOM turned back into HTML is what we are interested in. The rest of the article does what I did.

dr0id123 commented 9 months ago

Thanks for clarifying.

I'm attempting to try this out, but unfortunately I'm new to this.

I try to run one of these bridges and get the error:

Type: Error Code: 0 Message: Class 'Facebook\WebDriver\Remote\RemoteWebDriver' not found File: lib/WebDriverAbstract.php Line: 104

I gather this is related to the dependency points above. I realize this isn't a support thread, but any documentation or help would be appreciated.

hleskien commented 9 months ago

Type: Error Code: 0 Message: Class 'Facebook\WebDriver\Remote\RemoteWebDriver' not found File: lib/WebDriverAbstract.php Line: 104

That is a dependency error. The main branch is currently broken with regard to WebDriver. I will write a fix as soon as I have time. As a quick fix, you can append the following to the $files array in lib/bootstrap.php (after line 27):

    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverSearchContext.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriver.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/JavaScriptExecutor.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverHasInputDevices.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/RemoteWebDriver.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverCapabilities.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/WebDriverCapabilityType.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/WebDriverBrowserType.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverPlatform.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/DesiredCapabilities.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Chrome/ChromeOptions.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Firefox/FirefoxOptions.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverCommandExecutor.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/DriverCommand.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/HttpCommandExecutor.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Local/LocalWebDriver.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Firefox/FirefoxDriver.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/WebDriverCommand.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/PhpWebDriverExceptionInterface.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/Internal/UnexpectedResponseException.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/Internal/WebDriverCurlException.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/WebDriverResponse.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverOptions.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/ExecuteMethod.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/RemoteExecuteMethod.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverWindow.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverWait.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverExpectedCondition.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverBy.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/JsonWireCompat.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverElement.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Internal/WebDriverLocatable.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/RemoteWebElement.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/WebDriverException.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/NoSuchElementException.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/FileDetector.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/UselessFileDetector.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Support/IsElementDisplayedAtom.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Support/ScreenshotHelper.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverTimeouts.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/UnknownErrorException.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverPoint.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverMouse.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/RemoteMouse.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/MoveTargetOutOfBoundsException.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Interactions/WebDriverActions.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverKeyboard.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/RemoteKeyboard.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverAction.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Interactions/WebDriverCompositeAction.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Interactions/Internal/WebDriverMouseAction.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Interactions/Internal/WebDriverMoveToOffsetAction.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Interactions/Internal/WebDriverCoordinates.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/TimeoutException.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/StaleElementReferenceException.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverKeys.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/ElementNotInteractableException.php',
    __DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/InvalidArgumentException.php',

dr0id123 commented 9 months ago

EDIT: Created the directory structure needed by pulling from php-webdriver's git. I was able to execute the bridge (needed to comment out a line calling formatItemTimestamp, as "IntlDateFormatter" couldn't be found).

Works! :) Noticed that broader-level feed caching also works. This will be a nice step forward for this project: not only does it help with JS sites, it also helps considerably with sites that use Cloudflare. I'll have fun playing with this now and as it matures. Thanks again for your continued work on this!


Thanks; however, I get additional errors:

.../../vendor/php-webdriver/webdriver/lib/WebDriverSearchContext.php): failed to open stream: No such file or directory.....

This folder doesn't appear in master: /vendor/php-webdriver/webdriver/

hleskien commented 9 months ago

EDIT: Created the directory structure needed by pulling from php-webdriver's git. I was able to execute the bridge (needed to comment out a line calling formatItemTimestamp, as "IntlDateFormatter" couldn't be found).

Ah, that wouldn't have been necessary. Just a few comments up, @dvikan posted this command:

composer require --update-no-dev php-webdriver/webdriver

And IntlDateFormatter is a standard PHP class and should work right out of the box. At least it did on my machine: https://www.php.net/manual/en/class.intldateformatter.php

dvikan commented 9 months ago

IntlDateFormatter (from the php-intl extension) might be missing out of the box on e.g. Debian:

apt install php-intl

More info in composer.json:

    },
    "suggest": {
        "php-webdriver/webdriver": "Required for Selenium usage",
        "ext-memcached": "Allows to use memcached as cache type",
        "ext-sqlite3": "Allows to use an SQLite database for caching",
        "ext-zip": "Required for FDroidRepoBridge",
        "ext-intl": "Required for OLXBridge"
    },

dr0id123 commented 9 months ago

Feedback:

I'm no developer, so I can only test and provide feedback :)

Other: the Videocardz site -- it has recently gone behind Cloudflare for several regions of the world. While this won't address that issue, if you are able to access it: I was able to build a full-text feed (perhaps useful as an example):

https://gist.github.com/dr0id123/0d14ff0e2a666fd6e8c949091c7dea45

Hope this helps!

dvikan commented 9 months ago

This was recently merged into master so that users can test it out. Let's call it a beta testing phase. When things are looking good, I'll add instructions to the README.

In the meantime, all that is needed to test this is composer require --update-no-dev php-webdriver/webdriver.

I'd advise against manually modifying bootstrap.php or files in the vendor folder, unless you know what you are doing.

dr0id123 commented 9 months ago

I think I have a way that presents a path forward (in some cases):

After looking around, I came across this project: https://github.com/FlareSolverr/FlareSolverr. FlareSolverr uses Selenium combined with a modified driver to help bypass Cloudflare, all wrapped up in Docker. As we know, bypassing Cloudflare is very hard and depends on the site, its configuration, etc. That said, it does work for some sites.

Here's my working example:

https://gist.github.com/dr0id123/2863a821b8bf35864f977a3eb9dface4

Using this method can be quite slow (FlareSolverr takes time to get around protections) and otherwise expensive performance-wise due to Selenium. Hence the caching really helps, and I'm happy the RSS-Bridge API provides for it.

I'm no coder, I'm sure you both could have done this in a better way.....

(Works quite nicely for me!!)

hleskien commented 9 months ago

I'm no coder, I'm sure you both could have done this in a better way.....

That depends on how you look at it. In my opinion, this amounts to a game of cat and mouse in which both sides can only lose. I don't feel like investing energy into it. Web technologies work best when they are designed to be interactive. I can live with it if webmasters of modern websites no longer know what an RSS feed is. Then I'll just build one myself. But if someone sets up a walled garden, then I don't have to free the content from it at any price.

I added a path for active content sites because that's the modern web. But I have no interest in countermeasures for defensive strategies.

dr0id123 commented 9 months ago

EDIT -- Am I wrong about "does not solve the issue of sites that you have created bridges and this interface for"? -- I'm truly new to this, so sorry for the basic questions....

The JSON contains the full response (e.g., the content of the page article). Does this mean that using the JSON output from FlareSolverr is a way to get the DOM and parse it accordingly to create a feed? Is this because it uses Selenium and pulls directly out of the browser?

I added a path for active content sites because that's the modern web. But I have no interest in countermeasures for defensive strategies.

Totally understandable. It is literally a cat-and-mouse game with tools like FlareSolverr, which, I may add, does not solve the issue of the sites that you have created bridges and this interface for.

Ultimately, I've come to the conclusion that there is no "all-in-one" answer to all of this. RSS-Bridge provides a menu of approaches to choose from, depending on the site and the need. The WebDriver support you've added to the project is another tool in the belt and a much-needed addition.

dr0id123 commented 9 months ago

@dvikan

I've taken some time to educate myself on 'dynamic DOM' and the complications that exist today with RSS-Bridge. The work done in this pull request is awesome, but as with many things code-related, there are multiple ways to reach the same objective. The implementation in this thread is, I think, awesome for advanced cases. For other cases, I do think the project would benefit from an integration with FlareSolverr (or at least guidance with examples), not because it attempts some Cloudflare mitigation, but because it uses Selenium under the hood, which means dynamic content is processed.

To give a clear example, using the 'scalable capital' example, here is another way that does not require the use of php-webdriver and keeps things simple:

https://gist.github.com/dr0id123/17faa66b365f37546d93bd5f033759e3

dvikan commented 9 months ago

yes this is similar to what i wrote earlier in this thread:

$dom = getContentsWithWebDriver('https://herman.bearblog.dev/blog/');

Essentially a function which returns an HTML string (or a parsed DOM) for a dynamic page.

FlareSolverr is a proxy server to bypass Cloudflare protection, which incidentally renders the dynamic DOM using Selenium, as you write.

dvikan commented 9 months ago

your example is actually also possible like this (using getContents):

        $header = ['Content-type: application/json'];
        $opts = [CURLOPT_POSTFIELDS => json_encode($data_array)];
        $html = getContents($url, $header, $opts);
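
For context, a rough sketch of what $url and $data_array could be when talking to a local FlareSolverr instance; the endpoint and field names follow FlareSolverr's documented API, while the host/port and target URL are assumptions.

// Assumed local FlareSolverr endpoint (default port 8191).
$url = 'http://localhost:8191/v1';

// request.get asks FlareSolverr to fetch the page in its own browser session.
$data_array = [
    'cmd' => 'request.get',
    'url' => 'https://example.com/articles',
    'maxTimeout' => 60000, // milliseconds
];

$header = ['Content-type: application/json'];
$opts = [CURLOPT_POSTFIELDS => json_encode($data_array)];
$json = getContents($url, $header, $opts);

// The rendered HTML ends up under solution.response (see the working example further down).
$html = json_decode($json, true)['solution']['response'];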

dvikan commented 9 months ago

@hleskien This is from a stackoverflow answer:

In [PHP Selenium WebDriver](https://github.com/php-webdriver/php-webdriver) you can get the page source like this:

$html = $driver->getPageSource();

Or get HTML of the element like this:

// innerHTML if you need HTML of the element content
$html = $element->getDomProperty('outerHTML');
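
Building on that, here is a hedged sketch of what the getContentsWithWebDriver() helper proposed earlier could look like with php-webdriver. The function itself, the default Selenium URL and the wait-for-body condition are assumptions, not existing project code.

<?php

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\WebDriverBy;
use Facebook\WebDriver\WebDriverExpectedCondition;

function getContentsWithWebDriver(string $url, string $seleniumUrl = 'http://localhost:4444'): string
{
    $driver = RemoteWebDriver::create($seleniumUrl, DesiredCapabilities::chrome());
    try {
        $driver->get($url);
        // Give scripts a moment to build the DOM; a real helper would wait for a
        // bridge-specific element rather than the bare <body>.
        $driver->wait(10)->until(
            WebDriverExpectedCondition::presenceOfElementLocated(WebDriverBy::cssSelector('body'))
        );
        // getDomProperty('outerHTML') on <html> serializes the current, rendered DOM.
        return $driver->findElement(WebDriverBy::tagName('html'))->getDomProperty('outerHTML');
    } finally {
        $driver->quit();
    }
}

The returned string could then be parsed with simplehtmldom (str_get_html) like the markup in any other bridge.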

dr0id123 commented 9 months ago

Thank you! This helps me to simplify elsewhere. gist updated.

This works:

$header = ['Content-type: application/json'];
$opts = [CURLOPT_POSTFIELDS => json_encode($data)];
$response_data = json_decode(getContents($api_url, $header, $opts), true);
$html = $response_data["solution"]["response"];
return $html;


dvikan commented 8 months ago

how is this working for ya? is it still good?

dr0id123 commented 8 months ago

@dvikan

Preface:

Feedback:

I am not specifically using WebDriver at this point, for several reasons:

  1. It requires dependencies, and I don't use Composer (see above), which leads to dependency problems. At some point I will switch to the Docker image of rss-bridge, where I hope these dependency issues will be addressed.
  2. The Atom feed was broken the last time I tried it with WebDriver.
  3. WebDriver - I don't have a need for this implementation. My needs are: a. Get around Cloudflare where possible. b. Be able to create feeds, where possible including sites with a dynamic DOM. c. Create full-text feeds where possible, including sites that obfuscate div tags.
  4. To address needs (a) and (b) above, I am using FlareSolverr as discussed in this thread. No additional code dependencies (libraries) are needed for it to work. It is a very good no-cost solution for Cloudflare at the moment, plus the dynamic DOM is processed. It works with all of the other functionality of rss-bridge (caching, feedexpander, XPath).
  5. To address (c), I've been able to import "readability" to create full-text feeds where I need them, as an alternative to CSS/XPath directly (and it works great!).

Therefore, my suggestions to this project:

  1. Incorporate 'light' support for FlareSolverr. That means an option in the config file to specify connection parameters, and a wrapper function -- and if not, add some documentation with an example for others:

$html = getContentsWithFlareSolverr('https://herman.bearblog.dev/blog/');

  2. Introduce 'light' support for readability -- see here: https://github.com/fivefilters/readability.php. By 'light', I mean addressing the dependency (vendor folder?) or similar. As mentioned, it works very nicely! (See the sketch below.)

Hope this helps.
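
For what it's worth, here is a minimal sketch of the readability idea, assuming the fivefilters/readability.php package linked above (class and method names are taken from that library; the fetched URL is only an example, and the HTML could just as well come from FlareSolverr).

<?php

use fivefilters\Readability\Readability;
use fivefilters\Readability\Configuration;

// Fetch the raw page markup with RSS-Bridge's own helper (example URL).
$html = getContents('https://example.com/article');

$readability = new Readability(new Configuration());
$readability->parse($html);

$item = [
    'title'   => $readability->getTitle(),
    'content' => $readability->getContent(), // cleaned full-text article body
];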

swofl commented 1 week ago

Just found this Issue and have some observations to share.

I did my best to write a bridge for https://runrepeat.com, but due to its dynamic JavaScript nature and its forced "Recommended" article auto-sort, I found my way to the php-webdriver extension you developed here.

Things work rather well; using the browser to navigate the site takes a while though, and I had to adjust my RSS client's timeout accordingly. I also impose a rather low article scrape limit:

My GitHub fork's RunRepeatBridge

Sometimes there are connection issues with the Selenium instance; you can see all the different kinds of errors I catch at the end of collectData(). Maybe I will add some more; this has only been in operation for a couple of days.
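
For anyone catching the same kind of errors, a rough sketch of the guard I mean; the bridge name and the Selenium URL are made up, only the exception classes come from php-webdriver.

<?php

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Exception\NoSuchElementException;
use Facebook\WebDriver\Exception\TimeoutException;
use Facebook\WebDriver\Exception\WebDriverException;

class RunRepeatLikeBridge extends BridgeAbstract
{
    public function collectData()
    {
        $driver = null;
        try {
            $driver = RemoteWebDriver::create('http://localhost:4444', DesiredCapabilities::chrome());
            // ... navigate, wait for elements, fill $this->items ...
        } catch (TimeoutException | NoSuchElementException $e) {
            // The page was too slow or its structure changed.
            throw new \Exception('Scrape failed: ' . $e->getMessage(), 0, $e);
        } catch (WebDriverException $e) {
            // Connection problems with Selenium/chromedriver end up here.
            throw new \Exception('WebDriver error: ' . $e->getMessage(), 0, $e);
        } finally {
            if ($driver !== null) {
                $driver->quit(); // always release the browser session
            }
        }
    }
}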

I am not sure how this would scale, though, if integrated natively into the project. People would have to run their own Selenium grids, and my single Chrome instance already takes about 550 MB of RAM. That instance is then occupied while scraping is in progress, for about 60-90 seconds with my low limits. If someone had multiple users scraping with different settings, you'd have to host your own Selenium grid just for that.

So... I like that it's there for enthusiasts, but I don't think it scales well.