Open hleskien opened 9 months ago
Hello, thanks for the issue and PR. I can help you make this happen.
I didn't commit my `composer.lock` because the changes look random and I didn't want to invest time (yet) to understand composer.
At the moment, end users do not use composer and/or composer.json/composer.lock. End users simply extract rss-bridge into a folder and everything works out of the box. This has been the standard way of php since the 90s.
The current usage of composer is to fetch developer dependencies such as phpunit and phpcs (linter).
Current solution to thirdparty dependencies is to commit them to repo in vendor folder.
* [ ] I have added the php-webdriver dependencies to `bootstrap.php`, but they are not complete. They look messy and unsorted because of the dependencies between them. I suggest to use the autoloader.
We can commit this dependency to vendor folder manually, or we can start to require end users to use composer for installation.
* [ ] All bridges that depend on WebDriverAbstract would create high(er) load on a multi-user (open) server. Maybe there should be an option to disable the class instead of each bridge using it.
Should have a setting to disable this. Or some way to quickly disable these resource heavy bridges.
* [ ] Because loading web pages through a browser is such a heavier load, it would make sense to stop scraping as soon as an element is in a cache (assuming the elements are sorted chronologically). Is there already a cache for elements that have already been scraped? How can I use it? I looked around the code but didn't find any examples of this or documentation.
The output of bridges is cached for 1h by default. This is configurable per bridge with `const CACHE_TIMEOUT`.
In addition, we have a cache for more fine-grained stuff. It is a field on `BridgeAbstract` and can be accessed like this:
$this->set('foo', 'bar');
$val = $this->get('foo');
See https://github.com/RSS-Bridge/rss-bridge/blob/master/lib/CacheInterface.php
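To illustrate the "stop scraping as soon as an element is in the cache" idea, here is a minimal sketch. `ArrayCache` and `collectNewEntries()` are hypothetical names made up for this example. The in-memory class only mimics the `set()`/`get()` calls shown above and is not the project's real cache implementation.

```php
<?php
// Hypothetical in-memory stand-in for the bridge cache (assumption, not RSS-Bridge code).
final class ArrayCache
{
    private array $data = [];

    public function set(string $key, $value): void
    {
        $this->data[$key] = $value;
    }

    public function get(string $key, $default = null)
    {
        return $this->data[$key] ?? $default;
    }
}

/**
 * Walks entries newest-first and stops at the first one seen on an earlier run.
 * $entries is a list of ['id' => string, 'title' => string], sorted newest first.
 */
function collectNewEntries(array $entries, ArrayCache $cache): array
{
    $fresh = [];
    foreach ($entries as $entry) {
        $key = 'seen_' . $entry['id'];
        if ($cache->get($key) !== null) {
            break; // everything after this point was scraped earlier
        }
        $cache->set($key, true);
        $fresh[] = $entry;
    }
    return $fresh;
}
```

This only pays off if the site really lists its elements chronologically. Otherwise the `break` would skip unseen entries.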
* [ ] What do you miss in the documentation for WebDriverAbstract?
Let's postpone docs. Let's make it work first, then make good docs.
Please can you write linux/debian non-docker instructions on how to set up the selenium server on localhost.
Please can you write linux/debian non-docker instructions on how to set up the selenium stuff.
For development? That's easy: just install the Debian package `chromium-driver` (+ chromium). Since it is the base of Chrome, it is fully compatible.
A full Selenium Grid for production would be more complicated.
Recent commits in master allow installing `php-webdriver/webdriver` using composer:
composer require --update-no-dev php-webdriver/webdriver
or if you also want dev deps:
composer require php-webdriver/webdriver
$ sudo apt install chromium-driver
$ chromedriver --port=4444 --verbose
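Before pointing a bridge at the driver, it can help to check that it is actually listening. The sketch below assumes the driver exposes the standard WebDriver `GET /status` endpoint on the port used above. The function names are made up for this example.

```php
<?php
/**
 * Interprets the JSON body returned by GET /status on a WebDriver server.
 * Returns true only when the server reports it can accept a new session.
 */
function isWebDriverReady(string $statusJson): bool
{
    $decoded = json_decode($statusJson, true);
    if (!is_array($decoded)) {
        return false;
    }
    return ($decoded['value']['ready'] ?? false) === true;
}

/**
 * Fetches the status document, or returns null if the server is unreachable.
 * The default URL matches the --port=4444 used in the command above.
 */
function fetchWebDriverStatus(string $baseUrl = 'http://localhost:4444'): ?string
{
    $context = stream_context_create(['http' => ['timeout' => 3, 'ignore_errors' => true]]);
    $body = @file_get_contents(rtrim($baseUrl, '/') . '/status', false, $context);
    return $body === false ? null : $body;
}
```

A quick check would then be `$body = fetchWebDriverStatus();` followed by `isWebDriverReady($body ?? '')`.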
At the moment, end users do not use composer and/or composer.json/composer.lock. End users simply extract rss-bridge into a folder and everything works out of the box. This has been the standard way of php since the 90s.
The current usage of composer is to fetch developer dependencies such as phpunit and phpcs (linter).
Current solution to thirdparty dependencies is to commit them to repo in vendor folder.
Ah, I assumed that this was due to performance considerations. I've been using PHP autoloaders since the early 2010s and I think it's a good idea, especially to avoid dependency hell.
I am against checking in libraries, partly because of the duplication, but also because then you are responsible for keeping this code up to date, especially if there are security holes.
We can commit this dependency to vendor folder manually, or we can start to require end users to use composer for installation.
How do end users use rss-bridge? I would think that most download the Docker image, a much smaller proportion build the Docker image themselves, and the even smaller remainder use the repository via a web server. This rest should also be able to install the necessary packages locally and invoke composer manually.
Should have a setting to disable this. Or some way to quickly disable these resource heavy bridges.
At the moment I don't know enough of the source code to write a patch for it.
The output of bridges are cached by default 1h. Configurable by bridge with `const CACHE_TIMEOUT`. In addition, we have cache for more fine-grained stuff. It is a field on `BridgeAbstract` and can be accessed like this:
Thank you. I will try this for GULPProjekteBridge.
I think it would be very nice to have a function, e.g. `getContentsWithWebDriver()`, that opens a page, waits for it to load and then dumps the HTML. Example:
<?php
class BearBlogBridge extends BridgeAbstract
{
    const NAME = 'BearBlog (bearblog.dev)';

    public function collectData()
    {
        $dom = getContentsWithWebDriver('https://herman.bearblog.dev/blog/');
        foreach ($dom->find('.blog-posts li') as $li) {
            $a = $li->find('a', 0);
            $this->items[] = [
                'title' => $a->plaintext,
                'uri' => 'https://herman.bearblog.dev' . $a->href,
            ];
        }
    }
}
This very basic function could solve lots of problems with dynamic DOM.
Is this possible?
How do end users use rss-bridge? I would think that most download the Docker image, a much smaller proportion build the Docker image themselves, and the even smaller remainder use the repository via a web server. This rest should also be able to install the necessary packages locally and invoke composer manually.
I'm not actually sure, but most probably use the Docker image from Docker Hub.
Should have a setting to disable this. Or some way to quickly disable these resource heavy bridges.
At the moment I don't know enough of the source code to write a patch for it.
Don't worry about it.
This very basic function could solve lots of problems with dynamic DOM.
Is this possible?
Yes, that would certainly make things easier, but unfortunately, as far as I know, there are no methods for doing this. The DOM is like a fluid in motion. Turning it back into HTML would be possible, but also difficult.
Scraping a webpage using the WebDriver is simply different.
Following this thread with great interest. I'm no programmer, but have you seen this article?
https://www.zenrows.com/blog/selenium-php#parse-the-data-you-want
There is code for scraper.php which does exactly what is being asked here -- returns the html of the page. (e.g., $html = $driver->getPageSource();)
getPageSource() returns the HTML source loaded into the browser, but not the DOM. The DOM turned back into HTML is what we are interested in. The rest of the article does what I did.
Thanks for clarifying.
I'm attempting to try this out, but unfortunately I'm new to this.
I try to run one of these bridges and get the error:
Type: Error Code: 0 Message: Class 'Facebook\WebDriver\Remote\RemoteWebDriver' not found File: lib/WebDriverAbstract.php Line: 104
I gather this is related to the dependency points above. I realize this isn't a support thread, but any documentation or help would be appreciated.
That is a dependency error. The main branch is currently broken with regard to WebDriver. I will write a fix as soon as I have time for that. But for a quick fix you can append this to the $files array in lib/bootstrap.php (after line 27):
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverSearchContext.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriver.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/JavaScriptExecutor.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverHasInputDevices.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/RemoteWebDriver.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverCapabilities.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/WebDriverCapabilityType.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/WebDriverBrowserType.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverPlatform.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/DesiredCapabilities.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Chrome/ChromeOptions.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Firefox/FirefoxOptions.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverCommandExecutor.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/DriverCommand.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/HttpCommandExecutor.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Local/LocalWebDriver.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Firefox/FirefoxDriver.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/WebDriverCommand.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/PhpWebDriverExceptionInterface.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/Internal/UnexpectedResponseException.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/Internal/WebDriverCurlException.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/WebDriverResponse.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverOptions.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/ExecuteMethod.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/RemoteExecuteMethod.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverWindow.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverWait.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverExpectedCondition.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverBy.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/JsonWireCompat.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverElement.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Internal/WebDriverLocatable.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/RemoteWebElement.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/WebDriverException.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/NoSuchElementException.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/FileDetector.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/UselessFileDetector.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Support/IsElementDisplayedAtom.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Support/ScreenshotHelper.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverTimeouts.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/UnknownErrorException.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverPoint.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverMouse.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/RemoteMouse.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/MoveTargetOutOfBoundsException.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Interactions/WebDriverActions.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverKeyboard.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Remote/RemoteKeyboard.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverAction.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Interactions/WebDriverCompositeAction.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Interactions/Internal/WebDriverMouseAction.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Interactions/Internal/WebDriverMoveToOffsetAction.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Interactions/Internal/WebDriverCoordinates.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/TimeoutException.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/StaleElementReferenceException.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverKeys.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/ElementNotInteractableException.php',
__DIR__ . '/../vendor/php-webdriver/webdriver/lib/Exception/InvalidArgumentException.php',
EDIT: Created the directory structure needed by pulling from php-webdriver's git. I was able to execute the bridge (needed to comment out a line calling formatItemTimestamp as "IntlDateFormatter" couldn't be found).
Works! :) Noticed that broader level feed caching also works. This will be a nice step forward for this project, not only to help with JS sites, but also it helps considerably with sites that use cloudflare. I'll have fun playing with this now and as it matures. Thanks again for your continued work on this!
Thanks, however additional errors.
.../../vendor/php-webdriver/webdriver/lib/WebDriverSearchContext.php): failed to open stream: No such file or directory.....
This folder doesn't appear in master: /vendor/php-webdriver/webdriver/
EDIT: Created the directory structure needed by pulling from php-webdriver's git. I was able to execute the bridge (needed to comment out a line calling formatItemTimestamp as "IntlDateFormatter" couldn't be found).
Ah, that wouldn't have been necessary. Just a few comments up, @dvikan posted this command:
composer require --update-no-dev php-webdriver/webdriver
And `IntlDateFormatter` is a standard PHP class and should work right out of the box. At least it did on my machine:
https://www.php.net/manual/en/class.intldateformatter.php
`IntlDateFormatter` (PHP extension) might be missing out of the box on e.g. Debian:
apt install php-intl
More info in composer.json:
},
"suggest": {
"php-webdriver/webdriver": "Required for Selenium usage",
"ext-memcached": "Allows to use memcached as cache type",
"ext-sqlite3": "Allows to use an SQLite database for caching",
"ext-zip": "Required for FDroidRepoBridge",
"ext-intl": "Required for OLXBridge"
},
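Since `ext-intl` is only a suggested dependency, code that formats dates could guard against its absence instead of failing. The following is a minimal sketch. `formatTimestampSafely()` is a hypothetical helper for illustration and not the project's `formatItemTimestamp`.

```php
<?php
/**
 * Formats a Unix timestamp with IntlDateFormatter when ext-intl is loaded,
 * and falls back to plain gmdate() otherwise.
 * Hypothetical helper, not part of RSS-Bridge.
 */
function formatTimestampSafely(int $timestamp, string $locale = 'en_US'): string
{
    if (extension_loaded('intl') && class_exists('IntlDateFormatter')) {
        $formatter = new IntlDateFormatter(
            $locale,
            IntlDateFormatter::MEDIUM, // date style
            IntlDateFormatter::SHORT,  // time style
            'UTC'
        );
        $formatted = $formatter->format($timestamp);
        if ($formatted !== false) {
            return $formatted;
        }
    }
    // Fallback that needs no extension.
    return gmdate('Y-m-d H:i', $timestamp);
}
```

The exact output differs between the two paths, so this suits display text better than anything that gets parsed again.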
Feedback:
I'm no developer, so I can only test and provide feedback :)
Per the front page of this project, you should be able to download the zip, extract to your web server and execute. (That's what I did). If there are dependencies, (e.g., the php-webdriver folder), then suggest it is added to the project. Otherwise, best to shift to a docker-only model where all of this could be handled automatically?
As I was attempting to create a bridge, I wanted to use navigation within the browser (e.g., back button). To make that work, I had to add to `bootstrap.php`: `__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverNavigationInterface.php',` and `__DIR__ . '/../vendor/php-webdriver/webdriver/lib/WebDriverNavigation.php',`
This will not be a fix for Cloudflare unfortunately, at least not initially. Cloudflare is able to detect selenium sessions and block.
Using the button "ATOM" (to create the feed) is broken and errors out. MRSS does work.
As stated in this thread, while there is caching at the feed-level, there is no caching at the lower levels (e.g., code of the page). This is impactful if you were to build a full-text rss feed.
Other: Videocardz site -- recently it has gone behind cloudflare for several regions in the world. While this won't address that issue, if you are able to access, I was able to do a full-text feed (perhaps useful as an example):
https://gist.github.com/dr0id123/0d14ff0e2a666fd6e8c949091c7dea45
Hope this helps!
This was recently merged in master so that users can test it out. Let's call it a beta testing phase. When things are looking good I'll add instructions to the README.
In the meantime, all that is needed to test this is `composer require --update-no-dev php-webdriver/webdriver`.
I'd advise against manually modifying `bootstrap.php` or files in the `vendor` folder, unless you know what you are doing.
I think I have a way that presents a path forward (in some cases):
After looking around, I came across this project: https://github.com/FlareSolverr/FlareSolverr FlareSolverr uses Selenium combined with a modified driver to help bypass Cloudflare all wrapped up in docker. It's very hard to bypass Cloudflare and it will depend on the site, configuration, etc, as we know. That said, it will work depending on the site.
Here's my working example:
https://gist.github.com/dr0id123/2863a821b8bf35864f977a3eb9dface4
Using this method can be quite slow (FlareSolverr takes time to get around protections) and expensive performance-wise because of Selenium. Hence the caching really helps, and I'm happy the rssBridge API provides for it.
I'm no coder, I'm sure you both could have done this in a better way.....
(Works quite nicely for me!!)
I'm no coder, I'm sure you both could have done this in a better way.....
That depends on how you look at it. In my opinion, this amounts to a game of cat and mouse in which both sides can only lose. I don't feel like investing energy into it. Web technologies work best when they are designed to be interactive. I can live with it if webmasters of modern websites no longer know what an RSS feed is. Then I'll just build one myself. But if someone sets up a walled garden, then I don't have to free the content from it at any price.
I added a path for active content sites because that's the modern web. But I have no interest in countermeasures for defensive strategies.
EDIT -- Am I wrong about "**does not** solve the issue of sites that you have created bridges and this interface for" -- I'm truly new to this so sorry for basic questions....
In the JSON, the full response is within it (e.g., content of the page article). Does this mean that using the JSON output from flaresolverr is a way to get the DOM and parse it accordingly to create a feed? Is this due to it using selenium and pulling directly out of the browser?
I added a path for active content sites because that's the modern web. But I have no interest in countermeasures for defensive strategies.
Totally understandable. It is literally a cat and mouse game with tools like Flaresolverr, which I may add does not solve the issue of sites that you have created bridges and this interface for.
Ultimately, I've come to conclude there is no "all in one" answer to all of this. rssBridge provides a menu of approaches to choose from depending on the site and need. The webdriver you've added to the project is another tool in the belt and a much needed addition.
@dvikan
I've taken some time to educate myself on 'dynamic DOM' and the complications that exist today with rssBridge. The work done in this pull request is awesome, but as with many things code-related, there are multiple ways to get to the same objective. The implementation in this thread is, I think, awesome for advanced cases. For other cases, I do think the project would benefit from an integration with FlareSolverr (or at least guidance with examples), not because it attempts some Cloudflare mitigation, but because it uses Selenium under the hood, which means dynamic content is processed.
To give a clear example, using the 'scalable capital' example, here is another way that does not require the use of php-webdriver and keeps things simple:
https://gist.github.com/dr0id123/17faa66b365f37546d93bd5f033759e3
yes this is similar to what i wrote earlier in this thread:
$dom = getContentsWithWebDriver('https://herman.bearblog.dev/blog/');
Essentially a function which returns html string (or parsed DOM) for a dynamic page.
FlareSolverr is a Proxy server to bypass Cloudflare protection, which incidentally parses a dynamic DOM using selenium as you write.
Your example is actually also possible like this (using `getContents`):
$header = ['Content-type: application/json'];
$opts = [CURLOPT_POSTFIELDS => json_encode($data_array)];
$html = getContents($url, $header, $opts);
@hleskien This is from a stackoverflow answer:
In [PHP Selenium WebDriver](https://github.com/php-webdriver/php-webdriver) you can get page source like this:
$html = $driver->getPageSource();
Or get HTML of the element like this:
// innerHTML if you need HTML of the element content
$html = $element->getDomProperty('outerHTML');
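Putting these calls together, a helper along the lines of the `getContentsWithWebDriver()` idea from earlier in the thread might look roughly like this. It is only a sketch under several assumptions. The parameter names and defaults are made up. It expects php-webdriver to be installed via composer with the autoloader already loaded, and a driver listening on `http://localhost:4444`. It returns an HTML string, not a parsed DOM object.

```php
<?php
// Assumes the composer autoloader has been loaded, e.g. require 'vendor/autoload.php';
use Facebook\WebDriver\Chrome\ChromeOptions;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\WebDriverBy;
use Facebook\WebDriver\WebDriverExpectedCondition;

/**
 * Hypothetical helper: loads $url in a headless browser, waits until
 * $waitSelector is present, and returns the current DOM serialized as HTML.
 */
function getContentsWithWebDriver(
    string $url,
    string $waitSelector = 'body',
    string $serverUrl = 'http://localhost:4444',
    int $timeoutSeconds = 15
): string {
    $options = new ChromeOptions();
    $options->addArguments(['--headless']);
    $capabilities = DesiredCapabilities::chrome();
    $capabilities->setCapability(ChromeOptions::CAPABILITY, $options);

    $driver = RemoteWebDriver::create($serverUrl, $capabilities);
    try {
        $driver->get($url);
        // Block until the element the bridge cares about has been rendered.
        $driver->wait($timeoutSeconds)->until(
            WebDriverExpectedCondition::presenceOfElementLocated(
                WebDriverBy::cssSelector($waitSelector)
            )
        );
        // outerHTML of the root element reflects the DOM at this moment.
        return $driver->findElement(WebDriverBy::tagName('html'))
            ->getDomProperty('outerHTML');
    } finally {
        $driver->quit(); // always release the browser session
    }
}
```

Choosing a good `$waitSelector` matters. Waiting only for `body` returns as soon as the page shell exists, possibly before the scripted content has arrived.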
Thank you! This helps me to simplify elsewhere. gist updated.
This works:
$header = ['Content-type: application/json'];
$opts = [CURLOPT_POSTFIELDS => json_encode($data)];
$response_data = json_decode(getContents($api_url, $header, $opts), true);
$html = $response_data["solution"]["response"];
return $html;
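For readers who want to see the surrounding pieces, here is a hedged sketch of building the request body and unpacking the response. The `cmd`, `url` and `maxTimeout` fields are assumptions about the FlareSolverr API and should be checked against its documentation. Only the `solution` / `response` path is taken from the snippet above. Both function names are made up for this example.

```php
<?php
/**
 * Builds the JSON body for a FlareSolverr page request.
 * Field names are assumptions about the FlareSolverr API.
 */
function buildFlareSolverrPayload(string $targetUrl, int $maxTimeoutMs = 60000): string
{
    return json_encode([
        'cmd' => 'request.get',
        'url' => $targetUrl,
        'maxTimeout' => $maxTimeoutMs,
    ], JSON_THROW_ON_ERROR);
}

/**
 * Extracts the rendered HTML from a FlareSolverr response body.
 * Throws if the expected solution/response path is missing.
 */
function extractFlareSolverrHtml(string $responseJson): string
{
    $data = json_decode($responseJson, true);
    if (!is_array($data) || !isset($data['solution']['response'])) {
        throw new RuntimeException('Unexpected FlareSolverr response');
    }
    return (string) $data['solution']['response'];
}
```

Failing loudly on a malformed response avoids handing an empty string to the HTML parser, which would otherwise show up as an empty feed.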
how is this working for ya? is it still good?
@dvikan
Preface:
Feedback:
I am not specifically using webdriver at this point for a couple of reasons:
Therefore, my suggestions to this project:
$html = getContentsWithFlareSolverr('https://herman.bearblog.dev/blog/');
Hope this helps.
Just found this Issue and have some observations to share.
I did my best to write a Bridge for https://runrepeat.com, but due to its dynamic javascript nature, and their forced use of a "Recommended" article auto-sort I found my way to the php-webdriver extension you developed here.
Things work rather well, using the browser to navigate the site takes a while though, and I had to adjust my RSS clients timeout accordingly. I also impose a rather low article scrape limit:
My GitHub forks RunRepeatBridge
Sometimes there are connection issues to the Selenium instance, you can see all the different kinds of errors I catch at the end of collectData(). Maybe I will add some more, this has only been in operation for a couple of days.
I am not sure how this would scale though, if integrated natively into the project. People would have to run their own selenium-grids, and my single chrome-instance already takes about 550mb of RAM. This instance is then occupied when scraping is in progress, for about 60-90s with my low limits. If someone had multiple users, with different settings scraping, you'd have to host your own selenium grid just for that.
So... I like that it's there for enthusiasts, but I don't think it scales well.
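One way to soften the occasional connection errors mentioned above is a small retry wrapper around the driver call. This is a generic sketch, not code from the linked bridge. `withRetries()` and its parameters are hypothetical.

```php
<?php
/**
 * Runs $operation and retries when it throws, up to $maxAttempts times.
 * Rethrows the last exception once all attempts are used up.
 * The current attempt number is passed to $operation.
 */
function withRetries(callable $operation, int $maxAttempts = 3, int $delaySeconds = 0)
{
    if ($maxAttempts < 1) {
        throw new InvalidArgumentException('maxAttempts must be at least 1');
    }
    $lastError = null;
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            return $operation($attempt);
        } catch (Exception $e) {
            $lastError = $e;
            if ($attempt < $maxAttempts && $delaySeconds > 0) {
                sleep($delaySeconds); // give the Selenium instance time to recover
            }
        }
    }
    throw $lastError;
}
```

Retries multiply the time a browser instance stays occupied, so a low attempt count is probably the safer default on a shared server.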
Hello everyone,
A few weeks ago I came across the well-known problem that rss-bridge doesn't work for some websites. This is always the case when the website loads some content via XMLHttpRequest (XHR) and / or changes it using JavaScript. After some experimentation and thinking about the problem, I came to the conclusion that it can only be solved using a web browser. Coincidentally, I already had a lot of experience with Selenium and the WebDriver and decided to come up with a solution.
I now suggest that the WebDriverAbstract class be imported into the project, temporarily in its own branch. As examples I have written two bridges that use these.
The following problems still need to be solved:
* [ ] I didn't commit my `composer.lock` because the changes look random and I didn't want to invest time (yet) to understand composer.
* [ ] I have added the php-webdriver dependencies to `bootstrap.php`, but they are not complete. They look messy and unsorted because of the dependencies between them. I suggest to use the autoloader.
Kind regards,
Holger