RSS-Bridge / rss-bridge

The RSS feed for websites missing it
https://rss-bridge.org/bridge01/
The Unlicense

Bypassing cloudflare js challenge idea #2547

Closed (em92 closed this issue 1 year ago)

em92 commented 2 years ago

Instead of trying to retrieve content directly from a Cloudflare-protected server, RSS-Bridge could retrieve it from another server (a proxy?) that bypasses Cloudflare.

How would that server work? After accepting an HTTP request, it would drive a web browser (using Selenium or something similar, which I am not familiar with) that simply visits the page. During the visit, Cloudflare does its JavaScript magic and then redirects to the requested page.

Ping @sysadminstory, as a bridge maintainer who is probably fed up with Cloudflare protections.

Related issues: https://github.com/RSS-Bridge/rss-bridge/issues/1873 https://github.com/RSS-Bridge/rss-bridge/issues/1925 https://github.com/RSS-Bridge/rss-bridge/issues/2510 (ping @Roliga) https://github.com/RSS-Bridge/rss-bridge/issues/1779 (ping @csisoap) and some others

What do you think about it?

em92 commented 2 years ago

Here is one of the implementations that I found but have not tested in practice: https://github.com/unixfox/pupflare

em92 commented 2 years ago

> Here is one of the implementations that I found but have not tested in practice: https://github.com/unixfox/pupflare

Tested. Does not work.

triatic commented 2 years ago

Which proxy server is suitable to use? These Cloudflare protections are at least IP address based, and while a proxy might be a solution, I suspect Cloudflare will add those to the IP blocks.

sysadminstory commented 2 years ago

If we find a reliable solution, I'll be the happiest maintainer in the world :laughing:

dear-clouds commented 2 years ago

How about https://github.com/FlareSolverr/FlareSolverr ?

em92 commented 2 years ago

> How about https://github.com/FlareSolverr/FlareSolverr ?

Yes, I checked that it works for at least one query. I haven't integrated it with RSS-Bridge yet.

logmanoriginal commented 2 years ago

Just as a heads-up, here is an old PR with a good reason why a Cloudflare bypass was not implemented in rss-bridge: https://github.com/RSS-Bridge/rss-bridge/pull/725

TL;DR:

> Should RSS-Bridge use any means possible to access data ? That's something to consider carefully because at some point it can legally constitute an "unauthorized intrusion into an information system". -- https://github.com/RSS-Bridge/rss-bridge/pull/725#issuecomment-399789055

somini commented 2 years ago

Just a heads up, integrating this modified CURL library seems to work for now.

https://github.com/lwthiker/curl-impersonate

It emulates the TLS characteristics of a regular browser, so they cannot trivially detect this is not a regular human doing the request.

What I tested so far:

* Set up curl-impersonate and get the modified curl library. There are Docker containers and AUR packages.

* Run `LD_PRELOAD=/usr/lib/libcurl-impersonate-firefox.so CURL_IMPERSONATE=ff91esr php -S localhost:8000`

This seems to work, but the hard part is deploying this. You need to set up those environment variables for the entire process that runs PHP, which is usually the server. Doing that is left as an exercise to the reader.

em92 commented 2 years ago

As @VerifiedJoseph mentioned in https://github.com/RSS-Bridge/rss-bridge/pull/2599#issuecomment-1086931230, there are 3 kinds of Cloudflare challenge. An example of each:

* JS challenge: https://verifiedjoseph.com/js-challenge
* Legacy CAPTCHA challenge: https://verifiedjoseph.com/legacy-captcha-challenge
* Managed challenge: https://verifiedjoseph.com/managed-challenge

I edited the topic to mention exactly which challenge I am talking about.

em92 commented 2 years ago

Also noting that this feature has become low priority for me, since my PikabuBridge started working again without implementing a Cloudflare bypass.

dugite-code commented 2 years ago

For anyone looking into using FlareSolverr, here is a diff of the NineGagBridge showing how I've utilized it after getting locked out by Cloudflare. Just a warning: it's not performant in any way, but as it's used by my RSS reader it shouldn't prove to be an issue.

https://github.com/RSS-Bridge/rss-bridge/blob/master/bridges/NineGagBridge.php#L139

                $cursor = 'c=10';
+               $header = array('Content-Type: application/json');
                for ($i = 0; $i < $this->getPages(); ++$i) {
+                       // Ask FlareSolverr to fetch the page; maxTimeout is in milliseconds
+                       $payload = json_encode(array(
+                               'cmd' => 'request.get',
+                               'url' => $url . $cursor,
+                               'maxTimeout' => 10000,
+                       ));
+                       // Note: CURLOPT_TIMEOUT is in seconds, not milliseconds
+                       $opts = array(
+                               CURLOPT_POSTFIELDS      => $payload,
+                               CURLOPT_RETURNTRANSFER  => true,
+                               CURLOPT_POST            => true,
+                               CURLOPT_TIMEOUT         => 10000,
+                       );
-                       $content = getContents($url . $cursor);
+                       $content = getContents('http://flaresolver:8191/v1', $header, $opts);
+                       $response = json_decode($content, true);
+                       // The JSON API reply comes back wrapped in an HTML <pre> element
+                       preg_match('/<pre>(.*?)<\/pre>/s', $response['solution']['response'], $match);
+                       $body = strip_tags($match[0]);
-                       $json = json_decode($content, true);
+                       $json = json_decode($body, true);
                        $posts = array_merge($posts, $json['data']['posts']);
                        $cursor = $json['data']['nextCursor'];
                }

*Edit: this is very much not performant. I keep going past the maxTimeout; a request can take more than 20-30 seconds, so depending on your system this will need tuning. Also, in the example above I moved the $payload into the loop for pagination.

Bockiii commented 2 years ago

@dugite-code can't the same be achieved with the proxy setting?

dugite-code commented 2 years ago

@Bockiii No, unfortunately FlareSolverr isn't exposed as a standard proxy that curl can use; it needs a JSON payload specifying the request, URL and max timeout.
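
To make the difference from a plain proxy concrete, here is a minimal sketch of the JSON envelope described above, plus unwrapping a JSON API reply from the returned page. The helper names `buildFlareSolverrPayload` and `extractJsonFromSolution` are hypothetical. The response shape (`solution.response` holding the rendered page, with JSON wrapped in a `<pre>` element) is taken from the diff earlier in this thread, not from a checked specification. The `html_entity_decode` step is an added assumption, on the reasoning that a browser-rendered page escapes `&`, `<` and `>`.

```php
<?php
declare(strict_types=1);

/**
 * Build the JSON body for a FlareSolverr GET request.
 * Hypothetical helper; field names follow the diff earlier in the thread.
 */
function buildFlareSolverrPayload(string $url, int $maxTimeoutMs = 60000): string
{
    return json_encode([
        'cmd' => 'request.get',
        'url' => $url,
        'maxTimeout' => $maxTimeoutMs, // milliseconds
    ], JSON_THROW_ON_ERROR);
}

/**
 * Extract a JSON document from the page FlareSolverr rendered.
 * Returns null when the reply has no <pre> block or it is not valid JSON.
 */
function extractJsonFromSolution(string $flareSolverrReply): ?array
{
    $reply = json_decode($flareSolverrReply, true);
    $html = $reply['solution']['response'] ?? null;
    if (!is_string($html) || !preg_match('/<pre[^>]*>(.*?)<\/pre>/s', $html, $match)) {
        return null;
    }
    // Use the captured group and undo HTML escaping before decoding.
    $data = json_decode(html_entity_decode($match[1]), true);
    return is_array($data) ? $data : null;
}
```

The payload string would be sent as the POST body to the FlareSolverr endpoint (`http://flaresolver:8191/v1` in the diff above), which is why a curl proxy setting alone cannot replace it.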

dugite-code commented 2 years ago

So it turns out that, shockingly (/s), FlareSolverr is much more performant if you utilize its sessions function. Still slower than a regular curl request, but it'll do for my use case.

public function collectData() {
        $url = sprintf(
                '%sv1/group-posts/group/%s/type/%s?',
                self::URI,
                $this->getGroup(),
                $this->getType()
        );
        $cursor = 'c=10';
        $posts = array();
+        $session = "9gag" . $this->getGroup();
+        $header = array('Content-Type: application/json');
+        $payload = array(
+                'cmd' => 'sessions.create',
+                'session' => $session,
+                );
+        $opts = array(  CURLOPT_POSTFIELDS      => json_encode($payload),
+                CURLOPT_RETURNTRANSFER  => true,
+                CURLOPT_POST            => true,
+                CURLOPT_TIMEOUT         => 10000,
+        );
+        // Create Session
+        getContents('http://flaresolver:8191/v1', $header, $opts);
+        $payload['cmd'] = 'request.get';
+        $payload['maxTimeout'] = 10000;
        for ($i = 0; $i < $this->getPages(); ++$i) {
+                $payload['url'] = $url . $cursor;
+                $opts[CURLOPT_POSTFIELDS] = json_encode($payload);
+                $content = getContents('http://flaresolver:8191/v1', $header, $opts);
-                $content = getContents($url . $cursor);
+                $response = json_decode($content, true);
+                preg_match('/<pre>(.*?)<\/pre>/s', $response['solution']['response'], $match);
+                $body = strip_tags($match[0]);
+                $json = json_decode($body, true);
-                $json = json_decode($content, true);
                $posts = array_merge($posts, $json['data']['posts']);
                $cursor = $json['data']['nextCursor'];
        }
+        $payload = array(
+                'cmd' => 'sessions.destroy',
+                'session' => $session,
+        );
+        $opts[CURLOPT_POSTFIELDS] = json_encode($payload);
+        // Destroy Session
+        getContents('http://flaresolver:8191/v1', $header, $opts);

        foreach ($posts as $post) {
                $item['uri'] = $post['url'];
                $item['title'] = $post['title'];
                $item['content'] = self::getContent($post);
                $item['categories'] = self::getCategories($post);
                $item['timestamp'] = self::getTimestamp($post);

                $this->items[] = $item;
        }
}

I'm thinking it may be possible to add this to the Cloudflare exception handling; however, bridges that expect JSON, like the NineGagBridge, would need specific support added. That said, I had to set up a curl call in crontab to regularly request the feed and fill the cache so that my RSS reader doesn't time out, so I don't know how useful this would actually be. https://github.com/RSS-Bridge/rss-bridge/blob/0b40f51c01774d6b3ce5c7c9617dd1fbc2201128/lib/contents.php#L37
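
One weakness of the session version above is that an exception inside the loop would skip the `sessions.destroy` call and leave the FlareSolverr session alive. A possible way around that is to wrap the requests in `try`/`finally`, sketched here with a hypothetical `withFlareSolverrSession` helper. The `$send` callable is an assumption: it stands in for the `getContents('http://flaresolver:8191/v1', $header, $opts)` call so the sketch runs without a FlareSolverr instance.

```php
<?php
declare(strict_types=1);

/**
 * Run several request.get commands inside one FlareSolverr session and
 * always destroy the session afterwards, even if a request throws.
 *
 * @param callable $send  receives one command payload (array), returns the raw reply
 * @param string[] $urls
 * @return string[] raw replies, one per URL
 */
function withFlareSolverrSession(
    callable $send,
    string $session,
    array $urls,
    int $maxTimeoutMs = 10000
): array {
    $send(['cmd' => 'sessions.create', 'session' => $session]);
    try {
        $replies = [];
        foreach ($urls as $url) {
            $replies[] = $send([
                'cmd' => 'request.get',
                'session' => $session,
                'url' => $url,
                'maxTimeout' => $maxTimeoutMs, // milliseconds
            ]);
        }
        return $replies;
    } finally {
        // Runs on success and on exceptions alike.
        $send(['cmd' => 'sessions.destroy', 'session' => $session]);
    }
}
```

For cursor-based pagination as in NineGagBridge, the loop body would compute the next URL from the previous reply instead of iterating a fixed list; the `try`/`finally` shape stays the same.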

quickwick commented 2 years ago

I encountered something today that I think is relevant. My apologies if not.

We may be able to bypass at least some Cloudflare challenges by using HTTP/2 curl requests.

I was doing a test scrape of a site for a possible new bridge and ran into a Cloudflare 403 error. My test was in Python, using the standard "requests" library. After a bit of googling, I found the suggestion to switch to the "httpx" library. "httpx" supports HTTP/2, which apparently gets past some forms of Cloudflare protection, whereas "requests" only supports HTTP/1.1. I tried "httpx", and it worked! No more Cloudflare 403 error.

Some additional googling informed me that curl (at least in newer versions) supports HTTP/2. Perhaps ConnectivityAction.php could be updated to include CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_2_0? Or is it better to implement that on a per-bridge basis? Now that I do a quick search, I see individual bridges are setting various CURLOPT* settings.

I suppose for this to work, it would also depend on the version of curl included in the PHP on the system where rss-bridge is running? Anyway, just putting it out there in case someone else wants to investigate further. I may try taking a poke at this myself, but my PHP skillz are still fairly rudimentary.

Here's a stackoverflow answer talking about PHP, curl and HTTP/2: https://stackoverflow.com/a/37146586
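
As a rough illustration of the version dependency mentioned above, the following sketch only requests HTTP/2 when the local libcurl build reports support for it. The function name `http2CurlOptions` is hypothetical; `curl_version()`, `CURL_VERSION_HTTP2` and `CURL_HTTP_VERSION_2_0` come from PHP's curl extension. The `defined()` guards are there so the sketch also runs on installations where the extension or the constants are missing.

```php
<?php
declare(strict_types=1);

/**
 * Return cURL options requesting HTTP/2 when the local libcurl build
 * advertises support, or an empty array otherwise.
 */
function http2CurlOptions(): array
{
    if (
        !extension_loaded('curl')
        || !defined('CURL_VERSION_HTTP2')
        || !defined('CURL_HTTP_VERSION_2_0')
    ) {
        return [];
    }

    // 'features' is a bitmask of capabilities compiled into libcurl.
    $features = curl_version()['features'] ?? 0;
    if (($features & CURL_VERSION_HTTP2) === 0) {
        return [];
    }

    return [CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_2_0];
}
```

The returned array could be merged into whatever options array a bridge already passes along, so systems without HTTP/2 support keep working unchanged.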

dugite-code commented 2 years ago

@quickwick Just gave CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_2_0 a shot for the 9gag bridge (HTTP/2 is supported in the docker container); unfortunately, it looks like it is not a solution. I suspect it depends on the level of protection that is selected in Cloudflare.

quickwick commented 2 years ago

I spent a bit more time digging into this. First, I discovered that cURL has used CURL_HTTP_VERSION_2TLS as the default since version 7.62.0. The PHP on the system I'm using to host rss-bridge has cURL 7.64.0, so setting the option explicitly changes nothing for me. I tried it anyway on a rudimentary bridge for the site I'm testing, and got the Cloudflare 403 error.

Some additional googling suggested that the way to avoid at least some Cloudflare challenges is to make your request look as similar to a real browser as possible. The recommendation was to load the site in the browser with Developer Tools open to the Network tab, then right-click on the main request and select Copy > Copy as cURL, and use those headers in your cURL request.

I tried that, and put a bunch of the headers into a $headers array that was passed into the getSimpleHTMLDOM call in my test bridge. Success! I was able to pull the HTML and dump it into the debug log, and it looks right.

My next goal is to figure out exactly which header values are necessary, and which can be eliminated. Thankfully it doesn't seem to need cookies or anything else liable to change frequently.

triatic commented 2 years ago

@quickwick did you use the same IP address to fetch the headers as you are using for rss-bridge?

quickwick commented 2 years ago

@triatic I did use the same IP. Is that relevant?

It appears that user-agent is one of the relevant headers, if not the only one, for avoiding the Cloudflare 403 error. The user-agent string is very long, and phpcs is complaining about it. I'm trying to figure out the most correct way to put that long string into the headers array. While testing changes, I broke that header entry, and started getting the Cloudflare 403 error again. Once the user-agent entry in the headers array was fixed, the Cloudflare error went away and I was successfully reading the site again.
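
One possible way to keep a long user-agent string within a line-length limit is plain string concatenation, as in this sketch. The function name `browserLikeHeaders` is hypothetical, and both header values are illustrative placeholders rather than values known to pass any particular site; in practice you would copy what your own browser sends.

```php
<?php
declare(strict_types=1);

/**
 * Build browser-like request headers.
 * The values are illustrative placeholders (an assumption of this sketch).
 */
function browserLikeHeaders(): array
{
    // Concatenation keeps each source line short without changing the
    // resulting string: no line breaks end up in the header.
    $userAgent = 'Mozilla/5.0 (X11; Linux x86_64) '
        . 'AppleWebKit/537.36 (KHTML, like Gecko) '
        . 'Chrome/100.0.0.0 Safari/537.36';

    return [
        'User-Agent: ' . $userAgent,
        'Accept-Language: en-US,en;q=0.5',
    ];
}
```

The resulting array has the same shape as the `$headers` array passed into the getSimpleHTMLDOM call described above.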

triatic commented 2 years ago

@quickwick if you used the same IP address in both cases, then it would appear that the IP address is not relevant. I imagined cloudflare protections could blacklist things like cloud computing servers, in the same way that Facebook does.

somini commented 2 years ago

Bumping this with my solution that consists of using a modified cURL library.

Just a heads up, integrating this modified CURL library seems to work for now.

https://github.com/lwthiker/curl-impersonate

It emulates the TLS characteristics of a regular browser, so they cannot trivially detect this is not a regular human doing the request.

What I tested so far:

* Set up curl-impersonate and get the modified curl library. There are Docker containers and AUR packages.

* Run `LD_PRELOAD=/usr/lib/libcurl-impersonate-firefox.so CURL_IMPERSONATE=ff91esr php -S localhost:8000`

This seems to work, but the hard part is deploying this. You need to set up those environment variables for the entire process that runs PHP, which is usually the server. Doing that is left as an exercise to the reader.
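
As a first sanity check after deployment, a script could look at whether the two environment variables from the command above are visible to the PHP process. This is only a sketch with a hypothetical function name, `isImpersonationConfigured`. Seeing the variables does not prove the modified library was actually loaded, and not seeing them does not prove the opposite, so treat the result as a hint.

```php
<?php
declare(strict_types=1);

/**
 * Check whether the environment looks like a curl-impersonate setup.
 * Both arguments are the raw results of getenv(), which returns false
 * when a variable is not set.
 */
function isImpersonationConfigured($preload, $target): bool
{
    return is_string($preload)
        && strpos($preload, 'libcurl-impersonate') !== false
        && is_string($target)
        && $target !== '';
}

$configured = isImpersonationConfigured(
    getenv('LD_PRELOAD'),
    getenv('CURL_IMPERSONATE')
);
echo $configured
    ? "curl-impersonate environment detected\n"
    : "plain libcurl environment\n";
```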

Bockiii commented 2 years ago

I can see that there is a bunch of movement in the curl-impersonate repo and I like the idea of "outsourcing" the problem and just using the solution that someone else is spending time on. After all, this project isn't focussed on bypassing cloudflare and it shouldn't become our main goal. But with cf becoming more and more prevalent, we need a solution.

So far, I like the curl-replacement the best. As far as rollout goes, this will be easy to implement in the docker container and will require additional documentation for raw-hosters.

triatic commented 2 years ago

Although curl-impersonate is a potential solution, I do not believe it is supported on Windows. Since this project is OS-agnostic, constructing bridges to use Windows-incompatible utilities would be a departure.

somini commented 2 years ago

Thanks for pointing that out. There's no direct incompatibility, just lack of Windows release managers, it seems.

This is being tracked here: https://github.com/lwthiker/curl-impersonate/issues/37

triatic commented 2 years ago

It's good that it's on their roadmap, albeit with seemingly a considerable amount of work still to do. That said, I think we would be neglecting Windows users by integrating curl-impersonate in its current state.

somini commented 2 years ago

True, since this is a cat-and-mouse game, this would be only one of the possibilities to solve this issue.

Bockiii commented 2 years ago

The problem is that every solution we come up with will probably have some challenges in making it available on all platforms. Most, if not all, require additional software like puppeteer or selenium or, in the case of curl-impersonate, can't be rolled out on specific OSes.

What we could do: Make it configurable to try the cf bypass or not (whichever bypass we will select). This way we can prepare it in the docker container to "just work" and if you want to run it on bare metal, you will need to do some manual configuration (some of which we can provide as a guide, some will be gated behind OSes).

So, as an example for the curl-impersonate version: add a config flag "cf-workaround" or so, set it to true for the docker image (as the use of the alternative curl version shouldn't impede the non-cf bridges at all, so it can be the default), add curl-impersonate to the docker image and preconfigure everything. Add a how-to for Linux OSes on what to do if they want to use this workaround. Windows users will not be able to use the workaround but can still use the normal functionality.

Same would go for selenium/puppeteer, but there may be other problems with that. puppeteer for example spawns a new chrome browser for each request by default, which would cause dozens or hundreds of sessions for some bridges alone. So we would need to do some digging on how to use it best. And then we would need to check if the usage of those tools is os-agnostic and just needs to be installed etc.

I'm still a fan of the curl-impersonate option as it has the lowest footprint and is the easiest to implement from the get-go, since it's a drop-in replacement. Yes, Windows users will not be able to use it, but I am questioning how many people use Windows as their host. Most/all online hosters are Linux based, docker is, even Docker on windows is just a linux vm.
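
The config flag idea could look roughly like the sketch below. Everything named here is hypothetical: the `[http]` section, the `cf_workaround` key (an underscore variant of the "cf-workaround" name floated above) and the function `isCloudflareWorkaroundEnabled` are illustrations, not existing RSS-Bridge settings. The sketch defaults to "off", so installations that never set the flag behave as before.

```php
<?php
declare(strict_types=1);

/**
 * Decide whether the Cloudflare workaround should be active.
 * The [http] section and cf_workaround key are hypothetical names.
 */
function isCloudflareWorkaroundEnabled(string $iniText): bool
{
    // INI_SCANNER_TYPED turns "true"/"false" into real booleans.
    $config = parse_ini_string($iniText, true, INI_SCANNER_TYPED);
    if ($config === false) {
        return false; // unreadable config: keep the default behaviour
    }

    return ($config['http']['cf_workaround'] ?? false) === true;
}
```

A docker image could ship a config with the flag set to true, while a bare-metal install would leave it unset until curl-impersonate has been configured manually.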

triatic commented 2 years ago

> Docker on windows is just a linux vm.

Are you sure? If yes, then Windows would be able to run the Linux binary of curl-impersonate via Docker.

triatic commented 2 years ago

Come to think of it, Windows can run Linux binaries via WSL: https://docs.microsoft.com/en-us/windows/wsl/about

Bockiii commented 2 years ago

Jup, it is.

So even people using the docker image on Windows will be able to just use the preconfigured curl-impersonate. It's only about people who download the rss-bridge source and put it into a WAMP stack or an IIS or something like that.

That's why I said that I think that this is a small subset of users.

triatic commented 2 years ago

I've just tested it in Windows and after downloading the precompiled curl-impersonate binary in WSL, this:

wsl /home/triatic/curl-impersonate-chrome https://www.google.com

returns the page as expected. Seems like a viable solution, and easy to configure even without Docker.

Bockiii commented 2 years ago

@somini Have you tried to just "add" this to the official docker image and see how it goes?

I couldn't find anything about a drop-in replacement in the source repo; it always talks about browser-specific wrapper scripts. Would the solution just be an alias, like alias curl=curl_chrome101, or should we symlink curl to the wrapper?

You've done some work on this, I would be super interested in deploying this within the docker image (if it's a drop-in that won't affect the rest of the code).

I've seen some curlopt issues (like user-agents not being used) but I think it's okay for us, as our user-agent tweaking usually surrounds the whole idea of getting around CF :)

triatic commented 2 years ago

I believe we should continue to use the PHP curl module by default, and for specific bridges with issues around Cloudflare, we can call the curl-impersonate binary instead.

One solution is to give getContents() a 5th parameter called "impersonate" which defaults to false when not specified. When true, the code can call the curl-impersonate binary.

In addition, when the binary is missing from the installation (non Docker) and the bridge requires curl-impersonate, we could insert a notice in the RSS feed with some guidance, or otherwise somehow instruct the user. Or just drop back to the PHP curl module, unable to pull data but at least not completely broken.
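
A rough sketch of that idea follows, assuming a curl-impersonate wrapper script such as `curl_chrome101` that accepts ordinary curl command-line arguments. The function names, the explicit `$binary` parameter and the `$fallback` callable (which stands in for the existing PHP curl code path) are all hypothetical, and this is not how getContents() is actually implemented.

```php
<?php
declare(strict_types=1);

/**
 * Build a shell command for a curl-impersonate wrapper, or return null when
 * the binary is missing so the caller can fall back to the PHP curl module.
 *
 * @param string[] $headers e.g. ['Accept-Language: en-US']
 */
function buildImpersonateCommand(string $binary, string $url, array $headers = []): ?string
{
    if (!is_file($binary) || !is_executable($binary)) {
        return null;
    }
    $parts = [escapeshellarg($binary), '--silent', '--location'];
    foreach ($headers as $header) {
        $parts[] = '--header ' . escapeshellarg($header);
    }
    $parts[] = escapeshellarg($url);
    return implode(' ', $parts);
}

/**
 * Hypothetical wrapper showing an "impersonate" flag with a fallback.
 */
function fetchWithOptionalImpersonation(
    string $url,
    array $headers,
    bool $impersonate,
    string $binary,
    callable $fallback
): string {
    $command = $impersonate ? buildImpersonateCommand($binary, $url, $headers) : null;
    if ($command === null) {
        return $fallback($url, $headers); // flag off or binary missing
    }
    $output = shell_exec($command);
    return is_string($output) ? $output : $fallback($url, $headers);
}
```

Only headers are translated here; any other CURLOPT_* option a bridge sets would need its own command-line equivalent.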

somini commented 2 years ago

> @somini Have you tried to just "add" this to the official docker image and see how it goes?
>
> I couldn't find anything about a drop-in replacement in the source repo; it always talks about browser-specific wrapper scripts. Would the solution just be an alias, like alias curl=curl_chrome101, or should we symlink curl to the wrapper?

That's not how this works. PHP itself uses the curl library directly; it doesn't run the binary in the background.

curl-impersonate does support a drop-in replacement for the curl library, but it needs to be loaded before PHP has a chance to run in the first place. Note that the project releases binaries now, including for the library, here: https://github.com/lwthiker/curl-impersonate/releases.

I don't have much experience with Docker, I wouldn't know where to start. The important parts are:

This should use the curl-impersonate library instead of the regular curl library, and the browser to emulate is given on that other environment variable.

somini commented 2 years ago

> I believe we should continue to use the PHP curl module by default, and for specific bridges with issues around Cloudflare, we can call the curl-impersonate binary instead.

We can't just call a different curl binary without rewriting the entire core of RSS Bridge. How it works now is PHP uses the curl library directly. Using the curl-impersonate library is a complete drop-in replacement, no need to support multiple curl libraries, that's way too much work for no benefit.

I think the PHP-curl integration is lower level than the PHP code can access even, it needs to be configured before the server starts up in the first place. That's why this will not work on shared PHP hosts, but on the Docker container it should be possible.

somini commented 2 years ago

I don't really have Docker experience, but I hacked something together in #2925. From what I see in the official php image, the command running is php-fpm, and I think the ENV instruction sets that program's environment variables.

Please review and test it.

triatic commented 2 years ago

> We can't just call a different curl binary without rewriting the entire core of RSS Bridge.

I disagree, the 5th parameter solution for getContents() I gave earlier hardly qualifies as "rewriting the entire core". A minor tweak with a fallback mechanism. One of the advantages of bridges calling getContents() rather than pulling the data using their own custom functions.

Bockiii commented 2 years ago

If this is going to be mostly a docker-only solution (as implementing the impersonate option is a bigger chore on other environments), what's the downside of just completely switching?

AFAIK replacing the lib will make php-curl use the replaced lib. Both should be functionally the same; the only changes seem to be in metadata. So what's the harm?

We should test whether using the Firefox impersonation or the Chrome impersonation could lead to different outputs from bridges, but luckily enough, we already have test suites in place that can do all of that.

somini commented 2 years ago

> > We can't just call a different curl binary without rewriting the entire core of RSS Bridge.
>
> I disagree, the 5th parameter solution for getContents() I gave earlier hardly qualifies as "rewriting the entire core". A minor tweak with a fallback mechanism. One of the advantages of bridges calling getContents() rather than pulling the data using their own custom functions.

My point is that the current implementation of getContents uses the libcurl API directly; it doesn't use the curl command. So all the curl options currently set on the API would have to be passed as command arguments. That's what I mean by "rewriting the core": it's not easy to translate everything. I think some options might not even be exposed outside the API.

> AFAIK replacing the lib will make php-curl use the replaced lib. Both should be functionally the same, the only changes seem to be in metadata. So whats the harm?

Yes, that's what I think too.

One other thing: at first I thought the CURL_IMPERSONATE option was only considered when the library is loaded, so you couldn't just switch it as an option to getContents or something. I actually double-checked, and it is possible to change this at runtime, using curl_easy_impersonate:

https://github.com/lwthiker/curl-impersonate#libcurl-impersonate

somini commented 2 years ago

> I actually double-checked, and it is possible to change this at runtime, using curl_easy_impersonate:
>
> https://github.com/lwthiker/curl-impersonate#libcurl-impersonate

In this case, I think it is possible to implement as a 6th parameter. Instead of setting the CURL_IMPERSONATE env variable, you call curl_easy_impersonate with the given option. The default could be not calling the function, which means it's vanilla curl.

dugite-code commented 2 years ago

Using the pull request submitted by @wrobelda is working quite well for me in most cases. I no longer have to deal with the heavy and hacky FlareSolverr.