berstend / puppeteer-extra

💯 Teach puppeteer new tricks through plugins.
https://extra.community
MIT License

[Bug] Sites with hard Cloudflare challenges do not work with `browser.newPage()`, and calling `browser.pages()` disrupts the tabs! #832

Open NabiKAZ opened 10 months ago

NabiKAZ commented 10 months ago

The site https://www.000webhost.com/cpanel-login has a hard Cloudflare challenge and does not open normally, so I used the puppeteer-extra-plugin-stealth plugin.

The site will not open in a tab created with browser.newPage(). (When we tick the checkbox, we hit the same challenge page again.) But in the first default tab, which is always open, the site opens! (We tick the checkbox and the site opens.) This is strange so far, but it gets stranger.

So I tried to use that same tab without newPage(). I first got the pages with var pages = await browser.pages(); and then opened the site in the first tab with pages[0].goto. But this time the site didn't open! (I mean the Cloudflare challenge fails to resolve.)

It looks like the site doesn't open when I call newPage(). And once the pages() method is called, all tabs are disrupted for this site. (Even the first tab, which can normally open it.)

I was confused and I think there's a bug here.

Sample code:

import puppeteerExtra from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

const puppeteer = puppeteerExtra.use(StealthPlugin());

const browser = await puppeteer.launch({ headless: false });
// const page = await browser.newPage(); // a new tab never passes the challenge

// Merely calling pages() is enough to break the default tab as well
const pages = await browser.pages();
await pages[0].goto('https://www.000webhost.com/cpanel-login');

Versions:

node v19.7.0
puppeteer@21.1.1
puppeteer-extra@3.3.6
puppeteer-extra-plugin-stealth@2.11.2
Chrome Version 116.0.5845.141

Video:

https://github.com/berstend/puppeteer-extra/assets/246721/f527b517-5fb5-4941-8c81-1bd37d9b1046

mowatermelon commented 10 months ago

You can give this a try: pass userDataDir when launching, for example await puppeteer.launch({ userDataDir: path.join(os.homedir(), '.aaa-data') }). This adds a unified cache folder, so all tabs that navigate to the same address can share a cache.

NabiKAZ commented 10 months ago

@mowatermelon Thanks for your answer.

It wasn't bad as a temporary trick. But it has problems and of course the bug still exists.

By setting userDataDir, we can maintain the previous status and after solving the challenge manually, the next time Chrome opens, we can open the site without the challenge page.

But when the Cloudflare session cookie expires, everything goes back to the way it was before. If we use newPage(), for example, the challenge will not be solved. And if we reach the first tab through (await browser.pages())[0], everything breaks there too and the site's challenge cannot be solved at all.

Unless we repeat the trick again, i.e. temporarily remove the call to pages() and run Chrome once to pass the challenge. Then return that function to our code.

In general, it did not cure much pain!

NoeelGz commented 9 months ago

same problem :(

wlc108 commented 9 months ago

I'm experiencing the same thing on other sites. From my analysis, what seems to be happening is that the initial tab is "untouched" by Puppeteer, so I can do whatever I want in the initial tab and I'm not detected. If I have Puppeteer open its own tab and I manually take all actions in that tab, I get detected as a bot. Alternatively, if I have Puppeteer make the initial tab active and then take manual action, I'm also detected as a bot.

So it seems any tab that gets "touched" by Puppeteer, gets detected somehow. When I look at Browser Fingerprinting, this seems to be the case. I'm not sure what they're using to detect Puppeteer even with stealth enabled.

NodePuppeteer commented 9 months ago

creepjs (https://abrahamjuliot.github.io/creepjs/) is detecting puppeteer when you use any tab besides the start-up tab.

[image]

Note that 205 lies are being detected. Here's a sample: [image]

I'm launching with the latest version of Google Chrome (Version 117.0.5938.92 (Official Build) (64-bit)) on Windows 10, using the latest Puppeteer Node.js version alongside the stealth plugin with all evasions active.

Another creepjs image: [image]

joeledwardson commented 9 months ago

Any ideas about how Cloudflare is detecting Puppeteer?

I don't know the ins and outs of Puppeteer but I know they use the chrome dev tools protocol, the same as chrome dev tools.

However if I open https://nowsecure.nl/ with chrome dev tools open I can get through fine?

NabiKAZ commented 9 months ago

Unfortunately, this problem was very serious and acute, and I had to use the FlareSolverr service. It is a proxy for bypassing Cloudflare, and it uses Selenium. I send only my first page request to it, get back just the cf_clearance cookie, set that cookie in my Puppeteer session, and continue...

wlc108 commented 9 months ago

Puppeteer can be detected by https://abrahamjuliot.github.io/creepjs/ if that helps. It's not just that it detects a bot, it detects PUPPETEER.

ergcode commented 9 months ago

> Unfortunately, this problem was very serious and acute, and I had to use the FlareSolverr service. It is a proxy for bypassing Cloudflare, and it uses Selenium. I send only my first page request to it, get back just the cf_clearance cookie, set that cookie in my Puppeteer session, and continue...

For now, this is the only solution. Puppeteer or Playwright combined with any anti-detection method cannot get past Cloudflare.

One clarification: if your IP or proxy is not on the list of suspicious ones, you may get challenge v1, which can be completed without a click when you launch with await puppeteer.launch({ targetFilter: (target) => !!target.url() }). But if your IP is suspicious, targetFilter: (target) => !!target.url() blocks access to the iframe, and the ability to manipulate the challenge checkbox disappears.
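
To see what the filter in the launch command above actually keeps, the predicate can be run on its own against stand-in objects. This is only a minimal sketch: fakeTarget is a hypothetical helper that mimics nothing but the url() method of a Puppeteer target, and the example URLs are made up. It illustrates the predicate's logic, not how Cloudflare's challenge iframe really appears to Puppeteer.

```javascript
// The predicate from the launch command above: keep a target only if
// its URL is a non-empty string.
const targetFilter = (target) => !!target.url();

// Hypothetical stand-in that mimics only the url() method of a target.
const fakeTarget = (url) => ({ url: () => url });

const targets = [
  fakeTarget('https://example.com/'), // ordinary page with a URL: kept
  fakeTarget('about:blank'),          // still a non-empty string: kept
  fakeTarget(''),                     // empty URL: filtered out
];

// targetFilter tells Puppeteer which targets it should connect to,
// so anything rejected here is left alone by Puppeteer.
const attached = targets.filter(targetFilter).map((t) => t.url());
console.log(attached);
```

Any target that reports an empty URL is skipped, which would be consistent with the observation that the challenge iframe becomes unreachable from the script, assuming that iframe is one of the skipped targets.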

joeledwardson commented 8 months ago

How is it that FlareSolverr works? Surely if Puppeteer is detected by Cloudflare then so would Selenium

ergcode commented 8 months ago

> How is it that FlareSolverr works? Surely if Puppeteer is detected by Cloudflare then so would Selenium

FlareSolverr uses undetected-chromedriver. UC uses completely different detection bypass methods, which are still difficult to replicate in puppeteer and puppeteer-extra.

zfcsoftware commented 8 months ago

Hello, https://www.npmjs.com/package/cloudflare-scraper With this library you can scrape and get cookies.

ergcode commented 8 months ago

> Hello, https://www.npmjs.com/package/cloudflare-scraper With this library you can scrape and get cookies.

Unfortunately, they have not yet implemented proxies. But this project can be copied and a proxy can be added.

joeledwardson commented 8 months ago

Any updates on this? Does anyone understand how creepjs detects Puppeteer and what could be done to avoid detection from the Puppeteer source itself?

ven0ms99 commented 7 months ago

Any update?

zfcsoftware commented 6 months ago

@Hillcow @joeledwardson @ergcode @NabiKAZ @wlc108 @mowatermelon

https://www.npmjs.com/package/puppeteer-real-browser

I had this problem too, so I had to find a solution and publish it. I found a way to make the browser look real. I connected to the browser with Puppeteer. I exported the browser and page created with Puppeteer. You can use all the functions you use with Puppeteer here. Proxy supported.

joeledwardson commented 6 months ago

This doesn't make sense to me. I am using Puppeteer to connect to Chrome on Android, just connecting through the DevTools port, and even pages not created by Puppeteer cannot get past Cloudflare as soon as they are "touched".

zfcsoftware commented 6 months ago

> This doesn't make sense to me. I am using Puppeteer to connect to Chrome on Android, just connecting through the DevTools port, and even pages not created by Puppeteer cannot get past Cloudflare as soon as they are "touched".

I tested for the constant page-refresh problem when trying to pass Cloudflare on the aforementioned sites. I have been repeatedly signing up on a different site for about 10 hours, and it uses Cloudflare's premium captcha. The package passed all of them without any problem and without my needing to touch it. I don't see any problem in the package right now except the fingerprint. Could you please try the package? I tried it on many sites, such as 000webhost and openai, and it was successful on all of them.

joeledwardson commented 6 months ago

Unfortunately, this won't work for my use case as I launch chrome on android via ADB, forward the chrome dev tools port and then connect, so no launching is involved.

That is why I am not sure this has fixed the issue: the only change I can see in the connection code is setting the user agent. If it works, that is good news though.

zfcsoftware commented 6 months ago

> Unfortunately, this won't work for my use case as I launch chrome on android via ADB, forward the chrome dev tools port and then connect, so no launching is involved.
>
> That is why I am not sure this has fixed the issue: the only change I can see in the connection code is setting the user agent. If it works, that is good news though.

The important part that makes it work is the CDP session, which starts Chrome on a port. We start a real browser on a port on the computer and connect with Puppeteer's connect feature. The user agent is set to be the same as the created browser's agent. Here are some examples of it working: https://www.youtube.com/watch?v=OgQOaVNTPa4 https://www.youtube.com/watch?v=vfzEHsoJpuw https://www.youtube.com/watch?v=iTSVrtf8xXI

I will add Linux support soon. If there is a site you want me to try, I can try it. In your case, after passing Cloudflare, the cookie can be taken and used with adb.
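
The approach described above (start a real browser on a port, then attach with Puppeteer's connect feature) might look roughly like the sketch below. The port number 9222, the profile directory, and both helper names are assumptions made for illustration and are not taken from puppeteer-real-browser. The --remote-debugging-port and --user-data-dir flags and the browserURL option of puppeteer.connect() are standard Chrome and Puppeteer features.

```javascript
// Build the command-line flags for a Chrome instance that exposes the
// DevTools protocol on a local port (hypothetical helper).
function chromeDebugArgs(port, profileDir) {
  return [
    `--remote-debugging-port=${port}`,
    `--user-data-dir=${profileDir}`,
    '--no-first-run',
  ];
}

// Build the options object for puppeteer.connect() (hypothetical helper).
function connectOptions(port) {
  return {
    browserURL: `http://127.0.0.1:${port}`,
    defaultViewport: null, // keep the browser's own window size
  };
}

const port = 9222; // assumed port
console.log(chromeDebugArgs(port, '/tmp/real-chrome-profile').join(' '));
console.log(connectOptions(port));

// Usage, assuming Chrome was already started with the flags above:
//   const browser = await puppeteer.connect(connectOptions(port));
//   const [page] = await browser.pages();
```

For Chrome on Android, the ADB port forwarding mentioned earlier in the thread would take the place of the launch step; the connect options would stay the same.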

nhhoang commented 5 months ago

You can go back to Puppeteer version 5.5.0 and all the problems will be solved; however, it is quite old.

ergcode commented 5 months ago

> You can go back to Puppeteer version 5.5.0 and all the problems will be solved; however, it is quite old.

Can I have your example package.json?

nhhoang commented 5 months ago

Set "puppeteer": "5.5.0" in your package.json; you should change to this version. I will share the details of the problem later.

bn-l commented 5 months ago

> Set "puppeteer": "5.5.0" in your package.json; you should change to this version. I will share the details of the problem later.

Why that version specifically?

steinerx commented 5 months ago

@nhhoang Can you provide more information and tutorial on how you made it work? Thank you!

steinerx commented 5 months ago

> > Set "puppeteer": "5.5.0" in your package.json; you should change to this version. I will share the details of the problem later.
>
> Why that version specifically?

I can confirm that it also works on version "puppeteer": "^9.1.1". Anything higher than that does not work for me.

smillwith61 commented 5 months ago

> > > Set "puppeteer": "5.5.0" in your package.json; you should change to this version. I will share the details of the problem later.
> >
> > Why that version specifically?
>
> I can confirm that it also works on version "puppeteer": "^9.1.1". Anything higher than that does not work for me.

Does not work for me. Did you downgrade anything else too, like the stealth package or the puppeteer-core package?

steinerx commented 5 months ago

@smillwith61 After downgrading to "puppeteer": "^9.1.1", do not use the puppeteer-stealth package. Use the first tab to do your work.

let pages = await browser.pages();
let page = pages[0];

Mohamed3on commented 4 months ago

has anyone found solutions to this? I can't bypass captcha on cloudflare anymore

Kiaala6 commented 4 months ago

> joeledwardson
>
> > Unfortunately, this won't work for my use case as I launch chrome on android via ADB, forward the chrome dev tools port and then connect, so no launching is involved. That is why I am not sure this has fixed the issue: the only change I can see in the connection code is setting the user agent. If it works, that is good news though.
>
> The important part that makes it work is the CDP session, which starts Chrome on a port. We start a real browser on a port on the computer and connect with Puppeteer's connect feature. The user agent is set to be the same as the created browser's agent. Here are some examples of it working: https://www.youtube.com/watch?v=OgQOaVNTPa4 https://www.youtube.com/watch?v=vfzEHsoJpuw https://www.youtube.com/watch?v=iTSVrtf8xXI
>
> I will add Linux support soon. If there is a site you want me to try, I can try it. In your case, after passing Cloudflare, the cookie can be taken and used with adb.

In my case, I am detected as a bot as soon as I run "await browser.pages();" through Puppeteer, i.e. whenever Puppeteer starts touching the page. I have already used puppeteer-extra-plugin-stealth, but nothing changed.

It seems that your puppeteer-real-browser does not solve an issue like mine?

nhhoang commented 4 months ago

I just asked the Puppeteer team about this, but it didn't help. You can check it here; please help if you can find a solution: https://github.com/puppeteer/puppeteer/issues/11933

joeledwardson commented 3 months ago

I have asked about this before on the Puppeteer repository, but was told that avoiding detection is not a goal of Puppeteer.

vladtreny commented 3 months ago

Cloudflare was deobfuscated and a solution was found, but you will need to change many things in your code.

As a temporary measure, use targetFilter. It disables the Cloudflare check but does not solve the root cause.

        let flags = {
            headless,
            userDataDir,
            targetFilter: target => !!target.url()
        }

If Cloudflare patches this, I will share a solution.

phapntm commented 2 months ago

> Cloudflare was deobfuscated and a solution was found, but you will need to change many things in your code.
>
> As a temporary measure, use targetFilter. It disables the Cloudflare check but does not solve the root cause.
>
>         let flags = {
>             headless,
>             userDataDir,
>             targetFilter: target => !!target.url()
>         }
>
> If Cloudflare patches this, I will share a solution.

I used targetFilter like this, but it still doesn't work; it always shows Error: 600010. Any ideas, please?

phapntm commented 2 months ago

> > joeledwardson
> >
> > > Unfortunately, this won't work for my use case as I launch chrome on android via ADB, forward the chrome dev tools port and then connect, so no launching is involved. That is why I am not sure this has fixed the issue: the only change I can see in the connection code is setting the user agent. If it works, that is good news though.
> >
> > The important part that makes it work is the CDP session, which starts Chrome on a port. We start a real browser on a port on the computer and connect with Puppeteer's connect feature. The user agent is set to be the same as the created browser's agent. Here are some examples of it working: https://www.youtube.com/watch?v=OgQOaVNTPa4 https://www.youtube.com/watch?v=vfzEHsoJpuw https://www.youtube.com/watch?v=iTSVrtf8xXI I will add Linux support soon. If there is a site you want me to try, I can try it. In your case, after passing Cloudflare, the cookie can be taken and used with adb.
>
> In my case, I am detected as a bot as soon as I run "await browser.pages();" through Puppeteer, i.e. whenever Puppeteer starts touching the page. I have already used puppeteer-extra-plugin-stealth, but nothing changed.
>
> It seems that your puppeteer-real-browser does not solve an issue like mine?

My case is the same as yours. Whenever I run "await browser.pages();", the bot is detected.

vladtreny commented 2 months ago

Can you show code that reproduces it? Preferably all in one file.

AntonPolyakin commented 2 weeks ago

Rolling back to Puppeteer version ^9.1.1 helped me and worked for a long time: simple Cloudflare validation was bypassed on any browser tab. But now the same problem has appeared on version 9.1.1. Can you tell me which version the targetFilter solution works on? It looks like I am facing this problem: https://github.com/puppeteer/puppeteer/issues/8772

ergcode commented 2 weeks ago

> Rolling back to Puppeteer version ^9.1.1 helped me and worked for a long time: simple Cloudflare validation was bypassed on any browser tab. But now the same problem has appeared on version 9.1.1. Can you tell me which version the targetFilter solution works on? It looks like I am facing this problem: puppeteer/puppeteer#8772

I can only recommend:

https://github.com/zfcsoftware/puppeteer-real-browser
https://github.com/cyrus-and/chrome-remote-interface

For a long time I tried to solve the Puppeteer + proxy + Cloudflare problem, without significant results. 50% of the proxies pass the challenge; the rest get the “I’m not a robot” button displayed in a loop.

If you only need to get cookies, then:

https://github.com/zfcsoftware/cf-clearance-scraper
https://github.com/FlareSolverr/FlareSolverr

AntonPolyakin commented 2 weeks ago

puppeteer-real-browser didn't help; it works exactly the same as regular Puppeteer and is not worth bothering with. The targetFilter solution does not solve the main problem at all: this method just skips loading pages that have redirects. That's what I was hoping for, but I misunderstood the users' posts above.

ergcode commented 2 weeks ago

> puppeteer-real-browser didn't help; it works exactly the same as regular Puppeteer and is not worth bothering with. The targetFilter solution does not solve the main problem at all: this method just skips loading pages that have redirects. That's what I was hoping for, but I misunderstood the users' posts above.

Install:

docker run -d --log-driver json-file --log-opt max-size=10m --log-opt max-file=3 --name=cfclearance  -p 3000:3000 -e PORT=3000 -e browserLimit=20 -e timeOut=30000 --restart unless-stopped zfcsoftware/cf-clearance-scraper

docker run -d --log-driver json-file --log-opt max-size=10m --log-opt max-file=3 --name=flaresolverr -p 8191:8191 -e LOG_LEVEL=info --restart unless-stopped ghcr.io/flaresolverr/flaresolverr:latest

Send a request to cf-clearance-scraper using got or request:

curl -L -X POST 'http://localhost:3000/cf-clearance-scraper' \
-H 'Content-Type: application/json' \
--data-raw '{
    "url": "https://www.domain.com/minecraft/",
    "proxy": {
        "host": "101.102.103.104",
        "port": "1234",
        "username": "login",
        "password": "password"
    }
}'

If it cannot solve the challenge, send the request to FlareSolverr:

curl -L -X POST 'http://localhost:8191/v1' \
-H 'Content-Type: application/json' \
--data-raw '{
    "cmd": "request.get",
    "url": "https://www.domain.com/",
    "maxTimeout": 30000,
    "returnOnlyCookies": true,
    "proxy": {
        "url": "http://101.102.103.104:1234",
        "username": "login",
        "password": "password"
    }
}'
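
The two curl calls above translate fairly directly to Node. The sketch below uses the fetch built into Node 18+ rather than got or request; the helper names are hypothetical, the endpoints and JSON fields are copied from the curl commands, and the proxy values are the same placeholders. It only builds the requests; the commented usage shows where they would be sent.

```javascript
// Placeholder proxy taken from the curl examples above.
const proxy = { host: '101.102.103.104', port: '1234', username: 'login', password: 'password' };

// Shared fetch() options for a JSON POST.
const jsonPost = (payload) => ({
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(payload),
});

// Request for cf-clearance-scraper (hypothetical helper name).
function scraperRequest(url) {
  return ['http://localhost:3000/cf-clearance-scraper', jsonPost({ url, proxy })];
}

// Fallback request for FlareSolverr (hypothetical helper name).
function flareSolverrRequest(url) {
  return ['http://localhost:8191/v1', jsonPost({
    cmd: 'request.get',
    url,
    maxTimeout: 30000,
    returnOnlyCookies: true,
    proxy: {
      url: `http://${proxy.host}:${proxy.port}`,
      username: proxy.username,
      password: proxy.password,
    },
  })];
}

console.log(flareSolverrRequest('https://www.domain.com/')[1].body);

// Usage inside an async function, with both services running locally:
//   const res = await fetch(...scraperRequest('https://www.domain.com/minecraft/'));
//   const data = await res.json();
```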

Then set the cookies you received in your own browser:

const result = {
    "status": "ok",
    "message": "Challenge not detected!",
    "solution": {
        "url": "https://www.domain.com/",
        "status": 200,
        "cookies": [
            {
                "domain": ".domain.com",
                "expiry": 1752087165,
                "httpOnly": false,
                "name": "name1",
                "path": "/",
                "sameSite": "Lax",
                "secure": false,
                "value": "value1"
            },
            {
                "domain": ".domain.com",
                "expiry": 1718477565,
                "httpOnly": false,
                "name": "name2",
                "path": "/",
                "sameSite": "Lax",
                "secure": false,
                "value": "value2"
            }
        ],
        "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    },
    "startTimestamp": 1718391160826,
    "endTimestamp": 1718391167716,
    "version": "3.3.19"
}

await page.setCookie(...result.solution.cookies);

The cf-clearance-scraper project is built on top of puppeteer-real-browser, so it is worth looking at how its CF bypass is implemented.
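
One detail when applying the response above: it reports each cookie's expiration as expiry, whereas Puppeteer's page.setCookie() expects the field to be called expires. A small mapping function can cover this. In this sketch toPuppeteerCookies is a hypothetical helper name, and the sample object is a trimmed copy of the response shown above.

```javascript
// Trimmed copy of the response shown above.
const result = {
  status: 'ok',
  solution: {
    cookies: [
      { domain: '.domain.com', expiry: 1752087165, httpOnly: false, name: 'name1',
        path: '/', sameSite: 'Lax', secure: false, value: 'value1' },
    ],
    userAgent: 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  },
};

// Rename `expiry` to `expires` and pass the other fields through
// unchanged (hypothetical helper).
function toPuppeteerCookies(solution) {
  return solution.cookies.map(({ expiry, ...rest }) =>
    (expiry === undefined ? rest : { ...rest, expires: expiry }));
}

const cookies = toPuppeteerCookies(result.solution);
console.log(cookies);

// Usage inside an async function with an open Puppeteer page:
//   await page.setUserAgent(result.solution.userAgent);
//   await page.setCookie(...cookies);
```

Setting the returned user agent as well is a precaution rather than a confirmed requirement here; the working assumption is that the clearance cookie is only honoured together with the user agent that obtained it.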

AntonPolyakin commented 2 weeks ago

I'll write in English so that more people can join the discussion.

I tried cf-clearance-scraper, but it didn't help. I believe it can solve Cloudflare's captcha and that this library will be useful, but only after fixing the issue with the verification page looping.

I haven't tried FlareSolverr, as I would prefer to find a solution using NodeJS and Puppeteer. I'm not sure what the problem is, but it's definitely not related to puppeteer-extra, where this issue is discussed. It's likely not even related to Puppeteer itself. Once, reverting to an earlier version of Puppeteer helped me, possibly because earlier versions used older versions of Chrome. It might be worth testing different versions of Chrome and even other Chromium-based browsers. If I find out anything, I'll be sure to write about it.

vladtreny commented 2 weeks ago

Can you provide a small example to reproduce it? Maybe they updated detection.

The Chrome team has partially fixed the problem that Cloudflare exploits. All these libs are trash.

I have a solution 😅 I will publish it if nothing else works, but you will need to rewrite a lot in your scripts.

AntonPolyakin commented 2 weeks ago

The easiest example is either the video in the very first post of this thread, or this example: https://github.com/xvrh/puppeteer-dart/issues/321#issue-2341667618 (my example, without targetFilter). The problem is that even a human can't pass verification in the browser; it starts an endless loop.

vladtreny commented 2 weeks ago

In the example you provided, https://nopecha.com/demo/cloudflare, the captcha is permanent regardless of your setup. You need to solve it to get through. We have seen this before on other sites.

However, if a human can't pass it, then they are detecting you.

I modified the code and now pass it with a click:

        import puppeteer from 'puppeteer-extra'
        import StealthPlugin from 'puppeteer-extra-plugin-stealth'

        // Disable two stealth evasions before registering the plugin
        const stealth = StealthPlugin()
        stealth.enabledEvasions.delete('iframe.contentWindow')
        stealth.enabledEvasions.delete('media.codecs')
        puppeteer.use(stealth);

        (async () => {
            const browser = await puppeteer.launch({
                headless: false,
                // @ts-ignore
                targetFilter: target => !!target.url()
            })
            // Reuse the first (start-up) tab instead of opening a new one
            const pages = await browser.pages()
            const page = pages[0]
            await page.goto('https://nopecha.com/demo/cloudflare')
        })()

AntonPolyakin commented 2 weeks ago

Unfortunately, it doesn't work for me.

puppeteer: 21.11.0
puppeteer-extra-plugin-stealth: 2.11.2
Google Chrome: 119.0.6045.105

I suppose that the solution to the problem should be found not in the libraries, but in the browser

vladtreny commented 2 weeks ago

What OS do you use?

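Try this:

        import puppeteer from 'puppeteer-extra'
        import StealthPlugin from 'puppeteer-extra-plugin-stealth'

        // Same as before, with the user-agent-override evasion disabled too
        const stealth = StealthPlugin()
        stealth.enabledEvasions.delete('iframe.contentWindow')
        stealth.enabledEvasions.delete('media.codecs')
        stealth.enabledEvasions.delete('user-agent-override')
        puppeteer.use(stealth);

        (async () => {
            const browser = await puppeteer.launch({
                headless: false,
                targetFilter: target => !!target.url()
            })
            // Reuse the first (start-up) tab instead of opening a new one
            const pages = await browser.pages()
            const page = pages[0]
            await page.goto('https://nopecha.com/demo/cloudflare')
        })()

AntonPolyakin commented 2 weeks ago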

That option doesn't work either. My OS: Windows 10, 64-bit.

vladtreny commented 2 weeks ago

Do you add any command-line args? Also provide your IP, or at least the first digits (1.2.3.x).

AntonPolyakin commented 2 weeks ago

95.164.114.43

args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',
    '--disable-accelerated-2d-canvas',
    '--no-first-run',
    '--no-zygote',
    '--single-process',
    '--disable-gpu',
    '--ignore-certificate-errors',
    '--disable-backgrounding-occluded-windows',
],

vladtreny commented 2 weeks ago

Remove the --single-process arg; Cloudflare detects it.
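
Following this advice, the flag can be filtered out of the argument list before launch. This is a minimal sketch: withoutFlags is a hypothetical helper and the list is the one posted above. Whether --single-process is really what Cloudflare keys on is the claim made in this thread, not something the code verifies.

```javascript
// Args as posted above.
const args = [
  '--no-sandbox',
  '--disable-setuid-sandbox',
  '--disable-dev-shm-usage',
  '--disable-accelerated-2d-canvas',
  '--no-first-run',
  '--no-zygote',
  '--single-process',
  '--disable-gpu',
  '--ignore-certificate-errors',
  '--disable-backgrounding-occluded-windows',
];

// Return a copy of the list without the given flags (hypothetical helper).
function withoutFlags(list, flags) {
  const blocked = new Set(flags);
  // Compare only the flag name so `--flag=value` forms are matched too.
  return list.filter((arg) => !blocked.has(arg.split('=')[0]));
}

const safeArgs = withoutFlags(args, ['--single-process']);
console.log(safeArgs);

// Usage: await puppeteer.launch({ headless: false, args: safeArgs });
```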