Improve "bot-detection" evasion techniques

stevenengland commented 9 months ago

I am opening this issue because https://github.com/dgtlmoon/changedetection.io/discussions/1979 was deleted apparently without a final state (rejected/accepted).

Observation: Pure changedetection (no paid services, no proxies in place, ...) is increasingly unable to scrape websites, because sites can detect that CD is a bot crawling them. There are potential countermeasures out there that need to be evaluated. More details were in the discussion mentioned above.

If this feature request is rejected, it automatically means that users will increasingly be forced to use paid services. Which is fine, but I would like to know where this repository is heading.

Thanks in advance :)

So there are four sides to this:

1. The browser-side fingerprint (user agent, Sec-CH-UA client-hint headers, and all the other fingerprinting such as the GPU card, etc.). Note: setting just the 'User-Agent' header is no longer enough, because modern browsers also report it through the Sec-CH-UA headers and the internal JS navigator object!! https://filipvitas.medium.com/how-to-set-user-agent-header-with-puppeteer-js-and-not-fail-28c7a02165da
2. "Smart" workarounds for getting around Cloudflare (with some head-ful browser that first grabs the right pass-through cookies etc.)
3. The fingerprint of the actual connection below the HTTP layer (the TLS handshake), look up JA3 https://github.com/LyleMi/ja3proxy
4. Final extra bit: the reputation of your IP address
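
(Editor's sketch, not from the thread.) The first point can be illustrated with a small consistency check: if a fetcher overrides `User-Agent` but leaves `Sec-CH-UA` untouched, the two headers disagree, which is one signal a site could use. The function names below are hypothetical, and the check only covers Chromium-style headers; real detection systems likely look at far more than this.

```python
import re


def chrome_major_from_user_agent(user_agent):
    """Return the Chrome/Chromium major version found in a User-Agent string, or None."""
    match = re.search(r"(?:Chrome|Chromium)/(\d+)", user_agent)
    return int(match.group(1)) if match else None


def brand_versions_from_sec_ch_ua(sec_ch_ua):
    """Parse a Sec-CH-UA value such as '"Chromium";v="124", ...' into {brand: major}."""
    pairs = re.findall(r'"([^"]+)";\s*v="(\d+)"', sec_ch_ua)
    return {brand: int(version) for brand, version in pairs}


def headers_look_consistent(headers):
    """True if the User-Agent major version matches the 'Chromium' brand in Sec-CH-UA."""
    lowered = {name.lower(): value for name, value in headers.items()}
    ua_major = chrome_major_from_user_agent(lowered.get("user-agent", ""))
    brands = brand_versions_from_sec_ch_ua(lowered.get("sec-ch-ua", ""))
    if ua_major is None or "Chromium" not in brands:
        return False  # not enough information to call the pair consistent
    return brands["Chromium"] == ua_major
```

A watch whose custom headers fail this check is probably worth fixing before looking at anything more exotic.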

dgtlmoon commented 9 months ago

thanks @stevenengland, it was deleted because of the pushy, disrespectful comments from some people using this software that we all love

I want to remind everyone that the software is fully open source and I'm open to any new PRs that you want to submit, if you want

However, being a bully towards me, demanding things from me, when I don't know you, when I'm donating my free software to help you in your daily life, when I'm providing free support in my own time right here on GitHub, will absolutely not be tolerated and you will be banned from posting in this project on GitHub

dgtlmoon commented 9 months ago

On the topic of "IMPLEMENT XYZ PLUGIN THAT I FOUND ON GITHUB!!"

I'm open to it - but i've tried all those plugins and I am unable to see any improvement in reducing error rates when you use the same IP address

please, try to think logically about it and find some way to prove to me that XYZ plugin reduces error/access rates other than just making demands that I do something for you, for free, without any evidence that it's going to help

> is more and more incapable of scraping websites. Because sites have the ability to detect that CD is a bot crawling the site.

it is not directly changedetection's fault, the anti-robot (yes remember you ARE USING A ROBOT) protection across the internet is getting stronger and stronger, and companies are investing hundreds of millions of dollars into detecting automated browsers (robots), and I am just one free software project on the internet

please remember this

dgtlmoon commented 9 months ago

Part 3, please remember that BrightData are the leading proxy providers who have also invested millions of USD into solving the browser fingerprint problem, if the site is so important to you then you really should - for now - consider their offers

Scraping Browser is the most powerful way

https://changedetection.io/tutorial/using-bright-datas-scraping-browser-pass-captchas-and-other-protection-when-monitoring

Followed by Residential Proxies (not cheap datacentre proxies!)

https://brightdata.com/integration/changedetection

once again - BrightData have spent millions of USD solving this problem, and companies like CloudFlare have also invested 100's of millions of USD into blocking robots such as changedetection

Please remember this. please do not just make random demands that I spend my own personal time, for you, for free to implement some project that you found on google - that you most likely do not understand, with zero evidence that the project may or may not help

stevenengland commented 9 months ago

Hi again, no offense, I stepped into the thread late and I hope that I am not the person you recognized to be pushy because that was not my intention. And because I know the thread I must say, I personally also did not find the other comments really pushy (also not thaaat nice but also not pushy or rude) but of course you may feel differently when reading them.

Anyway: Let me rephrase my intention: I thought that it is one of the main goals of the project to also provide "stealth-ability". And if so I wanted to remark, that this goal can't be achieved for more and more sites anymore except if you use paid services out there. If "stealth-ability" is not a goal it is fine as well as it is fine if you say you do not have the resources for it. But in the sense of a feature request: I just want to know if you do not want to follow the path or just don't have the resources. I understood: You would be open for this but don't have the resources, would appreciate PRs. So the FR here could be left open?

dgtlmoon commented 9 months ago

> I understood: You would be open for this but don't have the resources, would appreciate PRs. So the FR here could be left open?

yes, but only where you can prove that the PR helped you, without changing your IP address
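
(Editor's sketch, not from the thread.) One way to produce that kind of evidence is to fetch the same URLs from the same IP with and without the change, record the HTTP status codes, and compare block rates. The set of status codes treated as "blocked" is an assumption here, and a site may also block with a 200 response that carries a captcha page, which this sketch would miss.

```python
# Assumption: these status codes are treated as "blocked by bot protection".
BLOCK_STATUSES = {403, 429, 503}


def block_rate(status_codes):
    """Fraction of responses whose status code looks like a block."""
    if not status_codes:
        raise ValueError("need at least one status code")
    blocked = sum(1 for code in status_codes if code in BLOCK_STATUSES)
    return blocked / len(status_codes)


def compare_block_rates(baseline_codes, candidate_codes):
    """Compare two runs made from the SAME IP address against the same URLs."""
    baseline = block_rate(baseline_codes)
    candidate = block_rate(candidate_codes)
    return {
        "baseline": baseline,
        "candidate": candidate,
        "improvement": baseline - candidate,  # positive means fewer blocks
    }
```

For example, `compare_block_rates([403, 403, 200, 403], [200, 200, 403, 200])` reports a baseline rate of 0.75 against a candidate rate of 0.25. A real comparison would need many more requests spread over time before it says anything reliable.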

stevenengland commented 9 months ago

That would be my requirement as well. Because there are bot detections out there that block my request from behind a dynamic IP at the very first attempt at crawling a site with CD, whereas from a real browser behind the same dynamic IP all subsequent calls of the page succeed. So there is clearly a way of fingerprinting CD browsers without even using the IP information.

dgtlmoon commented 9 months ago

mentioned here a long time ago, but haven't got a PR yet https://github.com/dgtlmoon/changedetection.io/issues/1930

stevenengland commented 9 months ago

Looks interesting. Thanks for the hint.

jlhjlh commented 9 months ago

I've noticed this too. I'm getting banned using multiple different IPs so it is fingerprinting the CD browser somehow.

If there's a way for me to help, I'd be happy to.

Perhaps switching my Playwright container to use something like this? https://github.com/CheshireCaat/playwright-with-fingerprints

unixfox commented 9 months ago

I just used my own anti-bot bypass solution: https://github.com/unixfox/pupflare. It works flawlessly with changedetection.

The only issue is that the HTML code given to the client is a bit broken so you get a page without CSS.

dgtlmoon commented 9 months ago

> Perhaps switching my Playwright container to use something like this? https://github.com/CheshireCaat/playwright-with-fingerprints

This is JS-only :( so it's not possible to use it. The solution needs to be something that can work with Python. I already started a port, https://github.com/dgtlmoon/pyppeteerstealth, of the existing https://www.npmjs.com/package/puppeteer-extra-plugin-stealth project, which will work with changedetection.io

please read the https://github.com/dgtlmoon/pyppeteerstealth page

the other part that is NOT yet solved is the JA3 fingerprinting of the actual TLS connection behaviour of the operating system and browser......

So there are four sides to this:

1. The browser-side fingerprint (user agent, Sec-CH-UA client-hint headers, and all the other fingerprinting such as the GPU card, etc.)
2. "Smart" workarounds for getting around Cloudflare (with some head-ful browser that first grabs the right pass-through cookies etc.)
3. The fingerprint of the actual connection below the HTTP layer (the TLS handshake), look up JA3 https://github.com/LyleMi/ja3proxy
4. Final extra bit: the reputation of your IP address
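
(Editor's sketch, not from the thread.) For readers unfamiliar with JA3: it summarises the TLS ClientHello as a string of five comma-separated fields (TLS version, cipher suites, extensions, elliptic curves, point formats), with the values inside each field written in decimal and joined by dashes, and then takes the MD5 of that string. GREASE placeholder values are dropped first. The sketch below only builds the string and hash from already-parsed numbers; extracting those numbers from a live handshake is out of scope, and the function names are hypothetical.

```python
import hashlib

# GREASE placeholder values (0x0A0A, 0x1A1A, ..., 0xFAFA) are ignored by JA3.
GREASE_VALUES = {(i << 12) | 0x0A00 | (i << 4) | 0x0A for i in range(16)}


def ja3_string(tls_version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input string from already-parsed ClientHello fields."""

    def join(values):
        return "-".join(str(v) for v in values if v not in GREASE_VALUES)

    return ",".join(
        [str(tls_version), join(ciphers), join(extensions), join(curves), join(point_formats)]
    )


def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """MD5 hex digest of the JA3 string; this is the value sites compare against."""
    text = ja3_string(tls_version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()
```

Because the hash depends on the order and contents of the cipher and extension lists, an automated client tends to produce a different hash from a desktop browser even when every HTTP header is identical, which is why header changes alone cannot address this layer.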

addicted-ai commented 9 months ago

This is something that bypasses Cloudflare challenges. It works pretty well with Jackett https://github.com/FlareSolverr/FlareSolverr/blob/master/src/flaresolverr_service.py#L252

jlhjlh commented 9 months ago

I actually use that now with my arrr apps. Thanks for your comment but I don’t think that’s going to work in this use case.

iG8R commented 8 months ago

Hello. In the following discussion https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/1388#issuecomment-1823730675 regarding the CloudFlare detection such a project was mentioned https://github.com/g1879/DrissionPage. Maybe it's worth taking a look at it?

PS. Last week was horrible - almost all my watches stopped working due to the CloudFlare captcha:( I tried working through both Chromedriver and Playwright, but no luck so far.

PPS. BTW, please, correct the following on https://github.com/dgtlmoon/changedetection.io/wiki/Playwright-content-fetcher

Docker Compose based In docker-compose.yml uncomment PLAYWRIGHT_DRIVER_URL under environment, and the playwright-chrome section under services.

to

Docker Compose based In docker-compose.yml uncomment environment and PLAYWRIGHT_DRIVER_URL under it, and the playwright-chrome section under services.

siparker commented 7 months ago

One thing to add to this. Not sure if there is a way to do it automatically, but whenever I had issues with Cloudflare blocking any bots I used to find the direct IP of the server behind Cloudflare and put that into the local hosts file, so requests went direct to the end server and not via Cloudflare. I'm sure this is potentially blocked now in some form or another, but it still works quite well for some sites for me.

jlhjlh commented 7 months ago

How did you find the direct IP?

drabgail commented 6 months ago

dgtlmoon great project first and foremost.

For everyone else, I switched over to sockpuppeteer and had 10 or more watches give me varying 4xx errors. I guess it is profiled much more easily, but I just pulled headers from my actual browser to include on any watches which were failing. Changing headers has solved them all. I am not requesting the same URL more often than once every 10 minutes, though. You definitely want proxies if you are hammering them.
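
(Editor's sketch, not from the thread.) The "no more than once every 10 minutes per URL" rule can be expressed as a small per-URL throttle. The class name and the injectable clock are hypothetical and exist only to keep the example self-contained; changedetection.io has its own recheck scheduling, which this does not describe.

```python
import time


class PerUrlThrottle:
    """Allow a fetch of a given URL only if enough time has passed since the last one."""

    def __init__(self, min_interval_seconds=600.0, clock=time.monotonic):
        self.min_interval_seconds = min_interval_seconds
        self._clock = clock  # injectable so the behaviour can be tested without sleeping
        self._last_fetch = {}

    def allow(self, url):
        now = self._clock()
        last = self._last_fetch.get(url)
        if last is not None and now - last < self.min_interval_seconds:
            return False  # too soon, skip this fetch
        self._last_fetch[url] = now
        return True
```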

iG8R commented 6 months ago

@drabgail Could you please elaborate what sockpuppeteer you use?

drabgail commented 6 months ago

> @drabgail Could you please elaborate what sockpuppeteer you use?

Just the one from the docker compose on the changedetection repo: dgtlmoon/sockpuppetbrowser:latest. When provided with headers it works on just about everything.

I was using browserless/chrome:latest before. It worked better, didn't need any additional tweaks and was almost never flagged. I started getting 'websocket closed' errors which I couldn't debug last week and noticed the repo was showing a different container to use so switched.

iG8R commented 6 months ago

@drabgail Thanks a lot!

dgtlmoon commented 6 months ago

If you can paste which headers+values you used that solved the access problems, that would be super nice!

drabgail commented 6 months ago

I went here on my browser (just current version of edge, don't judge)... https://www.supermonitoring.com/blog/check-browser-http-headers/

...and copied whatever I got for these headers:

Accept:
User-Agent:
Content-Type:
Upgrade-Insecure-Requests: 1
Sec-Ch-Ua-Platform:
Sec-Ch-Ua-Mobile:
Sec-Ch-Ua:
Cache-Control: max-age=0
Accept-Encoding:

I'll give you my exact headers if you want, but it's probably best to keep the variation and have people use whatever their setup provides, as this is more of a 'real user' representation. If you're planning to incorporate this as more than just the same headers for everyone, then maybe a 'copy my headers' button somewhere on either the main settings or the per-watch request settings? The browser would of course know them.
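
(Editor's sketch, not from the thread.) Headers copied from a header-echo page usually arrive as a block of `Name: value` lines. The hypothetical helper below turns such a block into a dictionary and skips names that were pasted without a value, which makes it easy to check what was actually copied before pasting it anywhere. It assumes one header per line.

```python
def parse_header_block(text):
    """Parse lines of 'Name: value' into a dict, ignoring blank or value-less lines."""
    headers = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, since values may contain colons themselves.
        name, _, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if name and value:
            headers[name] = value
    return headers
```

Keeping each person's own browser values, as suggested above, means the resulting dictionary differs from user to user rather than becoming one shared signature.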

siparker commented 5 months ago

> How did you find the direct IP?

I would search the DNS change history for the records from before they activated Cloudflare. Also search for any subdomains that might be on the same server but for which Cloudflare is not active: ftp., dev., cpanel., webmail.

If it's WordPress, there was a pingback technique you could use to get the website to reveal its IP as well. I'll try to find the info on that if I still have it saved somewhere.

just a few examples.

amdjml commented 1 month ago

> I went here on my browser (just current version of edge, don't judge)... https://www.supermonitoring.com/blog/check-browser-http-headers/
>
> ...and copied whatever I got for these headers: Accept: User-Agent: Content-Type: Upgrade-Insecure-Requests: 1 Sec-Ch-Ua-Platform: Sec-Ch-Ua-Mobile: Sec-Ch-Ua: Cache-Control: max-age=0 Accept-Encoding:
>
> I'll give you my exact headers if you want but it's probably best to keep the variation and have people use whatever their setup provides as this is more a 'real user' representation. If you're planning to incorporate this more than just the same headers then maybe a 'copy my headers' button somewhere on either the main settings or per watch request settings?.. The browser would of course know them.

Can you tell me how you added these or where you added these headers?

dgtlmoon / changedetection.io

Improve "bot-detection" evasion techniques #2198