campo1312 / DataDome

How to detect, block and manage DataDome.

Not a bypass #6

Open sleeyax opened 2 years ago

sleeyax commented 2 years ago

This 'method' is not a bypass: the captcha is part of DataDome, so solving it is simply the regular way to get past it. The title should therefore be 'How to get past it?' or 'How to solve it?' rather than 'How to bypass it?'.

Granitosaurus commented 2 years ago

Yeah, this repo is misleading and dated. I've been doing a lot of research in this area, and I'd like to share what I've found:


To avoid anti-scraping services like DataDome, we should first understand how they work, which really boils down to three categories of detection:

  1. IP address
  2. Javascript Fingerprint
  3. Request details

Services like DataDome use these signals to calculate a trust score for every visitor. A low score means you're likely a bot, so you'll either be asked to solve a captcha or be denied access entirely. So, how do we get a high score?

IP Addresses / Proxies

For IP addresses, we want to distribute our load through proxies. There are several kinds of IP addresses, each carrying a different default level of trust:

* Datacenter IPs (cloud and hosting providers) are the cheapest but the least trusted, since very few real users browse from them.
* Residential IPs belong to home internet connections and carry much higher trust.
* Mobile IPs come from cellular carriers and are trusted the most, since many real users share each address.

So, to maintain a high trust score, our scraper should rotate through a pool of residential or mobile proxies.
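A minimal sketch of that rotation, assuming plain `requests` and a list of proxy endpoints from whichever provider you use (the proxy URLs below are placeholders):

```python
import random
import requests

# Hypothetical pool of residential/mobile proxy endpoints
# (replace with the ones your proxy provider gives you).
PROXY_POOL = [
    "http://user:pass@residential-proxy-1.example.com:8000",
    "http://user:pass@residential-proxy-2.example.com:8000",
    "http://user:pass@mobile-proxy-1.example.com:8000",
]

def fetch(url: str) -> requests.Response:
    # Pick a different proxy for each request so the load is
    # spread across many IP addresses.
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

response = fetch("https://www.example.com/")
print(response.status_code)
```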

Javascript Fingerprint

This topic is way too big to cover fully in a single comment, but let's do a quick summary.

Websites can use Javascript to fingerprint the connecting client (the scraper), as Javascript leaks an enormous amount of data about it: operating system, supported fonts, visual rendering capabilities, etc.

So, for example: if DataDome sees a bunch of Linux clients connecting through 1280x720 windows, it can simply deduce that this sort of setup is likely a bot and give everyone with these fingerprint details a low trust score.

If you're using Selenium to get past DataDome, you need to patch many of these holes to get out of the low-trust zone. This can be done by patching the browser itself to fake fingerprinted details like the operating system.
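Here's a rough sketch of one such patch, hiding the `navigator.webdriver` flag and overriding a couple of commonly fingerprinted values via a Chrome DevTools command before any page script runs. The exact values you spoof (platform, languages, window size) are assumptions for illustration, and this covers only a tiny fraction of what real fingerprinting checks:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
# Use a realistic window size instead of the automation default.
options.add_argument("--window-size=1920,1080")

driver = webdriver.Chrome(options=options)

# Inject a script that runs before every page's own Javascript,
# overriding a few commonly fingerprinted properties.
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {
        "source": """
            Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
            Object.defineProperty(navigator, 'platform', {get: () => 'Win32'});
            Object.defineProperty(navigator, 'languages', {get: () => ['en-US', 'en']});
        """
    },
)

driver.get("https://www.example.com/")
```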

For more on this, see my blog post How to Avoid Web Scraping Blocking: Javascript.

Request Details

Finally, even if we have loads of IP addresses and patch our browser so it stops leaking key fingerprint details, DataDome can still give us low trust scores if our connection patterns are unusual.

To get around this, our scraper should scrape in non-obvious patterns. It should connect to non-target pages like the website's homepage once in a while to appear more human-like.
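A rough sketch of that idea, reusing the `fetch` helper from the proxy example above; the 10% homepage-visit chance and the delay range are arbitrary assumptions, not tuned values:

```python
import random
import time

HOMEPAGE = "https://www.example.com/"  # hypothetical target site

def scrape(product_urls):
    for url in product_urls:
        # Occasionally hit a non-target page, like the homepage,
        # to look less like a single-purpose crawler.
        if random.random() < 0.1:
            fetch(HOMEPAGE)

        fetch(url)

        # Wait a random, human-ish amount of time between requests
        # instead of hammering the site at a fixed interval.
        time.sleep(random.uniform(2.0, 8.0))
```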


Now that we understand how our scraper is being detected, we can start researching how to get around it. Selenium has a big community, and the keyword to look for here is "stealth". For example, selenium-stealth (and its forks) is a good starting point for patching Selenium's fingerprint leaks.
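Typical usage looks roughly like this (the parameter values are the ones commonly shown in the project's README; check the library itself for the current API):

```python
from selenium import webdriver
from selenium_stealth import stealth

options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")

driver = webdriver.Chrome(options=options)

# Apply selenium-stealth's patches before navigating anywhere,
# so the fingerprinted values are already overridden.
stealth(
    driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

driver.get("https://www.example.com/")
```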

Unfortunately, this area of scraping is not very transparent, since DataDome can simply collect publicly known patches and adjust its service accordingly. This means you either have to figure out a lot of it yourself or use a web scraping API that does it for you, if you want to scrape protected websites past the first few requests.

I've fitted as much as I can into this comment, so for more information see my series of blog articles on this issue: How to Scrape Without Getting Blocked.