berstend / puppeteer-extra

💯 Teach puppeteer new tricks through plugins.
https://extra.community
MIT License

Being detected by Distil Networks #33

Closed mjfeintuch closed 5 years ago

mjfeintuch commented 5 years ago

I am trying to automate logging on to a site that uses Distil. When I attempt to log on, I get a captcha. (screenshot: Capture)

petrpatek commented 5 years ago

I think it might be related to the typing speed. Have you tried to lower the speed of typing? I would also try to implement random delays between actions like clicking the submit button after filling out the form.
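A minimal sketch of the slower typing and random pauses suggested above. It assumes `page` is a puppeteer `Page`; the helper names (`randomDelay`, `sleep`, `typeAndSubmit`), the selectors, and the delay ranges are illustrative assumptions, and nothing here is known to be sufficient against Distil.

```javascript
// Pick a random integer delay in [minMs, maxMs]; rng is injectable for testing.
function randomDelay(minMs, maxMs, rng = Math.random) {
  return Math.floor(minMs + rng() * (maxMs - minMs + 1));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Type one character at a time with a random pause after each, then wait
// a little longer before clicking submit.
// `page` is assumed to be a puppeteer Page; the selectors are placeholders.
async function typeAndSubmit(page, inputSelector, text, submitSelector) {
  for (const ch of text) {
    await page.type(inputSelector, ch);
    await sleep(randomDelay(60, 180));
  }
  await sleep(randomDelay(400, 1200));
  await page.click(submitSelector);
}
```

Usage would look like `await typeAndSubmit(page, '#username', 'my-user', '#login')`, with selectors that match the target form.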

bjesus commented 5 years ago

I'm being detected by Distil without even typing anything, just trying to see a website.

shirshak55 commented 5 years ago

@bjesus did you use the puppeteer stealth plugin or not? And by the way, Distil Networks can use many heuristics to detect bots, so I don't think it is easy to find out which heuristic is triggering.

bjesus commented 5 years ago

Hi @shirshak55, yes, I did use the stealth plugin of course, otherwise I wouldn't report it here :)

d0peCode commented 5 years ago

I have the same issue.

shirshak55 commented 5 years ago

Check whether a real browser has the same issue. They may be using the IP to detect bot activity. Did you install any other extensions, additional fonts, etc.? And a proxy service like Luminati is unlikely to help, because Distil Networks keeps lists of known proxies and can easily tell that you are a bot.

d0peCode commented 5 years ago

In a normal browser it just displays their site.

I'm not using any proxy. I didn't install any other packages. I didn't even navigate to https://www.distilnetworks.com/ programmatically; I typed the URL manually in the puppeteer Chromium window. They instantly detect it somehow.

shirshak55 commented 5 years ago

@BorysTyminski using chromium?

shirshak55 commented 5 years ago

And do they detect you only once, or every time? Sometimes a brand-new, fresh browser profile looks suspicious by itself. Also save the user data folder, so that they see you as the same returning user.
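A small sketch of keeping the user data folder between runs. It relies on puppeteer's `userDataDir` launch option; the helper name `persistentLaunchOptions` and the directory name are assumptions for illustration.

```javascript
const path = require('node:path');

// Build puppeteer launch options that reuse one profile directory across runs,
// so cookies and local storage persist and the site sees a returning visitor.
// The default directory name is a placeholder.
function persistentLaunchOptions(profileDir = './chrome-profile') {
  return {
    headless: false,
    userDataDir: path.resolve(profileDir),
  };
}

// Usage (requires puppeteer or puppeteer-extra to be installed):
//   const browser = await puppeteer.launch(persistentLaunchOptions());
```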

d0peCode commented 5 years ago

Doesn't puppeteer use Chromium? I set headless: false, and I just pasted the URL and it instantly detected me. I have never opened this site with puppeteer without the stealth plugin, so I doubt they flagged me as a suspicious user.

shirshak55 commented 5 years ago

@BorysTyminski we can use Chrome.

And if you are using Chromium, be sure to change the user agent. Which URL is it? Give it to me, I would like to test whether it detects me too.

d0peCode commented 5 years ago

https://www.distilnetworks.com/ Let me know if they detect you as well; maybe I'm doing something wrong. In my case it looks like this:

image

I dug a bit into their website source and I think this is the test we are failing. It is minified and probably just a bundle, so it's hard to understand, but the method names are still meaningful.

Also on this site in the console in puppeteer I have some error which I don't have in normal browser:

VM226:14 Uncaught TypeError: getParameter is not a function
    at WebGLRenderingContext.getParameter (:14:18)
    at a.getWebglFp (zhrodsadknkfnugjasbebzzfzafscewueq.js:1)
    at a.webglKey (zhrodsadknkfnugjasbebzzfzafscewueq.js:1)
    at a.interrogate (zhrodsadknkfnugjasbebzzfzafscewueq.js:1)
    at zhrodsadknkfnugjasbebzzfzafscewueq.js:1

and warning:

The AudioContext was not allowed to start. It must be resumed (or created) after a user gesture on the page.

d0peCode commented 5 years ago

@shirshak55 did you check it? Do they detect you as well?

shirshak55 commented 5 years ago

They detected me on that website only; other sites did not detect me. It has to do with canvas, probably a fingerprinting issue.

d0peCode commented 5 years ago

What do you mean with:

canvas probably finger printing issue

??

I checked

document.createElement('canvas').getContext('webgl').getParameter(debugInfo.UNMASKED_VENDOR_WEBGL);

like this:

var canvas = document.createElement('canvas');
var gl = canvas.getContext('webgl');

// getContext and getExtension can both return null, so guard before reading.
var debugInfo = gl && gl.getExtension('WEBGL_debug_renderer_info');
var vendor = debugInfo ? gl.getParameter(debugInfo.UNMASKED_VENDOR_WEBGL) : null;
var renderer = debugInfo ? gl.getParameter(debugInfo.UNMASKED_RENDERER_WEBGL) : null;

but vendor and renderer are just fine:

image

I found something interesting when following this error on this site:

image

shirshak55 commented 5 years ago

@BorysTyminski why not run a real browser and use the DevTools protocol to control it?

d0peCode commented 5 years ago

@shirshak55 so if I use Chrome instead of Chromium, Distil will not detect puppeteer? Is that what you mean?

shirshak55 commented 5 years ago

@BorysTyminski try this https://github.com/shirshak55/scrapper-tools/blob/master/src/fastPage.ts#L81

And please also test against another URL protected by Distil Networks, to make sure.

Eastkap commented 5 years ago

Why not try to solve the captcha?

shirshak55 commented 5 years ago

Because the page doesn't even load, right?

d0peCode commented 5 years ago

How do you want to solve Google reCAPTCHA with 100% reliability without human intervention? It's much easier to find out how they detect webdriver than to create such an AI.

ahoura commented 5 years ago

Hey, just thought I would share my thoughts on this. Having read about their web bot mitigation product on their site (which I assume is what is being discussed here), and assuming they are not lying/exaggerating about the features of their product, this is how it works:

The JS SDK does "Hi-Def fingerprinting analyzes over 200 device attributes", and the result is sent to and digested by the Distil Networks backend. This fingerprint is compared against the attributes your browser is expected to have (e.g. IE does not support web push notifications, but Chromium does, so if you set your user agent to IE, you need to disable support for web push). So start with Chromium masquerading as Chrome (inject all the basic stuff like languages etc.), or even point your puppeteer to a Chrome instance (should work almost the same way), and see if they still detect you. I tried this method against Google reCAPTCHA v3 and it worked just fine.
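As one example of "injecting the basic stuff like languages", here is a hedged sketch. The helper name `overrideLanguages` and the chosen language list are assumptions, and this covers only one of the many attributes a fingerprinting script may read.

```javascript
// Define `languages` as a getter on a navigator-like object.
function overrideLanguages(nav, languages = ['en-US', 'en']) {
  Object.defineProperty(nav, 'languages', {
    get: () => languages,
    configurable: true,
  });
  return nav;
}

// With puppeteer this would run before any page script, roughly:
//   await page.evaluateOnNewDocument(() => {
//     Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
//   });
// Here a plain object stands in for `navigator` so the effect is visible in Node.
const fakeNavigator = {};
overrideLanguages(fakeNavigator);
console.log(fakeNavigator.languages); // prints [ 'en-US', 'en' ]
```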

shirshak55 commented 5 years ago

@BorysTyminski use 2captchas like this. https://github.com/shirshak55/scrapper-tools/blob/master/src/fastPage.ts#L34

sofdao commented 5 years ago

I wonder what you mean by "so start with chromium masquerading as chrome (inject all the basic stuff like languages etc), or even point your puppeteer to a chrome instance (should work almost the same way)"? Are there any configuration outside of this + useragent to make puppeteer copy chrome? Thank you

shirshak55 commented 5 years ago

@keimao hey, you can run a real instance of Chrome, grab its WebSocket URL, and control it over CDP :)
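A rough sketch of that approach. It assumes Chrome was started with `--remote-debugging-port=9222`, in which case it serves a `/json/version` endpoint containing `webSocketDebuggerUrl`, and it uses `puppeteer.connect` with `browserWSEndpoint`. The port, the helper names, and the binary name in the comment are assumptions.

```javascript
// Start Chrome yourself first, for example:
//   chrome --remote-debugging-port=9222
// (the binary name and port are assumptions; adjust for your system).

// Pull the DevTools WebSocket URL out of the /json/version response body.
function extractWsEndpoint(versionInfo) {
  const url = versionInfo && versionInfo.webSocketDebuggerUrl;
  if (typeof url !== 'string' || !url.startsWith('ws')) {
    throw new Error('No webSocketDebuggerUrl found; is remote debugging enabled?');
  }
  return url;
}

// Attach puppeteer to the already-running browser instead of launching Chromium.
async function connectToRunningChrome(port = 9222) {
  const puppeteer = require('puppeteer'); // loaded lazily; must be installed
  const res = await fetch(`http://127.0.0.1:${port}/json/version`); // global fetch needs Node 18+
  return puppeteer.connect({ browserWSEndpoint: extractWsEndpoint(await res.json()) });
}
```

After `const browser = await connectToRunningChrome()`, pages are opened with `browser.newPage()` as usual, and `browser.disconnect()` detaches without closing Chrome.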

ahoura commented 5 years ago

I personally do not use this plugin directly, I use it as a reference mostly to make sure I am not missing something you guys have thought of. with that said, I dont believe this plugin can solve this issue all by itself (I could be wrong tho).

The way distilnetworks performs its checks is as follows (this is how their main site's validation works; they might have different "delivery" methods for this validation, but they will all check the same things in the end):

  1. you visit distilnetworks.com
  2. a verification page is loaded. in this page, you are given 10 seconds to pass their JS validation otherwise you are sent to the captcha page.

Now, digging a bit deeper into the JS, I stumbled upon this:

audioKey:function(e){return this.options.excludeAudio?e:(e.audio=this.getAudio(),e)},getAudio:function(){var e=document.createElement("audio"),t=!1;return(t=!!e.canPlayType)&&(t=new Boolean(t),t.ogg=e.canPlayType('audio/ogg; codecs="vorbis"')||"nope",t.mp3=e.canPlayType("audio/mpeg;")

They are checking whether the browser can play certain media types. The right answer depends on the user agent you are using: let's say, hypothetically, Chrome v71 can't play audio/ogg but v72.1 can. You need to make sure your browser's features match what is expected from that browser and version, and the above code is just a snippet of what they are checking.
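The kind of comparison described here can be sketched as below. This is not Distil's code: the reference table, its keys, and its values are entirely hypothetical (they just mirror the made-up v71 / v72.1 example above), and `findInconsistencies` is an illustrative name.

```javascript
// HYPOTHETICAL reference table: which features a claimed browser should expose.
// A real detector would build this from large amounts of traffic; these values are made up.
const EXPECTED_FEATURES = {
  'chrome-71': { oggVorbis: false, mp3: true },
  'chrome-72.1': { oggVorbis: true, mp3: true },
};

// Return the names of features whose observed value contradicts the claimed browser.
function findInconsistencies(claimedBrowser, observed, table = EXPECTED_FEATURES) {
  const expected = table[claimedBrowser];
  if (!expected) return ['unknown-browser'];
  return Object.keys(expected).filter((name) => observed[name] !== expected[name]);
}

// In a real page, `observed` would come from probes such as
//   document.createElement('audio').canPlayType('audio/ogg; codecs="vorbis"')
// normalized to booleans; here it is passed in directly.
console.log(findInconsistencies('chrome-71', { oggVorbis: true, mp3: true }));
// prints [ 'oggVorbis' ]
```

An empty result means the observed features agree with the claimed user agent; any entry is a mismatch a detector could flag.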

so to answer your question

Are there any configuration outside of this + useragent to make puppeteer copy chrome?

It depends on what user agent you are using. That's why I suggested pointing your puppeteer at a real Chrome instance and not lying about your user agent. If it works, then you could try changing the temp directory to get a fresh instance of Chrome (to avoid cookies and other stuff being shared across instances) and see if it's still working. Then, based on your needs, you can start lying about small things in your user agent.

d0peCode commented 5 years ago

I dont believe this plugin can solve this issue all by itself (I could be wrong tho).

@ahoura so why do you think that this plugin cannot bypass their antibot system? If I understood your reply correctly, we just need to match all Chrome/Chromium data (settings, window properties, etc.) with the crafted user agent. So currently something is inconsistent; for example (hypothetically) we have the user agent of Chrome v71 but we can play audio/ogg, and by detecting that inconsistency they send us to reCAPTCHA, right?

shirshak55 commented 5 years ago

@BorysTyminski actually this plugin has not been updated for about 4 months, and it is fair to say that developers at Distil Networks have already seen this type of plugin, so they likely already know about its flaws.

So the best solution is to use your real browser and automate that. That way they cannot tell whether you are a real browser or a bot. And you can also automate the captcha with captcha-solving services, or not.

And fingerprinting is a real issue. Say you always use a fresh instance: Distil Networks might still know you are a bot, because they may notice that someone always arrives from a fresh instance (no cookies/sessions, same canvas, fonts, etc.) and from the same IP, so it must be a bot, even though the browser looks like real Chrome (not puppeteer's Chromium).

And I think this bot detection has become a cat-and-mouse game, and I think in the future there will be no way for Distil Networks to detect whether the browser is a bot or not :)

d0peCode commented 5 years ago

And I think this bot detection has become a cat-and-mouse game, and I think in the future there will be no way for Distil Networks to detect whether the browser is a bot or not :)

Yes, of course, it has been a cat-and-mouse game for a long time now.

So the best solution is to use your real browser and automate that. That way they cannot tell whether you are a real browser or a bot. And you can also automate the captcha with captcha-solving services, or not.

I'll try it soon.

And you can also automate the captcha with captcha-solving services, or not.

I don't really want to pay for solving recaptcha, I know it's very cheap but my code is non-profit currently.

ahoura commented 5 years ago

@ahoura so why do you think that this plugin cannot bypass their antibot system? If I understood your reply correctly, we just need to match all Chrome/Chromium data (settings, window properties, etc.) with the crafted user agent. So currently something is inconsistent; for example (hypothetically) we have the user agent of Chrome v71 but we can play audio/ogg, and by detecting that inconsistency they send us to reCAPTCHA, right?

Yes, that's the general idea. Think of it as layers, and they can always add to these layers of security. But from what I know and what I have seen in their JS, one of their primary checks is comparing your browser's features against the browser you claim to be. The reason I said 'I dont believe this plugin can solve this issue all by itself (I could be wrong tho).' is simply the way it's been written. If you want to allow a randomly generated UA, then you need a reference of what each browser is capable of, then inject those features into the context and REMOVE all the other features so as not to raise any flags... This is difficult without access to a lot of traffic to fingerprint browsers the way distilnetworks does, so that you can lie to them without being detected. Hope that explains it.

And I think this bot detection has become a cat-and-mouse game, and I think in the future there will be no way for Distil Networks to detect whether the browser is a bot or not :)

I agree with the cat-and-mouse game, but I think in the future it will be a more advanced version of the same game. For example, Distil can digest your mouse movements and your behaviour on the page, compare them with those of others on the same page, and check whether your behaviour is "regular"...

berstend commented 5 years ago

Fixed, please report if this issue still persists with the latest stealth version. :)

clickstefan commented 4 years ago

Seems it still persists, or the cat caught up with the mouse. e.g. I've used latest version and it gets detected after 4 requests.

shirshak55 commented 4 years ago

@clickstefan it can be the IP and many other things, right? Since it was not detected on the 3rd request, I think you should try with a native Chrome browser (without using a bot) and report here.

clickstefan commented 4 years ago

My use case is simple, I just load the url and get the full html.

I think they don't immediately block as they are allowing some buffer to not block valid requests with slow js or slow internet.

I did load the site with a native browser, and even tried using an in-browser crawler extension, and all works fine, hence my suspicion that the headless browser/puppeteer fingerprint is being detected somehow.

I can confirm the IP is indeed what is being blocked, as changing it unblocks the requests.

clickstefan commented 4 years ago

You can read more about their methods here: https://www.imperva.com/products/bot-management/

Imperva collects and analyzes your bot traffic to pinpoint anomalies. Our machine learning models identify real-time bad bot behavior across our network and feed it through our known violators database. Biometric data validation, such as mouse movements, mobile swipe, and accelerometer data, catches malicious botnets. Rate limits based on device fingerprints — not IPs — provide further protection.

shirshak55 commented 4 years ago

Try with a native browser. If it's an IP issue, then I don't think anybody can fix it unless you change the IP. You can easily purchase IPs from external services anyway.

d0peCode commented 4 years ago

As for VPNs, I can recommend luminati.io for residential IPs; it is easy to connect through puppeteer's launch method.

shirshak55 commented 4 years ago

@d0peCode Luminati is the worst. They asked me to show my face with my credit card lol. It was a terrible experience. And Luminati provides proxies, not a VPN.

d0peCode commented 4 years ago

Uh, I heard a lot of good things and I actually used it without trouble. Maybe stormproxies.com then. Yes, it's proxies, not a VPN.

Bllacky commented 4 years ago

I find luminati to be very expensive compared to stormproxies.

clickstefan commented 4 years ago

Thanks for the advice, I don't think it is an IP issue, as same IP works fine in the browser, it's just that it detects the headless browser. If I find something I'll let you guys know. Best regards!

shirshak55 commented 4 years ago

@clickstefan it can be an IP issue. When you run a headless browser it is empty: no cookies, etc. But when you try in a normal browser, it can be identified by its cookies, etc. So Distil Networks might think: this fingerprint looks like a fresh browser, and this IP has already shown a similar fingerprint 2-3 times, so let's show a captcha to make sure. :)

ganchuhang commented 4 years ago

@clickstefan It's usually not about the IP. DT detects the differences between a normal browser and your automated browser (puppeteer in this case).

I found that DT executes some JavaScript in our browser and sends the result back to the server. It's a long JSON, and some of its properties differ from those of a regular browser. Like my attachment below: image

What you can do is, before you hit your target page, set up puppeteer to match the regular browser's properties. One of my mistakes was the plugins property; previously I set it, but wrongly, like this: const pluginData = 'something_not_null';

But instead it has to be like this:

const pluginData = [
  { name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer', description: 'Portable Document Format' },
  { name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai', description: '' },
  { name: 'Native Client', filename: 'internal-nacl-plugin', description: '' }
];

Then follow it with Object.setPrototypeOf(formatted_plugin, Plugin.prototype);
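A sketch of how the plugin entries and the prototype step might fit together. The helper `formatPlugins` and the stand-in class `FakePlugin` are assumptions for illustration; in a browser the real `Plugin.prototype` would be passed instead, and this alone is not known to be enough to pass DT's checks.

```javascript
const pluginData = [
  { name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer', description: 'Portable Document Format' },
  { name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai', description: '' },
  { name: 'Native Client', filename: 'internal-nacl-plugin', description: '' },
];

// Copy each entry and give it the supplied prototype, so that
// `plugin instanceof Plugin` holds when pluginProto is Plugin.prototype.
function formatPlugins(data, pluginProto) {
  return data.map((entry) => Object.setPrototypeOf({ ...entry }, pluginProto));
}

// Inside the browser (e.g. in a function passed to page.evaluateOnNewDocument)
// one would call formatPlugins(pluginData, Plugin.prototype) and expose the
// result through a getter on navigator.plugins. That function is serialized,
// so the helper and data would have to be defined inside it.
// Outside a browser `Plugin` does not exist, so a stand-in class shows the effect:
class FakePlugin {}
const formatted = formatPlugins(pluginData, FakePlugin.prototype);
console.log(formatted[0] instanceof FakePlugin); // prints true
```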

Good luck!