brianleect / etherscan-labels

Full label data dump of top EVM chains in JSON/CSV.
MIT License
256 stars, 77 forks

[Feat] [Bug] Etherscan Cloudflare bypass #18

Closed. brianleect closed this issue 1 year ago

brianleect commented 1 year ago

It seems that Etherscan might have implemented an additional layer of scraping protection. When I tried to scrape today, I was blocked by a Cloudflare page even though I was logged in. This might be a major problem.

I will need to research further to see whether it occurs often or was a one-off case.
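To check whether the block is frequent or a one-off, one option is to flag Cloudflare challenge pages in the scraped HTML and track how often they appear. The sketch below is a heuristic only: the marker strings are assumptions based on commonly seen Cloudflare challenge pages, and the function names `looks_like_cloudflare_challenge` and `block_rate` are hypothetical, not part of this repo.

```python
from typing import Iterable

# Assumed markers: strings commonly seen on Cloudflare challenge pages.
# They are a heuristic and may change or produce false positives.
CHALLENGE_MARKERS = (
    "just a moment",
    "checking your browser",
    "/cdn-cgi/challenge-platform/",
    "cf-browser-verification",
)


def looks_like_cloudflare_challenge(page_source: str) -> bool:
    """Return True if the HTML contains any of the assumed challenge markers."""
    lowered = page_source.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def block_rate(pages: Iterable[str]) -> float:
    """Fraction of pages that look like a challenge page (0.0 for no pages)."""
    pages = list(pages)
    if not pages:
        return 0.0
    blocked = sum(looks_like_cloudflare_challenge(p) for p in pages)
    return blocked / len(pages)
```

Logging `block_rate` per chain over a few runs would show whether the problem is consistent or tied to scraping volume.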

starguyaman commented 1 year ago

I see you are using plain Selenium, which is easily detected as a bot. Have you tried undetected-chromedriver or selenium-wire? They are very good at bypassing anti-bot tests.
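For reference, a minimal sketch of swapping plain Selenium for undetected-chromedriver might look like the following. It assumes `pip install undetected-chromedriver` and a local Chrome install; the helper names `make_undetected_driver` and `fetch_page_source` are hypothetical, and nothing here guarantees that the Cloudflare check will pass.

```python
from typing import Any, Callable


def make_undetected_driver() -> Any:
    """Create a Chrome driver through undetected-chromedriver."""
    # Imported lazily so fetch_page_source can be used with any driver factory.
    import undetected_chromedriver as uc

    options = uc.ChromeOptions()
    options.add_argument("--window-size=1280,900")
    return uc.Chrome(options=options)


def fetch_page_source(
    url: str, driver_factory: Callable[[], Any] = make_undetected_driver
) -> str:
    """Open url with a fresh driver, return the page HTML, and always quit."""
    driver = driver_factory()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


# Example (opens a real browser window):
# html = fetch_page_source("https://etherscan.io/")
```

Passing the driver factory as a parameter keeps the rest of the scraper unchanged, so it is easy to compare plain Selenium against the patched driver on the same URLs.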

brianleect commented 1 year ago

Thanks for the suggestion, I'll take a look at it! Feel free to open a PR with this integrated if you happen to be more familiar with these libraries.

Currently it seems that plain Selenium works for bscscan and polygonscan but not etherscan.

brianleect commented 1 year ago

I might have scraped too heavily; I'm getting the issue on bscscan now as well.

brianleect commented 1 year ago

Currently the main problem occurs consistently when scraping eth (the old scrape is done) and optimism (TBD).

starguyaman commented 1 year ago

I happen to be working on web scraping a lot these days, so I will try implementing selenium-wire or undetected-chromedriver. Hopefully that will resolve the issue. Can you assign this issue to me?

dante11235 commented 1 year ago

Hey guys. I tried using selenium-wire with residential proxies (a workaround that worked a few weeks ago). I haven't tried undetected-chromedriver. Try this - https://fingerprintjs.github.io/BotD/main/ - it could be a good start to see whether it detects your browser correctly (this helped me last time).
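A rough sketch of the selenium-wire plus proxy setup described above, assuming `pip install selenium-wire`. The proxy host, port, and credentials are placeholders, the helper names are hypothetical, and the right proxy scheme depends on the provider.

```python
from typing import Any, Dict, Optional
from urllib.parse import quote


def build_proxy_options(
    host: str,
    port: int,
    username: Optional[str] = None,
    password: Optional[str] = None,
    scheme: str = "http",
) -> Dict[str, Dict[str, str]]:
    """Build the seleniumwire_options dict for an upstream proxy."""
    auth = ""
    if username and password:
        # Percent-encode credentials so characters like '@' do not break the URL.
        auth = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    proxy_url = f"{scheme}://{auth}{host}:{port}"
    return {
        "proxy": {
            "http": proxy_url,
            "https": proxy_url,
            "no_proxy": "localhost,127.0.0.1",
        }
    }


def make_proxied_driver(proxy_options: Dict[str, Dict[str, str]]) -> Any:
    """Create a selenium-wire Chrome driver that routes traffic via the proxy."""
    from seleniumwire import webdriver  # lazy import; needs selenium-wire installed

    return webdriver.Chrome(seleniumwire_options=proxy_options)


# Example (placeholder proxy details, opens a real browser):
# driver = make_proxied_driver(build_proxy_options("proxy.example.com", 8000, "user", "pass"))
# driver.get("https://fingerprintjs.github.io/BotD/main/")  # check what the page reports
# driver.quit()
```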

starguyaman commented 1 year ago

@dante11235, that's a great website, thanks for sharing it.