

asynchronous-web-scraping #1

Open utterances-bot opened 2 years ago

utterances-bot commented 2 years ago

Scrapecrow - Asynchronous Web Scraping: Scaling For The Moon!

Educational blog about web-scraping, crawling and related data extraction subjects

https://scrapecrow.com/asynchronous-web-scraping.html

al22xx commented 2 years ago

Thank you for this blog, great work! You've also tried to help me on Reddit, which I really appreciate. As I'm new to web scraping in general and still trying to find the best method, my question is: even though asynchronous scraping runs at lightning speed without putting too much pressure on the server (I expect), how can I avoid being identified as a bot and blacklisted from the site? Most of the documentation I've read so far suggests adding a time.sleep() (or even a random time.sleep()) to try to mimic a human - so how do you avoid being blocked by a website when scraping asynchronously? I'd appreciate an answer with some actual code, e.g. the headers you use, please.

Granitosaurus commented 2 years ago

@al22xx you've just stumbled onto the biggest subject in this particular medium! The big problems of web scraping are scaling and avoiding bot detection.

To quickly summarize it: some websites want to serve pages only to humans, not bots - so how can they tell the difference between the two?

One way is that real users usually execute the javascript included in a page while bots don't (e.g. Python with the requests package doesn't run any javascript embedded in the html).
So, if you want your bot to blend in, you either have to reverse-engineer the javascript and other logic that generates requests, headers etc. and reproduce it in your bot so it mimics a normal user. This is extremely resource-efficient but takes a lot of human effort to develop. Alternatively, you can drive a real browser and let it do everything for you. This is really inefficient, as browsers are heavy and complex, but it's easy to develop since no reverse-engineering knowledge is required. In other words, you can either reverse-engineer how the website works or emulate everything with Chrome or Firefox (packages like Playwright, Puppeteer, Selenium).
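Since you asked for code: here's a minimal sketch of the first approach using httpx, an async HTTP client. The header values are only illustrative, not a guaranteed-to-pass set - in practice you'd copy the exact headers your own browser sends (devtools "Network" tab) and keep them consistent with each other:

```python
import asyncio

import httpx

# Example browser-like headers - values are illustrative; copy the exact
# set your real browser sends for better blending in
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

async def fetch(url: str) -> str:
    # every request made by this client will carry the headers above
    async with httpx.AsyncClient(headers=BROWSER_HEADERS, follow_redirects=True) as client:
        response = await client.get(url)
        response.raise_for_status()
        return response.text

if __name__ == "__main__":
    html = asyncio.run(fetch("https://scrapecrow.com/asynchronous-web-scraping.html"))
    print(len(html))
```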

Another way to detect bots is to track their IPs. IPs come in varying quality: datacenter IPs, residential IPs and mobile IPs, in ascending order of trust. So your bot might need to use proxies to avoid being identified as one power-user (no person can visit 1,000 pages a minute, right?)
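A rough sketch of what simple proxy rotation can look like with the same httpx client (the proxy URLs below are hypothetical placeholders - real addresses would come from your proxy provider; also note that recent httpx releases take a `proxy=` argument while older ones used `proxies=`):

```python
import asyncio
import itertools

import httpx

# Hypothetical proxy pool - replace with addresses from a real proxy provider
PROXY_POOL = itertools.cycle([
    "http://user:pass@proxy-1.example.com:8000",
    "http://user:pass@proxy-2.example.com:8000",
])

async def fetch_via_proxy(url: str) -> str:
    # route each request through the next proxy in the pool
    async with httpx.AsyncClient(proxy=next(PROXY_POOL)) as client:
        response = await client.get(url)
        return response.text

if __name__ == "__main__":
    print(asyncio.run(fetch_via_proxy("https://httpbin.org/ip")))
```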

So to summarize - there are two main ways for targets to track clients: their javascript execution and their connection IP. There are a lot of ways to deal with this and it entirely depends on your project. Recently, I've been teaming up with the folks at https://scrapfly.io which offers a middleware service for exactly that: ensuring your http requests are undetected and reliable. There's a free plan - check it out! :)

al22xx commented 2 years ago

Thank you for your response. I have to admit I wrote that question a while ago and have been meaning to ask you for some time. I know you mention a couple of methods here, throttling & leaky bucket, but as a novice I was wondering why we come up with a super fast scraping method only to then have to slow it down so as not to be detected. I will look into https://scrapfly.io, thank you for all your great work! If I were you I'd put this article on https://medium.com/ and make some money...

Granitosaurus commented 2 years ago

@al22xx regarding the throttling issue - it's much easier to scale something down than to scale it up. Think of it this way: it's easy to control how much you eat when you have an unlimited supply of food; if you only have a little bit of food, well then, you're just starving!

Some websites are bot-friendly and we can really push it; some aren't and we must slow down for optimal performance. Also, this rate is often not set in stone: for example, a lot of websites have scaling anti-bot protection that is more aggressive during peak hours (daytime) and less aggressive during off hours (like night time) - so if we're smart with our scraping strategy we can push our speeds quite a bit! See the sketch below for one way to keep that speed dial in our own hands.
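As an illustration of what this throttling can look like in asyncio code - a minimal sketch combining a semaphore with a per-request delay, roughly in the spirit of the leaky bucket mentioned earlier (the concurrency limit and delay are arbitrary example values, not recommendations):

```python
import asyncio

import httpx

CONCURRENCY = 5   # max requests in flight at once (example value)
DELAY = 1.0       # seconds to hold the slot after each request (example value)

semaphore = asyncio.Semaphore(CONCURRENCY)

async def throttled_get(client: httpx.AsyncClient, url: str) -> str:
    # the semaphore caps concurrency and the sleep spaces requests out,
    # which together behave like a simple leaky-bucket style limiter
    async with semaphore:
        response = await client.get(url)
        await asyncio.sleep(DELAY)
        return response.text

async def main(urls: list[str]) -> list[str]:
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*(throttled_get(client, url) for url in urls))

if __name__ == "__main__":
    pages = asyncio.run(main(["https://scrapecrow.com"] * 10))
    print(f"fetched {len(pages)} pages")
```

Tweaking CONCURRENCY and DELAY at runtime (e.g. by time of day) is all it takes to speed up or slow down the whole scraper.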

As for medium.com - unfortunately it doesn't pay enough to relinquish styling and publishing control just yet, but thanks for the kind words and the suggestion! :)