The reason I picked Puppeteer over casual HTTP libraries is that it can interact with the page and solve captchas: with a captcha-solving service in place, we could decrease the `DELAY_BETWEEN_CHUNK_MS` and increase the `CHUNK_SIZE`. Moreover, we can remove the need for proxies and save costs. Initially, I thought being able to solve captchas alone completely outweighed the cons.

I appreciate your suggestions. Since no captcha-solving service is currently being used, I will make a `feature/http-scraper` branch to give it a go and we'll see how things turn out.
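For context, here is a minimal sketch of the chunked scheduling those two parameters control; the constant names come from the discussion above, but everything else is illustrative and may differ from the repo's actual `scraper.service.ts`:

```ts
// Illustrative only: scrape keywords in small chunks, pausing between chunks
// to avoid a burst of requests. scrapeKeyword() is a hypothetical helper.
const CHUNK_SIZE = 5;
const DELAY_BETWEEN_CHUNK_MS = 10_000;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function scrapeInChunks(
  keywords: string[],
  scrapeKeyword: (keyword: string) => Promise<void>,
): Promise<void> {
  for (let i = 0; i < keywords.length; i += CHUNK_SIZE) {
    const chunk = keywords.slice(i, i + CHUNK_SIZE);
    await Promise.all(chunk.map(scrapeKeyword)); // run one chunk concurrently
    if (i + CHUNK_SIZE < keywords.length) {
      await sleep(DELAY_BETWEEN_CHUNK_MS); // throttle before the next chunk
    }
  }
}
```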
I figured that the stack of axios, cheerio, and random-useragent works fine for sending proxified requests. However, another major drawback of this approach is that plain HTTP requests couldn't obtain the search performance stats, e.g., `About 10,000 results (0.60 seconds)`. In the screenshot below, the left side is the Axios cache and the right side is the Puppeteer cache.
I can't find an explanation for this. I assume that, being a headless browser, Puppeteer is able to retrieve a more complete HTML version of the page.
This drawback actually matters because the application requirements state that:

> For each search result/keyword result page on Google, store the following information on the first results page: the total search results for this keyword, e.g., `About 21,600,000 results (0.42 seconds)`

Since this approach clearly doesn't deliver what's expected, if I had picked HTTP libraries from the beginning I would have had to switch to other alternatives anyway. I hope we're on the same page that getting it done right is better than getting it done quickly.
As always, I'm open to other alternatives to enhance the scraping process.
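For reference on that requirement, here is a minimal sketch of how the stats line could be extracted with cheerio when it is present; the `#result-stats` selector is an assumption about Google's current markup, not something taken from the repo:

```ts
import * as cheerio from 'cheerio';

// Illustrative only: pull the "About 21,600,000 results (0.42 seconds)" line
// out of a fetched results page. The '#result-stats' selector is an assumption
// and is often absent from the HTML served to plain HTTP clients, which is
// exactly the drawback described above.
function extractResultStats(html: string): string | null {
  const $ = cheerio.load(html);
  const stats = $('#result-stats').text().trim();
  return stats.length > 0 ? stats : null;
}
```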
> try using the proxies and rotating the user agent in the request.
I tried this at the very first, combining proxies with random user agents while continually decreasing the `sleep()` delay. That didn't work out; the random user agents didn't seem to have any impact on performance. And that makes sense, because Google's main detection criterion is the IP address of the request (they take lots of other measures too, and they're not going to be public about them). Which means it all boils down to keeping the proxies from being overused.
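For illustration, a rough sketch of what that attempt could look like with axios; the proxy pool, addresses, and helper names are placeholders, not the repo's actual implementation:

```ts
import axios from 'axios';

// Illustrative only: route each request through a random proxy and let the
// caller supply a rotated User-Agent. PROXIES is a placeholder pool.
const PROXIES = [
  { host: '203.0.113.10', port: 8080 },
  { host: '203.0.113.11', port: 8080 },
];

const pickProxy = () => PROXIES[Math.floor(Math.random() * PROXIES.length)];

async function fetchSearchPage(query: string, userAgent: string): Promise<string> {
  const res = await axios.get<string>('https://www.google.com/search', {
    params: { q: query, gl: 'us', hl: 'en' },
    headers: { 'User-Agent': userAgent },
    proxy: { protocol: 'http', ...pickProxy() }, // rotate the exit IP per request
    timeout: 10_000,
  });
  return res.data;
}
```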
Thank you for the effort spent on trying another approach as suggested.
> I hope we're on the same page that getting it done right is better than getting it done quickly.
I agree that getting things done right is important.
> Since this approach clearly doesn't deliver what's expected, if I had picked HTTP libraries from the beginning I would have had to switch to other alternatives anyway.
I don't know how you implemented that, so I can't guess whether anything was wrong. But that solution, using axios with a random user agent (even without the proxy), should be able to get the search results in the expected way. I have seen other candidates take a similar approach and get the results they want. Here is a simple code example:
```ts
import { HttpService } from '@nestjs/axios';

// Pool of common desktop browser user agents to rotate through
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 12.6; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (X11; Linux i686; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (X11; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edge/106.0.1370.47',
];

// Pick one of the user agents at random for each request
function randomUserAgent(): string {
  const randomIndex = Math.floor(Math.random() * USER_AGENTS.length);
  return USER_AGENTS[randomIndex];
}

// ... inside a service with HttpService injected as this.httpService ...

const res = await this.httpService.axiosRef.get(
  `https://www.google.com/search?q=${query}&gl=us&hl=en`,
  {
    headers: {
      'User-Agent': randomUserAgent(),
    },
  },
);
const html = res.data;

// ...
```
Thank you for your sample. It turns out that the search performance stats only appear with certain types of User-Agent, which means the UAs have to be hand-picked rather than generated at random.
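One way to act on that observation (illustrative only, not necessarily what the PR does) is to rotate only within a hand-picked subset of user agents that were manually verified to return the stats line:

```ts
// Illustrative only: VERIFIED_USER_AGENTS is a placeholder for a manually
// curated subset of user agents confirmed to return the "About N results" line.
const VERIFIED_USER_AGENTS: string[] = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
];

function pickVerifiedUserAgent(): string {
  return VERIFIED_USER_AGENTS[Math.floor(Math.random() * VERIFIED_USER_AGENTS.length)];
}
```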
In #43 I've tried replacing Puppeteer with Axios and Cheerio to see how things turn out. It worked out quite significantly (for one, the `sleep()` delay is gone).

To conclude, my initial assumption about the pros of Puppeteer was wrong, i.e., being solely able to solve captchas does not outweigh the costs, considering how resource-demanding it is.
I truly appreciate your suggestions and support.
Thank you for the effort spent on the improvement. As mentioned in https://github.com/21jake/nimble-scraper/issues/35#issuecomment-1464949549, one issue I noticed in the PR is that the current approach processes keywords in a loop, making the process brittle and inefficient. A failure related to one keyword will stop the processing of the other keywords, so a separate asynchronous process should be used for each keyword.
A for loop helps reduce the risk of traffic spikes and of getting detected. We could always trigger all search requests at the same time, but that would make the proxies more prone to detection. Please understand that in a stealth job, being fast does not equal being efficient.
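For reference, a minimal sketch of a middle ground between the two comments above: keep the chunked pacing, but isolate per-keyword failures so one error doesn't stop the rest. The helper names are illustrative, not the PR's actual code:

```ts
// Illustrative only: process one chunk with Promise.allSettled so a failing
// keyword is recorded instead of aborting the whole chunk.
// scrapeKeyword() and markAsFailed() are hypothetical helpers.
async function processChunk(
  chunk: string[],
  scrapeKeyword: (keyword: string) => Promise<void>,
  markAsFailed: (keyword: string, reason: unknown) => void,
): Promise<void> {
  const results = await Promise.allSettled(chunk.map((keyword) => scrapeKeyword(keyword)));
  results.forEach((result, idx) => {
    if (result.status === 'rejected') {
      markAsFailed(chunk[idx], result.reason); // record the failure and keep going
    }
  });
}
```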
Issue
The Scraping Service has a creative way of using proxies to overcome mass-searching detection from Google. However, it uses Puppeteer, which requires Chromium running in headless mode: https://github.com/21jake/nimble-scraper/blob/f0673eb1420deabbd48ad2491bc10633254e908a/backend/src/services/scraper.service.ts#L29-L34 This requires more resources to work, as pointed out in the README.
I'm curious why you don't use an HTTP library, e.g. `axios`, to send the search requests and parse the result (e.g. using a library like `cheerio`) instead? It would be way more efficient. Also, as mentioned in #35, instead of using `sleep` to make the code wait in order to overcome the detection from Google, there should be a better way, e.g. by using the proxies and rotating the user agent in the request.

Expected
The scraping process is handled in a more efficient way.