21jake / nimble-scraper


[Feature] Improve the efficiency of the scraping process #37

Closed: longnd closed this issue 1 year ago

longnd commented 1 year ago

Issue

The Scraping Service has a creative way of using proxies to overcome Google's mass-searching detection. However, it uses Puppeteer, which requires Chromium running in headless mode https://github.com/21jake/nimble-scraper/blob/f0673eb1420deabbd48ad2491bc10633254e908a/backend/src/services/scraper.service.ts#L29-L34 and therefore needs more resources to work, as pointed out in the Readme:

Currently a 2-CPU 4GB Ubuntu server with 22 proxies can handle up to 7 concurrent uploads before showing sign of scraping failures (Captcha-ed, Timeout, etc).

I'm curious why you don't use an HTTP library, e.g. axios, to send the search requests and parse the results (e.g. with a library like cheerio) instead? It would be far more efficient.

Also, as mentioned in #35, instead of using sleep to make the code wait in order to overcome Google's detection, there should be a better way, e.g. using the proxies and rotating the user agent in the request.
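For illustration only, here is a rough sketch of that idea with axios; the proxy values and helper names are made up, and the real service would use its existing proxy list:

```ts
import axios from 'axios';

// Hypothetical proxy pool; the real service would use its existing proxy list.
const PROXIES = [
  { host: '10.0.0.1', port: 8080 },
  { host: '10.0.0.2', port: 8080 },
];

const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0',
];

function pick<T>(items: T[]): T {
  return items[Math.floor(Math.random() * items.length)];
}

// Each request goes out through a random proxy with a random user agent.
async function search(keyword: string): Promise<string> {
  const res = await axios.get('https://www.google.com/search', {
    params: { q: keyword, gl: 'us', hl: 'en' },
    headers: { 'User-Agent': pick(USER_AGENTS) },
    proxy: pick(PROXIES),
    timeout: 10_000,
  });
  return res.data; // raw HTML, to be parsed with e.g. cheerio
}
```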

Expected

The scraping process is handled in a more efficient way.

21jake commented 1 year ago

The reasons I picked Puppeteer over plain HTTP libraries:

I appreciate your suggestions. Since no captcha-solving service is currently being used, I will create a feature/http-scraper branch to give it a go and we'll see how things turn out.

21jake commented 1 year ago

Testing results


I figured that the stack of axios, cheerio, and random-useragent works fine for sending proxied requests. However, another major drawback of this approach is that HTTP requests couldn't obtain the search performance results, e.g., About 10,000 results (0.60 seconds). In the screenshot below, the left side is the Axios cache and the right side is the Puppeteer cache.

[Screenshot: Axios cache (left) vs. Puppeteer cache (right), Screen Shot 2023-03-10 at 10 36 48]

I can't find an explanation for this. I assume that, being a headless browser, Puppeteer is able to get more complete HTML content for the page.

This drawback actually matters because the application requirements state that:

For each search result/keyword result page on Google, store the following information on the first results page: The total search results for this keyword, e.g., About 21,600,000 results (0.42 seconds)

Since this approach clearly doesn't deliver what's expected, even if I had picked HTTP libraries from the beginning I'd have had to switch to other alternatives anyway. I hope we're on the same page that getting it done right is better than getting it done quickly.

As always, I'm open to other alternatives to enhance the scraping process.

Update:


try using the proxies and rotating the user agent in the request.

I tried this at the very start: combining proxies with random user agents while continually decreasing the sleep() delay. That didn't work out; rotating the user agent didn't seem to have any impact on performance. That makes sense, because Google's main detection criterion is the IP address of the request (they take lots of other steps too, and they're not going to be public about them). Which means it all boils down to keeping the proxies from being overused.
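To make that concrete, one way to keep proxies from being overused would be a least-recently-used picker with a cooldown; the names and the 30-second cooldown below are assumptions for illustration, not what the service actually does:

```ts
// Hypothetical proxy record, tracking when it was last sent a request.
interface Proxy {
  host: string;
  port: number;
  lastUsedAt: number; // epoch milliseconds
}

const COOLDOWN_MS = 30_000; // assumed minimum gap between uses of one proxy

// Pick the proxy that has rested the longest; return undefined if even that
// one is still inside its cooldown window, so the caller can wait or queue.
function pickLeastRecentlyUsed(pool: Proxy[]): Proxy | undefined {
  const candidate = [...pool].sort((a, b) => a.lastUsedAt - b.lastUsedAt)[0];
  if (!candidate || Date.now() - candidate.lastUsedAt < COOLDOWN_MS) {
    return undefined;
  }
  candidate.lastUsedAt = Date.now();
  return candidate;
}
```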


longnd commented 1 year ago

Thank you for the effort spent on trying another approach as suggested.

I hope we're on the same page that getting it done right is better than getting it done quickly.

I agree that getting things done right is important.

Since this approach clearly doesn't deliver what's expected, even if I had picked HTTP libraries from the beginning I'd have had to switch to other alternatives anyway

Since I don't know how you implemented it, I can't guess whether anything was wrong. But that solution, using axios with a random user agent (even without the proxy), should be able to get the search results as expected. I have seen other candidates take a similar approach and get the results they want. Here is a simple code example:

import { HttpService } from '@nestjs/axios';

const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 12.6; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (X11; Linux i686; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (X11; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edge/106.0.1370.47',
];

function randomUserAgent(): string {
  const randomIndex = Math.floor(Math.random() * USER_AGENTS.length);
  return USER_AGENTS[randomIndex];
}

...
const res = await this.httpService.axiosRef.get(
  `https://www.google.com/search?q=${query}&gl=us&hl=en`,
  {
    headers: {
      'User-Agent': randomUserAgent(),
    },
  },
);
const html = res.data;
...
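And to pull the total-results line out of that HTML, a cheerio snippet along these lines should work, assuming the stats still render inside the #result-stats element (Google's markup can change at any time):

```ts
import * as cheerio from 'cheerio';

// Assumption: the "About X results (Y seconds)" line sits in #result-stats.
function parseResultStats(html: string): string | null {
  const $ = cheerio.load(html);
  const stats = $('#result-stats').text().trim();
  return stats.length > 0 ? stats : null;
}
```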
21jake commented 1 year ago

Thank you for your sample. It turns out that the search performance results only appear with certain types of User Agent, which means the UAs have to be "hand-picked" rather than generated at random.

In #43 I've tried replacing Puppeteer with Axios and Cheerio to see how things turn out. The improvement was quite significant:

To conclude, my initial assumption about the pros of Puppeteer was wrong, i.e., being the only option able to solve captchas does not outweigh the costs, considering how resource-demanding it is.

I truly appreciate your suggestions and support.

longnd commented 1 year ago

Thank you for the effort spent on the improvement. As mentioned in https://github.com/21jake/nimble-scraper/issues/35#issuecomment-1464949549, one issue I noticed in the PR is that the current approach processes keywords in a loop, making the process brittle and inefficient. A failure related to one keyword will stop the processing of the other keywords, so separate asynchronous processes should be used for each keyword (see the sketch below).

https://github.com/21jake/nimble-scraper/blob/0687d4759b8b7d8440c41f4040bc9c9381102305/backend/src/services/scraper.service.ts#L33-L34
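For example, something along these lines would let each keyword be processed independently, so one failure doesn't block the rest; scrapeKeyword below is just a stand-in for the existing per-keyword logic:

```ts
// Stand-in for the existing per-keyword scraping logic.
async function scrapeKeyword(keyword: string): Promise<void> {
  // ...send the proxied search request and persist the parsed result...
}

// Each keyword runs as its own async task; Promise.allSettled collects
// successes and failures, so one rejection no longer blocks the others.
async function scrapeAll(keywords: string[]): Promise<void> {
  const results = await Promise.allSettled(
    keywords.map((keyword) => scrapeKeyword(keyword)),
  );

  results.forEach((result, index) => {
    if (result.status === 'rejected') {
      console.error(`Keyword "${keywords[index]}" failed:`, result.reason);
    }
  });
}
```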

21jake commented 1 year ago

A for loop helps reduce the risk of a traffic spike and of getting detected. We could always trigger all search requests at the same time, but that would make the proxies more prone to detection. Please understand that in a stealth job, being fast does not equal being efficient.
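That said, if a middle ground were ever wanted, the keywords could be scraped in small concurrent batches rather than fully sequentially or all at once. A rough sketch, with an arbitrary batch size and the same hypothetical scrapeKeyword as in the earlier sketch:

```ts
// Same hypothetical scrapeKeyword as in the earlier sketch.
declare function scrapeKeyword(keyword: string): Promise<void>;

// Keywords are scraped in small batches: concurrency stays capped (so the
// proxies are not hammered all at once), while a failure inside a batch
// does not stop the remaining keywords.
async function scrapeInBatches(keywords: string[], batchSize = 3): Promise<void> {
  for (let i = 0; i < keywords.length; i += batchSize) {
    const batch = keywords.slice(i, i + batchSize);
    await Promise.allSettled(batch.map((keyword) => scrapeKeyword(keyword)));
  }
}
```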