dcts / opensea-scraper

Scrapes nft floor prices and additional information from opensea. Used for https://nftfloorprice.info
MIT License
184 stars · 73 forks

New puppeteer launch for every call of the functions #8

Closed · robertistok closed this 3 years ago

robertistok commented 3 years ago

Hello!

If one wanted to run the scraper in a worker, for example to check the offers for multiple projects asynchronously, the current solution would launch a new puppeteer instance for every call.

There are two solutions that came to my mind:

  1. Add a third argument, an options object named opts, to the functions, through which users could bring their own puppeteer instance. This is not the most elegant solution IMO, as it adds complexity for the caller.
  2. Export an OpenseaScraper class that would need to be instantiated. The constructor would launch a puppeteer instance and pass it down to the functions, as described in the first point. We could export both the functions as they are now and the new class.
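For illustration, option 1 could look like the sketch below. This is an assumed shape, not the repo's actual API: the `opts.browser` key, the parameter names, and the inline puppeteer stub (which only exists so the sketch runs without puppeteer installed) are all hypothetical.

```javascript
// Stand-in for require("puppeteer"), so this sketch is self-contained.
const puppeteer = { launch: async () => ({ close: async () => {} }) };

// Sketch of option 1: an optional `opts` object whose `browser` key lets the
// caller supply an already-launched puppeteer instance.
async function floorPrice(slug, opts = {}) {
  const ownBrowser = !opts.browser;            // did we launch the browser ourselves?
  const browser = opts.browser ?? (await puppeteer.launch());
  try {
    // ... use `browser` to scrape the floor price for `slug` ...
    return { slug, reusedCallerBrowser: !ownBrowser };
  } finally {
    if (ownBrowser) await browser.close();     // never close a caller-owned browser
  }
}
```

With this shape, `floorPrice("some-slug", { browser })` reuses the caller's browser, while `floorPrice("some-slug")` keeps today's launch-per-call behaviour.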

What do you think? I think this functionality would be valuable, as it would add a lot of efficiency when the functions are called one after another.

I'd be happy to submit a PR for this!

robertistok commented 3 years ago

Did a few tests; it turns out there is no notable difference whether the user brings their own puppeteer or not. I am closing this for now!

dcts commented 3 years ago

Hey, thanks a lot for the input! I was thinking of adding functionality to scrape asynchronously to bring down execution time, so you made a great point. May I ask how exactly you implemented this? I'm surprised that it had no notable performance improvement.

I was reading about an async workflow in puppeteer and came across this article: https://advancedweb.hu/how-to-speed-up-puppeteer-scraping-with-parallelization/
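In the spirit of that article, the simplest parallelization pattern is to run scrapes in batches with `Promise.all` instead of strictly one after another. Below is a minimal, self-contained sketch; `scrapeOne` is a made-up stand-in for a real per-collection scrape such as `offers()`.

```javascript
// Stand-in for a real per-collection scrape (e.g. offers(slug, { browser })).
async function scrapeOne(slug) {
  return { slug, offers: [] }; // placeholder result
}

// Run scrapes `concurrency` at a time: each batch runs concurrently, batches
// run sequentially, which caps how many pages are open at once.
async function scrapeInBatches(slugs, concurrency = 3) {
  const results = [];
  for (let i = 0; i < slugs.length; i += concurrency) {
    const batch = slugs.slice(i, i + concurrency);
    results.push(...(await Promise.all(batch.map(scrapeOne))));
  }
  return results;
}
```

Capping the batch size matters with puppeteer, since every concurrent page costs real memory and CPU.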

robertistok commented 3 years ago

This is the new function I added, which can be called to get an instance of the scraper.

```js
async function OpenseaScraperInstance({ mode } = {}) {
  // one shared browser for all calls; show the window in debug mode
  const browser = await puppeteer.launch({
    headless: mode !== "debug",
    args: ["--start-maximized"],
  });

  return {
    close: () => browser.close(),
    basicInfo: (...params) => basicInfo(...params), // no browser here, as it's an api call
    floorPrice: (...params) => floorPrice(...params, { browser }),
    floorPriceByUrl: (...params) => floorPriceByUrl(...params, { browser }),
    rankings: (...params) => rankings(...params, { browser }),
    offers: (...params) => offers(...params, { browser }),
  };
}
```
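For illustration, calling such an instance concurrently might look like the sketch below. It is self-contained: `makeInstance` is a stub that mirrors the instance above without launching a real browser, and the collection slugs are made up; with the real `OpenseaScraperInstance` the flow is the same.

```javascript
// Stub mirroring the instance above, so nothing actually launches.
async function makeInstance() {
  const browser = { close: async () => {} };   // stands in for puppeteer.launch()
  return {
    offers: async (slug) => ({ slug }),        // real version scrapes via `browser`
    close: () => browser.close(),
  };
}

async function run() {
  const scraper = await makeInstance();
  try {
    // several calls share the single browser instead of launching their own
    return await Promise.all([
      scraper.offers("doodles-official"),      // hypothetical collection slugs
      scraper.offers("azuki"),
    ]);
  } finally {
    await scraper.close();                     // shut the shared browser down exactly once
  }
}
```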

I've realized that I made some errors initially: I was still closing the browser after every call, even though it shouldn't be closed when one is already provided. Now that that's fixed, and with concurrency enabled, the results are below:

6 calls to the /offers endpoint:

  1. Using the instance from above: 32.3s
  2. The current implementation: 39.03s

This is quite a significant difference, and it would just increase the more times the function is called.

dcts commented 3 years ago

Interesting, but 7 seconds doesn't seem like all too big an improvement, to be honest. I still think it would make sense because of improved memory usage (which is important when you launch it in the cloud), but I would like to test this assumption first.

Also, I am not really sure about the proposed structure; I am in favor of option 1. It might be more complex for the caller, but on the other hand it gives full control over the puppeteer instance, so the caller can decide whether to use the stealth plugin, proxies, and so on. For now I would leave this open until I've done some more testing.

I also added some comments on your PR.

robertistok commented 3 years ago

It's roughly a 17% reduction (39.03s → 32.3s) on 6 calls, and it gets more interesting the more calls there are. But yeah, the memory saving is also quite good.

Regarding the implementation, the PR I proposed supports that as well. You can either bring your own puppeteer instance or call the getInstanceWithPuppeteer function and use that.