Closed robertistok closed 3 years ago
Did a few tests; it turns out there is no notable difference whether the user brings their own puppeteer or not. I am closing this for now!
Hey, thanks a lot for the input! I was thinking of adding functionality to scrape asynchronously to bring down execution time, so you made a great point. May I ask how exactly you implemented this? I'm surprised that it had no notable performance improvement.
I was reading about an async workflow in puppeteer and came across this article: https://advancedweb.hu/how-to-speed-up-puppeteer-scraping-with-parallelization/
This is the new function I added, which can be called to get an instance of the scraper:

```js
async function OpenseaScraperInstance({ mode } = {}) {
  const browser = await puppeteer.launch({
    headless: mode === "debug" ? false : true,
    args: ["--start-maximized"],
  });
  return {
    close: () => browser.close(),
    basicInfo: (...params) => basicInfo(...params), // no browser here, as it's an api call
    floorPrice: (...params) => floorPrice(...params, { ...(params && params.opts ? params.opts : {}), browser }),
    floorPriceByUrl: (...params) => floorPriceByUrl(...params, { ...(params && params.opts ? params.opts : {}), browser }),
    rankings: (...params) => rankings(...params, { ...(params && params.opts ? params.opts : {}), browser }),
    offers: (...params) => offers(...params, { ...(params && params.opts ? params.opts : {}), browser }),
  };
}
```
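For context, a caller would use this factory roughly like below. This is a hedged sketch: the export name and the shape of the `offers` function are assumptions, and a stub stands in for `puppeteer.launch` so the snippet runs without puppeteer installed.

```javascript
// Stub standing in for puppeteer.launch, so the sketch is self-contained.
const puppeteer = {
  launch: async () => ({ close: async () => {} }),
};

// Simplified version of the factory: one browser shared across all calls.
async function OpenseaScraperInstance({ mode } = {}) {
  const browser = await puppeteer.launch({ headless: mode !== "debug" });
  return {
    close: () => browser.close(),
    // Hypothetical scrape function reusing the shared browser.
    offers: async (slug) => ({ slug, usedSharedBrowser: browser !== undefined }),
  };
}

(async () => {
  const scraper = await OpenseaScraperInstance();
  const result = await scraper.offers("cool-cats");
  console.log(result.slug, result.usedSharedBrowser); // cool-cats true
  await scraper.close(); // one close at the very end, not per call
})();
```

The key point is that `close()` is called once by the caller, instead of each scrape function tearing down the browser it was handed.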
I've realized that I made some mistakes initially: I was still closing the browser after every call, even though it wasn't supposed to when one is already provided. Now that I've fixed that, and with concurrency enabled, you can see the results below:
6 calls to the `/offers` endpoint:

- Using the instance from above: 32.3s
- The current implementation: 39.03s

This is quite a significant difference, and it would only grow the more times the function is called.
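The pattern behind these numbers can be sketched without puppeteer at all: launching the browser once and firing the calls with `Promise.all` removes the repeated startup cost. The delays below are made up for illustration and are not measurements of opensea-scraper.

```javascript
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

// Illustrative, assumed costs (not real measurements).
const LAUNCH_MS = 50; // stands in for puppeteer.launch
const SCRAPE_MS = 30; // stands in for one offers() call

// Current behavior: a fresh browser launch before every sequential call.
async function withNewBrowserEachCall(n) {
  const t0 = Date.now();
  for (let i = 0; i < n; i++) {
    await sleep(LAUNCH_MS); // launch
    await sleep(SCRAPE_MS); // scrape
  }
  return Date.now() - t0;
}

// Proposed behavior: launch once, run the calls concurrently.
async function withSharedBrowser(n) {
  const t0 = Date.now();
  await sleep(LAUNCH_MS); // single launch
  await Promise.all(Array.from({ length: n }, () => sleep(SCRAPE_MS)));
  return Date.now() - t0;
}

(async () => {
  const slow = await withNewBrowserEachCall(6);
  const fast = await withSharedBrowser(6);
  console.log(fast < slow); // true
})();
```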
Interesting, but 7 seconds doesn't seem like all that big an improvement, to be honest. Still, I think it would make sense because of improved memory usage (which is important when you run it in the cloud). But I would like to test this assumption first.
Also, I am not really sure about the proposed structure; I am in favor of Option 1. It might be more complex for the caller, but on the other hand it gives full control over the puppeteer instance, so the caller can decide if they want to use the stealth plugin, proxies, and so on. For now I would leave this open until I've done some more testing.
I also added some comments on your PR.
It's roughly a 17% reduction over 6 calls (6.7s out of 39s), and the saving grows with the number of calls. But yeah, the memory saving is also quite good.
Regarding the implementation, the PR I proposed supports that as well. You could either bring your own puppeteer or call the `getInstanceWithPuppeteer` function and use that.
Hello!
In case one would like to run the scraper in a worker, to check for example the offers for multiple projects asynchronously, the current solution would launch a new `puppeteer` instance for each call. There are two solutions that came to my mind:

1. Pass down an existing `puppeteer` instance to each function. This is not the most elegant solution IMO, as it would be more complex for the caller.
2. Add an `OpenseaScraper` class that would need to be instantiated. In the constructor, we would launch a `puppeteer` instance and pass it down to the functions, as described in the first point. We could export both the functions as they are now and the new class.

What do you think? I think it would be valuable to have this functionality, as it would add a lot of efficiency when the functions are called one after another.
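The class-based option could look roughly like the sketch below. The names `OpenseaScraper.init` and the options shape are my assumptions, not the library's API, and a stub replaces `puppeteer.launch` so the snippet runs standalone.

```javascript
// Stub for puppeteer.launch so the sketch is self-contained.
const puppeteer = { launch: async () => ({ close: async () => {} }) };

// Hypothetical scrape function that accepts a shared browser in its options.
async function offers(slug, { browser } = {}) {
  return { slug, reusedBrowser: Boolean(browser) };
}

class OpenseaScraper {
  // A constructor cannot be async, so launching is deferred to init().
  async init(launchOpts = {}) {
    this.browser = await puppeteer.launch(launchOpts);
    return this;
  }
  offers(slug) {
    return offers(slug, { browser: this.browser });
  }
  close() {
    return this.browser.close();
  }
}

(async () => {
  const scraper = await new OpenseaScraper().init();
  const res = await scraper.offers("some-collection");
  console.log(res.reusedBrowser); // true
  await scraper.close();
})();
```

One wrinkle worth noting: since constructors cannot `await`, the browser launch has to live in an `init()` method (or a static factory), which is why the factory-function approach from the earlier comment is a reasonable alternative.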
I'd be happy to submit a PR for this!