ScriptSmith / instamancer

Scrape Instagram's API with Puppeteer
http://adamsm.com/instamancer
MIT License
398 stars, 61 forks

[FEATURE] Serverless Framework Support #37

Open necevil opened 4 years ago

necevil commented 4 years ago

Is your feature request related to a problem? Please describe.
Many of the API endpoints for instamancer could in theory be ported to a serverless function that runs on either AWS Lambda (with a Puppeteer layer) or Google Cloud Functions (which has access to Puppeteer by default). This would make the solution more scalable and would also let starter users take advantage of their free monthly Lambda / function executions.

Describe the solution you'd like
Add Serverless Framework as a dependency and create a serverless config file to handle configuration when deploying.

Describe alternatives you've considered
Serverless Framework would help to abstract the differences between platforms for anyone who wants to run this serverlessly. I have not considered alternatives.

Additional context
The biggest issue will be data persistence (where to deposit photos / which db to insert records into).
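One way to picture the persistence question is to keep storage behind a small interface, so the function itself does not care whether posts end up in a bucket, a database, or memory. The TypeScript below is a minimal sketch under that assumption. `Post`, `Scraper`, `PostSink`, `MemorySink`, `ScrapeEvent`, and `makeHandler` are all hypothetical names for illustration; none of them come from instamancer or any cloud SDK, and the scraper is injected rather than calling instamancer directly.

```typescript
// Hypothetical shapes for illustration; none of these names come from instamancer.
interface Post {
    id: string;
    imageUrl: string;
}

// Anything that returns posts for a query (in practice, a wrapper around the scraper).
type Scraper = (query: string, limit: number) => Promise<Post[]>;

// Persistence boundary: an object store or database client would sit behind this.
interface PostSink {
    save(post: Post): Promise<void>;
}

// In-memory sink, useful for local runs and tests.
class MemorySink implements PostSink {
    readonly saved: Post[] = [];

    async save(post: Post): Promise<void> {
        this.saved.push(post);
    }
}

// Hypothetical event payload for the function.
interface ScrapeEvent {
    query: string;
    limit: number;
}

// Builds a function-style handler: scrape up to `limit` posts and persist each one.
function makeHandler(scrape: Scraper, sink: PostSink) {
    return async (event: ScrapeEvent): Promise<{ saved: number }> => {
        const posts = await scrape(event.query, event.limit);
        let saved = 0;
        for (const post of posts.slice(0, event.limit)) {
            await sink.save(post);
            saved += 1;
        }
        return { saved };
    };
}
```

Swapping `MemorySink` for a class that writes to a real store would leave the handler unchanged, which is the main point of the design.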

ScriptSmith commented 4 years ago

Your proposal is interesting, but there are a couple of things to consider.

1) Cold-boot time for the lambda function would be prohibitively slow given that a browser has to launch, load the page, and retrieve the results from the API.

2) Assuming it would act as an http endpoint, a containerised application would be just as simple to use, more cross-platform accessible, and could be run locally just as easily.

3) This feature might need to be its own project rather than part of the instamancer package. Instamancer would remain the core module, and then you can build whatever server system you want around it. Having an authoritative server model doesn't really justify adding additional weight and complexity to the current module (except perhaps if it was just an extremely simple express server).
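As a rough illustration of how thin such a server layer could be, here is a TypeScript sketch of the request-handling logic alone, kept free of any web framework so it could sit behind Express, a container entry point, or a serverless function. The route shape `/<kind>/<id>?total=<n>`, the `KINDS` list, the default of 10, and the `Scrape` signature are assumptions made for this example, not instamancer's actual interface.

```typescript
// Hypothetical scraper signature; a real wrapper would call into the core module here.
type Scrape = (kind: string, id: string, total: number) => Promise<unknown[]>;

interface HttpResult {
    status: number;
    body: string;
}

// Endpoint kinds accepted by this sketch (an assumption for illustration).
const KINDS = ["hashtag", "user", "post"];

// Handles GET /<kind>/<id>?total=<n> and returns a status code plus a JSON body.
async function handleGet(path: string, scrape: Scrape): Promise<HttpResult> {
    const queryStart = path.indexOf("?");
    const route = queryStart === -1 ? path : path.slice(0, queryStart);
    const query = queryStart === -1 ? "" : path.slice(queryStart + 1);

    const segments = route.split("/").filter((part) => part.length > 0);
    const [kind = "", id = ""] = segments;
    if (segments.length !== 2 || KINDS.indexOf(kind) === -1) {
        return { status: 404, body: JSON.stringify({ error: "unknown route" }) };
    }

    let total = 10; // default number of items for this sketch
    for (const pair of query.split("&")) {
        const [key = "", value = ""] = pair.split("=");
        if (key === "total") {
            total = Number(value);
        }
    }
    if (!isFinite(total) || Math.floor(total) !== total || total < 1) {
        return { status: 400, body: JSON.stringify({ error: "total must be a positive integer" }) };
    }

    try {
        const items = await scrape(kind, id, total);
        return { status: 200, body: JSON.stringify(items) };
    } catch {
        // The scraper failed upstream; report it without leaking details.
        return { status: 502, body: JSON.stringify({ error: "scrape failed" }) };
    }
}
```

Because the function only maps a path to a status and body, the surrounding server stays a few lines of glue and the core module carries no extra weight.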

Interested to hear your thoughts. We could create a new repo in https://github.com/instamancer

necevil commented 4 years ago

True. I think a secondary repo would for sure make sense.

On the cold boot & browser spin-up time
In my experience the bigger issue is the spin-up time for the browser. Most of the reasoning behind working on a more sophisticated deployment is to handle larger workloads (possibly with concurrent instances executing at once) and/or more consistently executed / scheduled use cases.

Since the cold boot only applies to containers that haven't been run in a while, it USUALLY doesn't add a huge amount of overhead on its own, since really it only affects your first execution. This assumes running the container 50 or 100 times after the first execution (which warms it up).

In cases where thousands of Lambda / serverless executions occur before cool-down, the warm-up overhead doesn't end up impacting things in a meaningful way (in my experience!).
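The amortisation argument can be made concrete with a little arithmetic. The TypeScript sketch below assumes a one-off cold-start cost per container plus a browser launch on every invocation. The function name and the millisecond figures in the note that follows are made-up placeholders, not measurements.

```typescript
// Average overhead per invocation when one container serves `invocations` requests
// before cooling down: the cold start is paid once, the browser launch every time.
function averageOverheadMs(
    coldStartMs: number,
    browserLaunchMs: number,
    invocations: number,
): number {
    if (invocations < 1) {
        throw new RangeError("invocations must be at least 1");
    }
    return coldStartMs / invocations + browserLaunchMs;
}
```

With a hypothetical 5000 ms cold start and 2000 ms browser launch, a single invocation carries 7000 ms of overhead, 100 invocations average 2050 ms each, and 1000 invocations average 2005 ms each. The cold start fades quickly while the browser launch stays constant.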

The bigger issue is the browser spinning up each time, but to me this is just par for the course when using Chrome / Puppeteer to avoid the scraping defenses out there (in this case Instagram's), and I think it's worth it.

In my experience almost all of the user behaviors that can be used to detect a scraper can be replicated in Puppeteer, so there is a huge amount of value / resilience added to the project by relying on Chrome, even though you eat the above-mentioned overhead.
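One common way to soften the per-execution browser cost is to launch lazily and keep the instance in module scope, since module-level state generally survives warm invocations of the same container. The TypeScript sketch below shows only that general pattern. `BrowserLike` and `lazyBrowser` are hypothetical names, the launch function is injected, and whether reuse is safe depends on the platform and on how the scraper manages its pages, so this is not a description of what instamancer does.

```typescript
// Hypothetical minimal browser shape; with Puppeteer this would be its browser object.
interface BrowserLike {
    close(): Promise<void>;
}

// Returns a getter that launches the browser on first use and reuses it afterwards.
// Kept at module scope, the cached promise can survive warm invocations.
function lazyBrowser<B extends BrowserLike>(launch: () => Promise<B>): () => Promise<B> {
    let cached: Promise<B> | undefined;
    return () => {
        if (cached === undefined) {
            cached = launch().catch((err) => {
                cached = undefined; // allow a retry after a failed launch
                throw err;
            });
        }
        return cached;
    };
}
```

Caching the promise rather than the resolved browser means two calls that overlap during startup still share a single launch.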

If the reasoning behind moving toward serverless is to abstract away the management of Instamancer queries that run on a schedule (every day, every hour, etc.), then this separate project could also support proxies to allow concurrent executions for larger projects.
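A minimal TypeScript sketch of the proxy idea: spread queries across a proxy list in round-robin order, then hand each execution its proxy through Chromium's `--proxy-server` command-line switch, which Puppeteer can pass along in its launch `args`. The function names and the example hosts in the note below are hypothetical.

```typescript
interface Assignment {
    query: string;
    proxy: string;
}

// Assigns each query a proxy in round-robin order so concurrent runs spread across proxies.
function assignProxies(queries: string[], proxies: string[]): Assignment[] {
    if (proxies.length === 0) {
        throw new Error("at least one proxy is required");
    }
    return queries.map((query, index) => ({
        query,
        proxy: proxies[index % proxies.length],
    }));
}

// Builds the Chromium argument that routes a browser instance through one proxy.
function proxyLaunchArgs(proxy: string): string[] {
    return ["--proxy-server=" + proxy];
}
```

For example, three queries over the two placeholder proxies `proxy-a.example:8080` and `proxy-b.example:8080` would be assigned a, b, a, and each serverless execution would launch its browser with the matching argument.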

I would like to play around with Instamancer a little more in a containerized environment but I don't see any reason why it would be super hard to configure.

Have you done any work on containerization / dockerization locally? I can probably at least contribute there!