ConorRoberts opened 2 years ago
Thoughts: How often should we run the scraper loop? Should it be an on/off switch? Once per semester?
Priority queue? Would allow us to push urgent scrapers to the front of the queue before resuming the others.
Tracking: Related to the scrape-frequency question, how should we track scrapes? RDS?
Framework: Express? Nest? Something else?
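On the priority-queue question above, a minimal in-process sketch of the idea (scraper names are hypothetical; in practice a broker-side priority queue would replace this):

```python
import heapq
import itertools

class ScraperQueue:
    """Min-heap priority queue: lower number runs sooner.
    The counter breaks ties so equal-priority scrapers stay FIFO."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, scraper_name, priority=10):
        heapq.heappush(self._heap, (priority, next(self._counter), scraper_name))

    def push_front(self, scraper_name):
        # Jump the queue: priority 0 beats the default of 10.
        self.push(scraper_name, priority=0)

    def pop(self):
        return heapq.heappop(self._heap)[2]

q = ScraperQueue()
q.push("course-catalog")
q.push("seat-counts")
q.push_front("registration-window-seats")
print(q.pop())  # the pushed-to-front scraper comes out first
```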
@BCarlson1512 Definitely want to run the scraper loop up to multiple times per day (during registration windows) to make sure we have the most up-to-date value for seats available. Outside of registration season we can honestly get away with once every month or two.
Queue makes sense for a number of reasons. Can make use of existing RabbitMQ instance.
We can keep logs in S3 or something and track progress in Neo4j. There's likely a more suitable database out there but our use case is simple and I already pay for Neo4j.
Won't need any kind of API framework. We'll run all this through a RabbitMQ consumer inside a Docker container on an EC2 instance.
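A rough sketch of that consumer, assuming a hypothetical job message shape like `{"scraper": ..., "term": ...}`; the callback is separated out so the scrape logic stays testable, and the pika wiring (which needs a live broker) is shown only in comments:

```python
import json

def on_scrape_job(ch, method, properties, body):
    """RabbitMQ consumer callback. Parses the job, runs the scrape,
    then acks the message so failed scrapes get redelivered."""
    job = json.loads(body)
    # run_scraper(job["scraper"], job["term"])  # hypothetical scrape step
    ch.basic_ack(delivery_tag=method.delivery_tag)
    return job["scraper"]

# Wiring, roughly (requires a running RabbitMQ broker and the pika package):
# import pika
# conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
# channel = conn.channel()
# channel.queue_declare(queue="scrape-jobs", durable=True)
# channel.basic_consume(queue="scrape-jobs", on_message_callback=on_scrape_job)
# channel.start_consuming()
```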
Looking into AWS EventBridge scheduled jobs.
EventBridge schedule triggers -> Lambda -> start EC2 -> scrape & save -> stop EC2
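The Lambda step in that chain could look something like the sketch below (event shape and instance id are placeholder assumptions; the client is injectable so the handler can be exercised without AWS). The stop step would mirror it with `stop_instances` once the scrape finishes:

```python
def handler(event, context, ec2=None):
    """EventBridge-triggered Lambda that starts the scraper's EC2 instance."""
    if ec2 is None:
        import boto3  # only needed when running on real AWS
        ec2 = boto3.client("ec2")
    instance_id = event["instance_id"]  # hypothetical event field
    ec2.start_instances(InstanceIds=[instance_id])
    return {"started": instance_id}
```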
Moving away from ECS; looking into an alternative for scraping.
Architecture remains the same. ECS is not part of this process.
Set up a simple Express server, hosted on EC2, that performs the scraping and saving operations. These are currently run by me on my local machine, but I'd like to move them to AWS.
Ideally, the scraper saves its result to an S3 bucket. Manual review takes place there. Then we hit the `/save` endpoint to trigger loading the data into Neo4j.
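The upload half of that flow might look like this sketch (bucket, key, and endpoint names are hypothetical; the S3 client is injectable so the function runs without AWS):

```python
import json

def publish_scrape(result, bucket, key, s3=None):
    """Upload a scrape result to S3 as JSON for manual review."""
    if s3 is None:
        import boto3  # real client outside of tests
        s3 = boto3.client("s3")
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(result).encode())
    return key

# After reviewing the object in S3, something like
#   curl -X POST https://<host>/save?key=<key>
# would hit the Express server's /save endpoint to load the data into Neo4j.
```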