ConorRoberts opened 2 years ago
Thoughts: How often should we run the scraper loop? Should it be an on/off switch? Once per semester?
Priority queue? Would allow us to push urgent scrapers to the front of the queue before resuming the others.
Tracking: Related to the scrape-frequency question, how should we track scrapes? RDS?
Framework: Express? Nest? Something else?
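On the priority-queue question above, a minimal in-process sketch of the idea (scraper names are hypothetical; in practice a broker-side priority queue would replace this):

```python
import heapq
import itertools

class ScraperQueue:
    """Min-heap priority queue: lower number runs sooner.
    The counter breaks ties so equal-priority scrapers stay FIFO."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, scraper_name, priority=10):
        heapq.heappush(self._heap, (priority, next(self._counter), scraper_name))

    def push_front(self, scraper_name):
        # Jump the queue: priority 0 beats the default of 10.
        self.push(scraper_name, priority=0)

    def pop(self):
        return heapq.heappop(self._heap)[2]

q = ScraperQueue()
q.push("course-catalog")
q.push("seat-counts")
q.push_front("registration-window-seats")
print(q.pop())  # the pushed-to-front scraper comes out first
```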
@BCarlson1512 Definitely want to run the scraper loop up to multiple times per day (during registration windows) to make sure we have the most up-to-date value for seats available. Outside of registration season we can honestly get away with once every month or two.
Queue makes sense for a number of reasons. Can make use of existing RabbitMQ instance.
We can keep logs in S3 or something and track progress in Neo4j. There's likely a more suitable database out there but our use case is simple and I already pay for Neo4j.
Won't need any kind of API framework. We'll run all this through a RabbitMQ consumer inside a Docker container on an EC2 instance.
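A rough sketch of that consumer, assuming a hypothetical job message shape like `{"scraper": ..., "term": ...}`; the callback is separated out so the scrape logic stays testable, and the pika wiring (which needs a live broker) is shown only in comments:

```python
import json

def on_scrape_job(ch, method, properties, body):
    """RabbitMQ consumer callback. Parses the job, runs the scrape,
    then acks the message so failed scrapes get redelivered."""
    job = json.loads(body)
    # run_scraper(job["scraper"], job["term"])  # hypothetical scrape step
    ch.basic_ack(delivery_tag=method.delivery_tag)
    return job["scraper"]

# Wiring, roughly (requires a running RabbitMQ broker and the pika package):
# import pika
# conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
# channel = conn.channel()
# channel.queue_declare(queue="scrape-jobs", durable=True)
# channel.basic_consume(queue="scrape-jobs", on_message_callback=on_scrape_job)
# channel.start_consuming()
```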
Looking into AWS EventBridge scheduled jobs.
EventBridge schedule triggers -> Lambda -> start EC2 -> scrape & save -> stop EC2
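The Lambda step in that chain could look something like the sketch below (event shape and instance id are placeholder assumptions; the client is injectable so the handler can be exercised without AWS). The stop step would mirror it with `stop_instances` once the scrape finishes:

```python
def handler(event, context, ec2=None):
    """EventBridge-triggered Lambda that starts the scraper's EC2 instance."""
    if ec2 is None:
        import boto3  # only needed when running on real AWS
        ec2 = boto3.client("ec2")
    instance_id = event["instance_id"]  # hypothetical event field
    ec2.start_instances(InstanceIds=[instance_id])
    return {"started": instance_id}
```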
Moving away from ECS; looking into an alternative for scraping.
Architecture remains the same. ECS is not part of this process.
Set up a simple Express server, hosted on EC2, that performs the scraping and saving operations. These are currently run by me on my local machine, but I'd like to move them to AWS.
Ideally, the scraper saves its result to an S3 bucket. Manual review takes place there. Then we hit the `/save` endpoint to trigger loading the data into Neo4j.
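The upload half of that flow might look like this sketch (bucket, key, and endpoint names are hypothetical; the S3 client is injectable so the function runs without AWS):

```python
import json

def publish_scrape(result, bucket, key, s3=None):
    """Upload a scrape result to S3 as JSON for manual review."""
    if s3 is None:
        import boto3  # real client outside of tests
        s3 = boto3.client("s3")
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(result).encode())
    return key

# After reviewing the object in S3, something like
#   curl -X POST https://<host>/save?key=<key>
# would hit the Express server's /save endpoint to load the data into Neo4j.
```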