ericphanson / arxiv-search

Elasticsearch-backed rewrite of arxiv-sanity
MIT License

migration of core pipeline to EC2 #5

Open ericphanson opened 6 years ago

ericphanson commented 6 years ago

Right now, the WatchStatus.. Lambda reads the DynamoDB table stream and fires (wrapped) Lambdas accordingly.

Instead, WatchStatus should put the info to fire the Lambda into an SQS queue.
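A minimal sketch of what the producer side could look like, assuming a boto3-style SQS client is passed in. The message layout (`function` and `payload` keys) and the helper names are hypothetical, not something already in the repo:

```python
import json


def build_message(function_name, payload):
    """Serialize which function to run and the event to run it on.

    The {"function": ..., "payload": ...} layout is an assumed format.
    """
    return json.dumps({"function": function_name, "payload": payload})


def enqueue_invocation(sqs, queue_url, function_name, payload):
    """Put one 'please run this function' message on the queue.

    `sqs` is expected to behave like boto3.client("sqs").
    Returns the MessageId that SQS assigns to the message.
    """
    response = sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=build_message(function_name, payload),
    )
    return response["MessageId"]
```

Inside WatchStatus this would replace the direct Lambda invocation: for each stream record that should trigger work, call `enqueue_invocation(boto3.client("sqs"), queue_url, name, event)` instead.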

An EC2 instance should be configured to read the SQS queue and process events, by "firing" the associated lambda (i.e. running the corresponding code on the EC2 instance itself).
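Roughly, the EC2 side could be a long-polling loop like the sketch below. It assumes each message body is JSON with hypothetical `function` and `payload` keys, and that `handlers` maps function names to the existing Lambda handler functions; none of these names come from the repo:

```python
import json


def process_batch(sqs, queue_url, handlers, max_messages=10, wait_seconds=20):
    """Receive up to `max_messages` messages and run the matching handler for each.

    A message is deleted only after its handler succeeds. Failed messages stay
    on the queue and become visible again after the visibility timeout.
    Returns the number of messages processed successfully.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,
        WaitTimeSeconds=wait_seconds,  # long polling
    )
    processed = 0
    for message in response.get("Messages", []):
        try:
            body = json.loads(message["Body"])
            handler = handlers[body["function"]]
            handler(body["payload"], None)  # Lambda handlers take (event, context)
        except Exception as exc:
            print(f"leaving message on queue after error: {exc!r}")
            continue
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        processed += 1
    return processed


def run_forever(queue_url, handlers):
    """Entry point for the EC2 instance; needs boto3 and AWS credentials."""
    import boto3

    sqs = boto3.client("sqs")
    while True:
        process_batch(sqs, queue_url, handlers)
```

Deleting only after success is what keeps events from being lost if an instance dies mid-task. The flip side is that a handler may run more than once for the same event, so handlers should be safe to repeat.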

Why?

  1. Cost: running 10 million Lambdas for 10 seconds each costs about $200, and that's a realistic number of Lambdas for processing the million papers in the arXiv.
  2. It still scales reasonably well. The SQS queue has nice properties so we don't miss events, and we can always boot up more EC2 instances to process the queue in parallel. We keep the parallelized structure we had from the Lambdas: each queue event causes one thing to happen, which updates the table and triggers the next event down the line, which could even be processed by a different EC2 instance.
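For reference, the cost figure in point 1 can be roughly reproduced as below. The 128 MB memory size is an assumption (the issue doesn't state one), the rates are the published on-demand Lambda prices for x86 and may have changed, and the free tier is ignored:

```python
# On-demand AWS Lambda rates (x86); check current pricing before relying on these.
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_MILLION_REQUESTS = 0.20


def lambda_cost(invocations, seconds_each, memory_mb=128):
    """Estimate the Lambda bill in USD: compute charge plus request charge."""
    gb_seconds = invocations * seconds_each * (memory_mb / 1024)
    compute = gb_seconds * PRICE_PER_GB_SECOND
    requests = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    return compute + requests


if __name__ == "__main__":
    # 10 million invocations of 10 seconds each, at the assumed 128 MB
    print(f"${lambda_cost(10_000_000, 10):.2f}")
```

At the assumed 128 MB this comes out slightly above $200, and the cost grows linearly with the memory setting, so larger functions would cost proportionally more.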

What needs to be done?

Need to