Added the ability to run this with Serverless (AWS Lambda) and persist to S3 with CloudWatch Logging

rvilim commented 5 years ago

This patch adds a way to run this script serverlessly, both to cut down on costs (~$2 per year) and to improve reliability.

It uses the Serverless framework to run the TTC scrape script as a cron job (Serverless is snazzy! It even accounts for daylight saving time!), then persists the results to S3. CloudWatch automatically pulls in the logs; the logging is pretty rudimentary right now but could be improved.

The major changes involve adding the handler method for the Lambda entry point and pulling out all the writing code into classes in writing.py. This lets us have Postgres and S3 as separate writers and the rest of the code not care which one gets used.
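To make the writer split concrete, here is a minimal sketch of the idea rather than the actual contents of writing.py. The class names, the `run_scrape` helper, and the record fields are all hypothetical, and the writers only store data in memory where real ones would talk to Postgres or S3.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Record = Dict[str, Any]


class Writer(ABC):
    """Common interface so the scraper never needs to know where data goes."""

    @abstractmethod
    def write(self, records: List[Record]) -> None:
        ...


class InMemoryPostgresWriter(Writer):
    """Hypothetical stand-in for a Postgres writer; a real one would run INSERTs."""

    def __init__(self) -> None:
        self.rows: List[Record] = []

    def write(self, records: List[Record]) -> None:
        self.rows.extend(records)


class InMemoryS3Writer(Writer):
    """Hypothetical stand-in for an S3 writer; a real one would upload an object."""

    def __init__(self) -> None:
        self.objects: List[List[Record]] = []

    def write(self, records: List[Record]) -> None:
        # One scrape becomes one stored object.
        self.objects.append(list(records))


def run_scrape(writer: Writer) -> int:
    """Hypothetical scrape step: build records and hand them to the chosen writer."""
    records = [{"station": "example", "minutes": 3}]  # placeholder, not real TTC data
    writer.write(records)
    return len(records)


def handler(event, context):
    """Sketch of a Lambda entry point; the serverless path always uses the S3 writer."""
    writer = InMemoryS3Writer()
    return {"records_written": run_scrape(writer)}
```

Because `run_scrape` only depends on the `Writer` interface, swapping the destination is a one-line change at the call site.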

Major changes:

The ttc_api_scraper.py scrape command now requires either --s3 or --postgres.
Logging to a file only happens if the LOG_FILENAME environment variable is set (this keeps the serverless setup clean). Additionally, I removed the logging config from db.cfg; again, this made the serverless entry point nicer and I don't think it needed to be there.
I re-categorized some log levels so that only actual errors are reported as errors.
I rejiggered the try/except blocks around aiohttp to catch more informative errors.
The S3 files are stored as .tar.gz in a format similar to the rows in the Postgres database. A key difference is that the ids are UUIDs instead of incrementing numbers. Since Lambdas are stateless we can't keep track of a counter, so I used something that will effectively never collide.
These are currently being written to s3://rvilim.ttc.scrape, which should be publicly accessible.
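As a rough illustration of the UUID and .tar.gz points above, the sketch below assigns random ids to rows and packs them into an in-memory archive. It is an assumption-laden example: the JSON encoding, the member name `requests.json`, the row fields, and the function names are all hypothetical, and the upload to S3 is left out entirely.

```python
import io
import json
import tarfile
import uuid


def with_uuid_ids(rows):
    """Give each row a random UUID id; no counter has to survive between invocations."""
    return [{"id": str(uuid.uuid4()), **row} for row in rows]


def build_archive(rows, member_name="requests.json"):
    """Pack rows into a gzip-compressed tar archive held in memory."""
    payload = json.dumps(rows).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_archive(data, member_name="requests.json"):
    """Read the rows back out of an archive produced by build_archive."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
        member = archive.extractfile(member_name)
        return json.loads(member.read().decode("utf-8"))
```

The bytes returned by `build_archive` are what a stateless function would hand to an S3 client as the object body.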

Todo: A query tool that automatically pulls the relevant files from S3 and turns them into a CSV (like the Postgres script)

rvilim commented 5 years ago

Sorry, I'll write some better docs. It's American Thanksgiving from Thursday onwards, so I should have some time.

So re: Circle, unless I'm missing something, I can't see any tests that exercise the Postgres capabilities. I mentioned this in the docs I did write, but I left the Postgres functionality intact (I just split it out in the code). What I did change was adding a command-line flag to specify where you want the data to go (either --postgres or --s3, not both). I definitely didn't modify any tests to accommodate that flag, though.
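For anyone updating tests, the "either --postgres or --s3, not both" behaviour can be sketched with a required mutually exclusive group. This uses argparse from the standard library purely for illustration; the real script may use a different CLI library, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the scrape command's destination flags."""
    parser = argparse.ArgumentParser(prog="ttc_api_scraper.py")
    subparsers = parser.add_subparsers(dest="command", required=True)
    scrape = subparsers.add_parser("scrape", help="scrape and persist the results")
    # Exactly one destination must be chosen; argparse rejects zero or two.
    target = scrape.add_mutually_exclusive_group(required=True)
    target.add_argument("--postgres", action="store_true", help="write to Postgres")
    target.add_argument("--s3", action="store_true", help="write to S3")
    return parser
```

Under this sketch, a test invocation would need to pass one of the two flags explicitly, for example `build_parser().parse_args(["scrape", "--postgres"])`.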

radumas commented 5 years ago

Tagging this to #33 so people see it

CivicTechTO / ttc_subway_times

Added the ability to run this with Serverless (AWS Lambda) and persist to S3 with CloudWatch Logging #50