The current scripts and data are in the dashboard repo. They get the data from various APIs and use a GitHub SDK (PyGithub) to commit the data daily after 00:00 UTC.
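For context, the GitHub side of those scripts amounts to committing updated data files through the API with PyGithub. A minimal sketch of that pattern is below; the environment variable, repo name, file path, and content are placeholders, not the actual values used in the dashboard repo.

```python
import os
from github import Github  # PyGithub

# Hedged sketch of the daily commit made with the GitHub SDK (PyGithub).
# The token source, repo name, file path, and content are placeholders.
token = os.environ["GITHUB_TOKEN"]            # personal access token (assumed env var)
new_csv_text = "date,count\n2023-01-01,42\n"  # placeholder data

g = Github(token)
repo = g.get_repo("HeardLibrary/dashboard")   # hypothetical owner/repo
existing = repo.get_contents("data/stats.csv", ref="master")
repo.update_file(
    path=existing.path,
    message="daily data update",
    content=new_csv_text,
    sha=existing.sha,
    branch="master",
)
```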
The lambda function is currently a work in progress. Issues dealt with during the conversion:

- Switched from the `requests` module to `urllib3` to avoid having to install any modules (a sketch of the new fetch pattern follows the notes below).
- The major problem encountered was the PyGithub module, which is required to instantiate a `Github` object used to interact with the API. I followed this procedure to install it as a package on an EC2 instance, but couldn't get it to run without errors:
chmod 400 /Users/baskausj/baskauf_python_lambda_package_builder.pem
ssh -i baskauf_python_lambda_package_builder.pem ec2-user@ec2-3-83-117-207.compute-1.amazonaws.com
pip3 install --target ./package PyGithub
scp -i baskauf_python_lambda_package_builder.pem ec2-user@ec2-3-83-117-207.compute-1.amazonaws.com:my-sourcecode-function/my-deployment-package.zip /Users/baskausj/my-deployment-package.zip
Then added `lambda_function.py` to the zip file using `zip -g my-deployment-package.zip lambda_function.py`. Unfortunately, it still threw an error. See this discussion about the error message: https://stackoverflow.com/questions/57189352/aws-lambda-unable-to-import-module-python-handler-no-module-named-cffi-bac/57221052#57221052
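For the `requests`-to-`urllib3` switch mentioned above, a minimal sketch of the pattern looks like this, assuming a JSON API endpoint (the URL and function name are placeholders). `urllib3` is available in the Lambda Python runtime because boto3 depends on it, which is why nothing needs to be packaged.

```python
import json
import urllib3

# Minimal sketch of the requests -> urllib3 conversion; the URL is a placeholder.
http = urllib3.PoolManager()

def fetch_json(url: str) -> dict:
    """GET a URL and parse the JSON body (roughly equivalent to requests.get(url).json())."""
    response = http.request("GET", url)
    return json.loads(response.data.decode("utf-8"))

# e.g. data = fetch_json("https://example.org/api/stats?format=json")
```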
A blog post mentions creating a Lambda layer as the best solution, since layers can be reused across different Lambdas without having to build all of the packages and upload them in a zip each time.
For now I am mothballing the EC2 instance by stopping (but not deleting) it.
Gave up on pushing the data to GitHub, since I couldn't get the PyGithub module to deploy. Saving the results in a public S3 bucket is probably a better idea anyway: the way the script was running, if it ever failed to connect with GitHub (e.g. the API was down), the data files got corrupted. That actually happened for over a year without my noticing, so S3 is more reliable.
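A sketch of what the S3 write can look like with boto3 (which is built into the Lambda runtime); the bucket name, function name, and content type here are placeholders, and public read access is assumed to come from the bucket policy rather than the code.

```python
import boto3

s3 = boto3.client("s3")

def save_to_s3(file_name: str, text: str, bucket: str = "example-dashboard-data") -> None:
    """Write one result file to the (assumed public) S3 bucket; names are placeholders."""
    s3.put_object(
        Bucket=bucket,
        Key=file_name,
        Body=text.encode("utf-8"),
        ContentType="text/csv",   # assumes CSV output; adjust for JSON files
    )
```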
The final code is here. It's running in us-east-1 and is called `collect_api_data`. Set up the trigger as an EventBridge (formerly CloudWatch Events) rule with ARN `arn:aws:events:us-east-1:555751041262:rule/trigger_api_data_collection`, using the cron expression `5 0-1 * * ? *`, which runs at 00:05 and 01:05 UTC every day. The function keeps track of the last update status in a `last_run.json` file. If any update is unsuccessful at 00:05, it gets run again at 01:05. If it's unsuccessful a second time, it sends me an email with the name of the file that failed to update.
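A rough sketch of that retry bookkeeping, assuming `last_run.json` lives in the same S3 bucket, the per-file update functions are registered in a dict, and the failure notice goes out through SES. All of the names, addresses, and the tracking format here are assumptions for illustration, not the actual implementation.

```python
import datetime
import json
import boto3

s3 = boto3.client("s3")
ses = boto3.client("ses")
BUCKET = "example-dashboard-data"   # placeholder bucket name
STATUS_KEY = "last_run.json"
UPDATERS = {}                       # hypothetical: file name -> function that updates it

def load_status() -> dict:
    """Read the per-file status flags from last_run.json (empty dict if it doesn't exist)."""
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=STATUS_KEY)
        return json.loads(obj["Body"].read())
    except s3.exceptions.NoSuchKey:
        return {}

def save_status(status: dict) -> None:
    s3.put_object(Bucket=BUCKET, Key=STATUS_KEY, Body=json.dumps(status).encode("utf-8"))

def notify_failure(file_name: str) -> None:
    """After a second consecutive failure, email the name of the file that didn't update."""
    ses.send_email(
        Source="alerts@example.org",                      # placeholder addresses
        Destination={"ToAddresses": ["me@example.org"]},
        Message={
            "Subject": {"Data": f"Dashboard update failed: {file_name}"},
            "Body": {"Text": {"Data": f"{file_name} failed to update at both 00:05 and 01:05."}},
        },
    )

def lambda_handler(event, context):
    today = datetime.date.today().isoformat()
    status = load_status()
    if status.get("date") != today:
        status = {"date": today}                          # start with a clean slate each day
    for file_name, update in UPDATERS.items():
        if status.get(file_name) == "success":
            continue                                      # already handled at the 00:05 run
        try:
            update()
            status[file_name] = "success"
        except Exception:
            if status.get(file_name) == "failed":         # also failed at 00:05
                notify_failure(file_name)
            status[file_name] = "failed"
    save_status(status)
```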
Currently the Wikidata edit data, data on researcher publications, YouTube video views, and views of Gallery Commons pages are being collected daily by Python scripts running continuously on Steve's laptop. They need to be moved to AWS and run as a Lambda on a cron schedule.