HeardLibrary / vandycite


Move data collection scripts to AWS Lambda #84

Closed baskaufs closed 2 years ago

baskaufs commented 2 years ago

Currently the Wikidata edit data, data on researcher publications, YouTube video views, and views of Gallery Commons pages are collected daily by Python scripts running continuously on Steve's laptop. They need to be moved to AWS and run on a cron schedule as a Lambda.

baskaufs commented 2 years ago

The current scripts and data are in the dashboard repo. The scripts get the data from various APIs and use a GitHub SDK to commit the data daily after 00:00 UTC.
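For context, a minimal sketch of how a script can push a data file to GitHub with PyGithub; the repo name, file path, and token shown here are placeholders, not the actual dashboard values:

```python
from github import Github  # PyGithub

# Placeholder token; the real scripts load credentials from local files.
token = "ghp_xxxxxxxxxxxxxxxx"
g = Github(token)

# Placeholder repo and path for illustration only.
repo = g.get_repo("HeardLibrary/dashboard")
path = "data/example.csv"
new_content = "date,count\n2022-01-01,42\n"

# update_file() needs the SHA of the existing file version it is replacing.
contents = repo.get_contents(path)
repo.update_file(path, "daily data update", new_content, contents.sha)
```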

baskaufs commented 2 years ago

The lambda function is currently a work in progress. Issues dealt with during the conversion:

  1. Obfuscate credentials by storing them as pickled binaries in opaquely named files in an S3 bucket and loading them with pickle at runtime (see the sketch after this list).
  2. Convert from the requests module to urllib3 to avoid having to install any additional modules.
  3. Minor formatting cleanup; use docstrings for comments.
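A minimal sketch of what items 1 and 2 could look like in the Lambda code; the bucket name, object key, and API URL are placeholders, not the actual values:

```python
import pickle
import boto3
import urllib3

s3 = boto3.client('s3')
http = urllib3.PoolManager()

def load_credentials(bucket, key):
    """Load a pickled credentials object from an opaquely named S3 object."""
    obj = s3.get_object(Bucket=bucket, Key=key)
    return pickle.loads(obj['Body'].read())

# Placeholder bucket and key names for illustration only.
creds = load_credentials('my-config-bucket', 'a1b2c3.bin')

# Example API call using urllib3 instead of requests.
response = http.request(
    'GET',
    'https://www.wikidata.org/w/api.php',
    fields={'action': 'query', 'format': 'json'}
)
data = response.data.decode('utf-8')
```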

The major problem encountered was the PyGithub module, which is required to instantiate the Github object used to interact with the GitHub API. I followed this procedure to install it as a package, but couldn't get it to run without errors:

  1. To use the GitHub API, you need to deploy PyGithub: https://github.com/PyGithub/PyGithub . See https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-zip.html#configuration-function-update for general instructions about uploading a .zip archive.
  2. Follow the instructions at https://docs.aws.amazon.com/lambda/latest/dg/python-package.html#python-package-create-package-with-dependency
  3. To avoid incompatibility problems when the package is built on a different OS, build it on an EC2 instance running the same version of Linux as the Lambda.
  4. Check https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtimes.html to find the Linux version for the Python 3 runtime used by the Lambda (Amazon Linux 2 for x86_64).
  5. Created a t2.micro EC2 instance named baskauf_python_lambda_package_builder2, using the key pair file baskauf_python_lambda_package_builder.pem.
  6. Restrict permissions on the key file so SSH will accept it: chmod 400 /Users/baskausj/baskauf_python_lambda_package_builder.pem
  7. Connection notes at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html
  8. Connect using: ssh -i baskauf_python_lambda_package_builder.pem ec2-user@ec2-3-83-117-207.compute-1.amazonaws.com
  9. Python 3 (and pip3) comes preinstalled on the EC2 instance with the OS.
  10. Install the package into a local directory: pip3 install --target ./package PyGithub
  11. After creating the .zip, use scp to copy it from the EC2 instance to the local drive (in a terminal window that is not SSHed into the EC2 instance): scp -i baskauf_python_lambda_package_builder.pem ec2-user@ec2-3-83-117-207.compute-1.amazonaws.com:my-sourcecode-function/my-deployment-package.zip /Users/baskausj/my-deployment-package.zip
  12. Then add lambda_function.py to the zip file using: zip -g my-deployment-package.zip lambda_function.py

Unfortunately, the function still threw an error. See this discussion of the error message: https://stackoverflow.com/questions/57189352/aws-lambda-unable-to-import-module-python-handler-no-module-named-cffi-bac/57221052#57221052

A blog post suggests creating a Lambda layer as the best solution, since layers can be shared across different Lambdas without having to build all of the packages and upload them in a zip for each function.

For now I am mothballing the EC2 instance by stopping (but not terminating) it.

baskaufs commented 2 years ago

Gave up on pushing the data to GitHub, since I couldn't get the PyGithub module to deploy. Saving the results in a public S3 bucket is probably a better idea anyway: the way the script was running, if it ever failed to connect to GitHub (e.g. the API was down), the data files got corrupted. That actually happened for over a year without my noticing, so S3 should be more reliable.
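A minimal sketch of writing a result file to a public S3 bucket; the bucket and key names are placeholders, and public reads would depend on the bucket's policy rather than anything in this code:

```python
import boto3

s3 = boto3.client('s3')

def save_result(csv_text, bucket='my-public-data-bucket', key='data/example.csv'):
    """Write a CSV string to S3, where it can be served from the bucket's public URL."""
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=csv_text.encode('utf-8'),
        ContentType='text/csv'
    )
```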

The final code is here. It runs on us-east-1 and is called collect_api_data. The trigger is set up as an EventBridge (formerly CloudWatch Events) rule with the ARN arn:aws:events:us-east-1:555751041262:rule/trigger_api_data_collection, using the cron expression 5 0-1 * * ? *, which runs at 00:05 and 01:05 UTC every day. The function keeps track of the last update status in a last_run.json file. If any update is unsuccessful at 00:05, it is run again at 01:05; if it is unsuccessful a second time, the function sends me an email naming the file that failed to update.
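A rough sketch of how that last_run.json retry logic might be structured; the bucket name, file list, update_file stub, SES email addresses, and the use of SES itself are assumptions for illustration, not the actual implementation:

```python
import json
from datetime import datetime, timezone
import boto3

s3 = boto3.client('s3')
ses = boto3.client('ses')

BUCKET = 'my-data-bucket'                             # placeholder bucket name
FILES = ['wikidata_edits.csv', 'youtube_views.csv']   # placeholder file list

def update_file(filename):
    """Placeholder for the real data collection and upload code."""
    raise NotImplementedError

def lambda_handler(event, context):
    first_run = datetime.now(timezone.utc).hour == 0  # 00:05 run vs. 01:05 retry

    if first_run:
        status = {}  # start a fresh status record each day
    else:
        status = json.loads(
            s3.get_object(Bucket=BUCKET, Key='last_run.json')['Body'].read())

    for filename in FILES:
        if not first_run and status.get(filename) == 'success':
            continue  # already updated successfully at 00:05
        try:
            update_file(filename)
            status[filename] = 'success'
        except Exception:
            status[filename] = 'failed'
            if not first_run:
                # Failed on both runs: send a notification email (SES is an assumption).
                ses.send_email(
                    Source='alerts@example.org',
                    Destination={'ToAddresses': ['me@example.org']},
                    Message={'Subject': {'Data': 'Data collection failure'},
                             'Body': {'Text': {'Data': f'{filename} failed to update.'}}})

    # Save the updated status for the next scheduled run.
    s3.put_object(Bucket=BUCKET, Key='last_run.json',
                  Body=json.dumps(status).encode('utf-8'))
```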