Automate pipeline

pmayd commented 1 year ago

Idea:

automate pipeline when uploading files
uploading a file to a bucket will trigger a function/container to process this file
(new) data is automatically ingested into the database
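
To make the idea concrete, here is a minimal Python sketch of what the upload-triggered function could look like. It assumes a storage-triggered function that receives the usual "object finalized" event payload with "bucket" and "name" keys. The accepted file suffixes, the step names and the function names are hypothetical placeholders, not something that exists in this project yet.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical: file types the pipeline would accept; adjust to the real inputs.
ACCEPTED_SUFFIXES = {".xlsx", ".csv"}


def build_job(event: dict) -> Optional[dict]:
    """Turn a storage "object finalized" event payload into a job description.

    Returns None for objects the pipeline should ignore.
    """
    bucket = event["bucket"]
    name = event["name"]
    suffix = PurePosixPath(name).suffix.lower()
    if suffix not in ACCEPTED_SUFFIXES:
        return None
    # The step names are placeholders for "process the file" and "ingest the data".
    return {"source_uri": f"gs://{bucket}/{name}", "steps": ["clean", "ingest"]}


def handle_upload(event: dict, context=None) -> None:
    """Entry point shape for a storage-triggered function (the wiring is assumed)."""
    job = build_job(event)
    if job is None:
        print(f"Skipping {event.get('name')}")
        return
    # Placeholder: here the function would start the processing container.
    print(f"Would process {job['source_uri']} via {job['steps']}")
```

Keeping the decision logic in build_job, separate from the entry point, would let us test it locally without deploying anything.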

tasosbada commented 1 year ago

I think the automation should be done in at least two steps:

A container for cleaning the data, based on our data cleaning pipeline in R.
A container that runs a bash script for the data upload and runs BigQuery queries. This script could be stored in GCS, so we could modify it.

For the first container, the files can be found in https://storage.cloud.google.com/a4d-315220-documents/docker-a4d-data-extraction/docker-a4d-data-extraction.zip, and the documentation can be found in the readme.md file. The problem is that although it worked for me locally, it could not be deployed on GCP Cloud Run. It crashed due to "devtools". I have not tried it with the latest version of R and our code. A possible solution would be to install the dependencies without "devtools", or to try it on a Kubernetes cluster.

The second container could even be a Cloud Function, but it needs communication with and access to our GCS for the bash script. I intend to build a container to test this approach. By keeping the bash script on our GCS, we have the flexibility to adapt it and use this container only as a runtime.
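
As a rough illustration of the "container only as runtime" approach, the entry point of the second container could fetch the script from GCS and execute it. This is only a sketch under assumptions: it relies on the gsutil CLI and bash being available in the image, and the script URI, the local path and the function names are hypothetical.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], None]


def _run(cmd: Sequence[str]) -> None:
    # check=True raises CalledProcessError if the command exits non-zero.
    subprocess.run(list(cmd), check=True)


def run_remote_script(
    script_uri: str,
    local_path: str = "/tmp/pipeline.sh",  # hypothetical download location
    runner: Runner = _run,
) -> List[List[str]]:
    """Copy a bash script from GCS to the container and execute it.

    Returns the list of commands that were passed to the runner, in order.
    """
    commands = [
        ["gsutil", "cp", script_uri, local_path],  # fetch the current script
        ["bash", local_path],                      # run upload + queries
    ]
    for cmd in commands:
        runner(cmd)
    return commands
```

Because the script is fetched at start-up, changing the upload or the queries would only require replacing the file in the bucket, not rebuilding the image. The runner argument is there so the command sequence can be checked without calling gsutil.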

tasosbada commented 12 months ago

The Docker image template and the repository for the second step (the container that runs a bash script for the data upload and runs ...), together with the instructions, can be found in our bucket. I zipped it and stored it in our bucket in case you want to use it in the future. Information and the step-by-step process can be found in the readme.md file.

tasosbada commented 11 months ago

The zip file contains the necessary files and documentation for building and deploying the data cleaning pipeline on GCP Cloud Run. The problem mentioned above is fixed and the pipeline now runs on Cloud Run. Since I do not have any real input data files, it complains about the missing input; otherwise it generates the log file properly, which means that there is no problem in execution and all the packages are loaded properly.

Please feel free to contact me if you have any questions or need any help.

Best regards.

CorrelAidSwitzerland / a4d

Automate pipeline #85