To contribute to this project:
- Add the SatcherInstitute repository as a remote named `upstream`. This can be done with the command `git remote add upstream git@github.com:SatcherInstitute/<repo>.git`.
When you're ready to make changes:
- Pull the latest changes from `upstream` by running `git pull upstream master`.
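For example, a typical sync of your fork before starting new work might look like this (assuming your fork is the `origin` remote and the default branch is `master`):

```sh
# Bring local master up to date with upstream, then update your fork on GitHub.
git checkout master
git pull upstream master
git push origin master
```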
Note that there are a few downsides to "Squash and merge". For example, if you have an open PR on `my_branch_1` and you want to start work on a new PR that is dependent on `my_branch_1`, you can do the following:

1. Create a new branch `my_branch_2` based on `my_branch_1`. Continue to develop on `my_branch_2`.
2. If `my_branch_1` is updated (including by merging changes from master), switch to `my_branch_2` and run `git rebase -i my_branch_1` to incorporate the changes into `my_branch_2` while maintaining the branch dependency.
3. When `my_branch_1` is merged into master, don't delete `my_branch_1` yet.
4. Run `git rebase --onto master my_branch_1 my_branch_2`. This tells git to move all the commits between `my_branch_1` and `my_branch_2` onto master. You can now delete `my_branch_1`.
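As a concrete sketch of that last step, assuming `my_branch_1` has just been merged into master and your local master is up to date with `upstream`:

```sh
# Move my_branch_2's commits onto the updated master, then remove the merged branch.
git checkout master
git pull upstream master
git rebase --onto master my_branch_1 my_branch_2
git branch -D my_branch_1
```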
Read more about the forking workflow here. For details on "Squash and merge" see here.
Install Cloud SDK (Quickstart)
Install Terraform (Getting started)
Install Docker Desktop (Get Docker)
gcloud config set project <project-id>
python3 -m venv .venv
source .venv/bin/activate
pip install pip-tools
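After activating the virtual environment, you'll typically install the dependencies of whichever service you're working on; for example (the service directory here is just illustrative):

```sh
# Install a service's pinned dependencies into the active virtualenv.
pip install -r run_ingestion/requirements.txt
```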
Manually deploying a Cloud Function may be useful in certain cases.
Although a function can be created via the gcloud functions deploy
command, there are some options you need to configure the first time it is deployed. It is much easier to create the function from the cloud console, and then use the command line to deploy source code updates.
Once a function is created, to deploy it from the command line:
1. `cd` into the directory the function's `main.py` file is in.
2. Run `gcloud functions deploy fn_name`.

Note that this deploys the contents of the current directory to the cloud function specified by `fn_name`. Be careful, as this will overwrite the contents of `fn_name` with the contents of the current directory. You can use this for testing and development by deploying the source code to a test function.
To change configuration details, you have to specify these options in the `deploy` command. For example:

- To change the function's entry point, use the `--entry-point` option.
- To change the Pub/Sub topic that triggers the function, use the `--trigger-topic` option.

A full list of options can be found here. Changing the configuration of the function is usually easier from the Cloud Console UI.
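For instance, a deploy that also updates the entry point and trigger topic might look like the following (the function, entry point, and topic names are placeholders):

```sh
# Redeploy the current directory, overriding the function's entry point and the
# Pub/Sub topic that triggers it.
gcloud functions deploy my_test_fn \
  --entry-point=ingest_data \
  --trigger-topic=my_test_topic
```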
To test a Cloud Function or Cloud Run service triggered by a Pub/Sub topic, run
gcloud pubsub topics publish projects/<project-id>/topics/<your_topic_name> --message "your_message"
The message you publish will be passed in the `'data'` property of the event. Note that this method will work for the upload-to-GCS function or service, which expects to read information from the `'data'` field. The GCS-to-BQ function or service expects to read from the `'attributes'` field, so the `--attribute` flag should be used instead. See the documentation for details.
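For instance, a publish that passes data through attributes instead of the message body might look like this (the topic name and attribute keys are placeholders and should match whatever the GCS-to-BQ service actually expects):

```sh
# Carry the payload in Pub/Sub attributes rather than the 'data' field.
gcloud pubsub topics publish projects/<project-id>/topics/<gcs_to_bq_topic_name> \
  --attribute=id=STATE_NAMES,gcs_bucket=<gcs_landing_bucket>,filename=state_names.json
```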
For example, you can use the following command to trigger ingestion for the list of state names and state codes. Note that on Windows the inner double quotes must be escaped with backslashes or the message won't be serialized correctly; macOS and Linux shells may not require the escaping.
gcloud pubsub topics publish projects/temporary-sandbox-290223/topics/{upload_to_gcs_topic_name} --message "{\"id\":\"STATE_NAMES\", \"url\":\"https://api.census.gov/data/2010/dec/sf1\", \"gcs_bucket\":{gcs_landing_bucket}, \"filename\":\"state_names.json\"}"
where `upload_to_gcs_topic_name` and `gcs_landing_bucket` are the same as the Terraform variables of the same name.
Most python code should go in the /python
directory, which contains packages that can be installed into any service. Each sub-directory of /python
is a package with an __init__.py
file, a setup.py
file, and a requirements.in
file. Shared code should go in one of these packages. If a new sub-package is added:
Create a folder `/python/<new_package>`. Inside, add:

- an `__init__.py` file
- a `setup.py` file with options: `name=<new_package>`, `package_dir={'<new_package>': ''}`, and `packages=['<new_package>']`
- a `requirements.in` file with the necessary dependencies

For each service that depends on `/python/<new_package>`, follow the instructions at Adding an internal dependency (see the sketch below).
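A minimal sketch of creating such a package, assuming a hypothetical package named `my_util` and a placeholder dependency:

```sh
# Create the package skeleton under /python (names are illustrative only).
mkdir python/my_util
touch python/my_util/__init__.py
cat > python/my_util/setup.py <<'EOF'
from setuptools import setup

setup(
    name='my_util',
    package_dir={'my_util': ''},
    packages=['my_util'],
)
EOF
echo "requests" > python/my_util/requirements.in
```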
To work with the code locally, run pip install ./python/<package>
from the root project directory. If your IDE complains about imports after changing code in /python
, re-run pip install ./python/<package>
.
Note: the /python
directory has three root-level files that aren't necessary: main.py
, requirements.in
, and requirements.txt
. These exist purely so the whole /python
directory can be deployed as a cloud function, in case people are relying on that for development/quick iteration. Due to limitations with cloud functions, these files have to exist directly in the root folder. We should eventually remove these.
To add a new dependency, add it to the appropriate `requirements.in` file:

- If the dependency is used by `/python/<package>`, add it to the `/python/<package>/requirements.in` file.
- If the dependency is used directly by a service, add it to the `<service_directory>/requirements.in` file.

Then, for each service that needs the dependency (for a change to `/python/<package>` this means every service that depends on `/python/<package>`):

1. Run `cd <service_directory>`, then `pip-compile requirements.in`, where `<service_directory>` is the root-level directory for the service. This will generate a `requirements.txt` file.
2. Run `pip install -r requirements.txt` to ensure your local environment has the dependencies, or run `pip install <new_dep>` directly. Note: you'll first need to have followed the Python environment setup described above.

If a service adds a dependency on `/python/<some_package>`:

- Add `-r ../python/<some_package>/requirements.in` to the `<service_directory>/requirements.in` file. This will ensure that any deps needed for the package get installed for the service.
- Follow the steps above to regenerate the relevant `requirements.txt` files.
- Add `RUN pip install ./python/<some_package>` to `<service_directory>/Dockerfile`.

A sketch of the full flow is shown below.
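As an illustrative end-to-end sketch (the package, service, and dependency names are all placeholders):

```sh
# Add an external dependency to a shared package, point a service at that package,
# and regenerate the service's pinned requirements.
echo "pandas" >> python/my_util/requirements.in
echo "-r ../python/my_util/requirements.in" >> run_ingestion/requirements.in
cd run_ingestion
pip-compile requirements.in        # regenerates requirements.txt
pip install -r requirements.txt    # sync your local virtualenv
```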
The Cloud Code plugin for VS Code and JetBrains IDEs lets you locally run and debug your container image in a Cloud Run emulator within your IDE. The emulator allows you to configure an environment that is representative of your service running on Cloud Run.
1. After installing the plugin, a `Cloud Code` entry should be added to the bottom toolbar of your editor.
2. Selecting the `Run on Cloud Run emulator` option will begin the process of setting up the configuration for your Cloud Run service.
3. Set the image URL to `gcr.io/<PROJECT_ID>/<NAME>`.
4. Make sure the builder is set to `Docker` and the correct Dockerfile path is selected, e.g. `prototype/run_ingestion/Dockerfile`.
5. Make sure the `Automatically re-build and re-run on changes` checkbox is selected for hot reloading.

After your Docker container successfully builds and is running locally you can start sending requests.
Send curl requests in the following format:
DATA=$(printf '{"id":<INGESTION_ID>,"url":<INGESTION_URL>,"gcs_bucket":<BUCKET_NAME>,"filename":<FILE_NAME>}' |base64) && curl --header "Content-Type: application/json" -d '{"message":{"data":"'$DATA'"}}' http://localhost:8080
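For example, filling in the placeholders with the state-names ingestion values used earlier (the bucket name here is just a placeholder):

```sh
# Base64-encode the ingestion payload and POST it to the locally running container.
DATA=$(printf '{"id":"STATE_NAMES","url":"https://api.census.gov/data/2010/dec/sf1","gcs_bucket":"my-landing-bucket","filename":"state_names.json"}' | base64) && \
curl --header "Content-Type: application/json" -d '{"message":{"data":"'$DATA'"}}' http://localhost:8080
```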
In your `launch.json` file, set the `configuration->service->serviceAccountName` attribute to the service account email you just created.

Before deploying, make sure you have installed Terraform and a Docker client (e.g. Docker Desktop). See One time setup above.
Create your own terraform.tfvars
file in the same directory as the other terraform files. For each variable declared in prototype_variables.tf
that doesn't have a default, add your own for testing. Typically your own variables should be unique and can just be prefixed with your name or ldap. There are some that have specific requirements like project ids, code paths, and image paths.
Configure docker to use credentials through gcloud.
gcloud auth configure-docker
On the command line, navigate to your project directory and initialize terraform.
cd path/to/your/project
terraform init
Build and push your Docker images to Google Container Registry. Select any unique identifier for `your-[ingestion|gcs-to-bq]-image-name`.
# Build the images locally
docker build -t gcr.io/<project-id>/<your-ingestion-image-name> -f run_ingestion/Dockerfile .
docker build -t gcr.io/<project-id>/<your-gcs-to-bq-image-name> -f run_gcs_to_bq/Dockerfile .
# Upload the image to Google Container Registry
docker push gcr.io/<project-id>/<your-ingestion-image-name>
docker push gcr.io/<project-id>/<your-gcs-to-bq-image-name>
Deploy via Terraform.
# Get the latest image digests
export TF_VAR_ingestion_image_name=$(gcloud container images describe gcr.io/<project-id>/<your-ingestion-image-name> \
--format="value(image_summary.digest)")
export TF_VAR_gcs_to_bq_image_name=$(gcloud container images describe gcr.io/<project-id>/<your-gcs-to-bq-image-name> \
--format="value(image_summary.digest)")
# Deploy via terraform, providing the paths to the latest images so it knows to redeploy
terraform apply -var="ingestion_image_name=<your-ingestion-image-name>@$TF_VAR_ingestion_image_name" \
-var="gcs_to_bq_image_name=<your-gcs-to-bq-image-name>@$TF_VAR_gcs_to_bq_image_name"
Alternatively, if you aren't familiar with bash or are on Windows, you can run the above gcloud container images describe
commands manually and copy/paste the output into your tfvars file for the ingestion_image_name
and gcs_to_bq_image_name
variables.
To redeploy, e.g. after making changes to a Cloud Run service, repeat steps 4-5. Make sure you run the commands from your base project dir.
Currently the setup deploys both a cloud function and a Cloud Run instance for each pipeline. These are duplicates of each other. Eventually, we will delete the cloud functions, but for now you can just comment out the setup for whichever one you don't want to use in prototype.tf.
Terraform doesn't automatically diff the contents of the functions/cloud run service, so simply calling terraform apply
after making code changes won't upload your new changes. This is why Steps 4 and 5 are needed above. Here are several alternatives:
- Use `terraform taint` to mark a resource as requiring redeploy, e.g. `terraform taint google_cloud_run_service.ingestion_service`.
- Set the `run_ingestion_image_path` variable in your tfvars file to `gcr.io/<project-id>/<your-ingestion-image-name>` and `run_gcs_to_bq_image_path` to `gcr.io/<project-id>/<your-gcs-to-bq-image-name>`. Then replace Step 5 above with just `terraform apply`. Step 4 is still required.
- Run `terraform taint` and then `terraform apply`.
- Run `terraform destroy` every time before `terraform apply`. This is slow but a good way to start from a clean slate. Note that this doesn't remove old container images, so it doesn't help for Cloud Run services.
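A short sketch of the taint-then-apply option, using the resource name from the example above (any required variables are assumed to already be set in your tfvars file):

```sh
# Force Terraform to treat the Cloud Run service as changed, then redeploy it.
terraform taint google_cloud_run_service.ingestion_service
terraform apply
```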