To contribute to this project:
- Add the SatcherInstitute repository as a remote named `upstream`. This can be done with the command `git remote add upstream git@github.com:SatcherInstitute/<repo>.git`.
When you're ready to make changes:
- Pull the latest changes from `upstream` by running `git pull upstream master`.
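For example, a typical sync of your fork before starting new work might look like this (assuming your fork is the `origin` remote and the default branch is `master`):

```sh
# Bring local master up to date with upstream, then update your fork on GitHub.
git checkout master
git pull upstream master
git push origin master
```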
Note that there are a few downsides to "Squash and merge". For example, if you have an open PR on `my_branch_1` and you want to start work on a new PR that is dependent on `my_branch_1`, you can do the following:

1. Create a new branch `my_branch_2` based on `my_branch_1`. Continue to develop on `my_branch_2`.
2. If `my_branch_1` is updated (including by merging changes from master), switch to `my_branch_2` and run `git rebase -i my_branch_1` to incorporate the changes into `my_branch_2` while maintaining the branch dependency.
3. When `my_branch_1` is merged into master, don't delete `my_branch_1` yet.
4. Run `git rebase --onto master my_branch_1 my_branch_2`. This tells git to move all the commits between `my_branch_1` and `my_branch_2` onto master. You can now delete `my_branch_1`.
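As a concrete sketch of that last step, assuming `my_branch_1` has just been merged into master and your local master is up to date with `upstream`:

```sh
# Move my_branch_2's commits onto the updated master, then remove the merged branch.
git checkout master
git pull upstream master
git rebase --onto master my_branch_1 my_branch_2
git branch -D my_branch_1
```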
Read more about the forking workflow here. For details on "Squash and merge" see here.
Install Cloud SDK (Quickstart)
Install Terraform (Getting started)
Install Docker Desktop (Get Docker)
gcloud config set project <project-id>
python3 -m venv .venv
source .venv/bin/activate
pip install pip-tools
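After activating the virtual environment, you'll typically install the dependencies of whichever service you're working on; for example (the service directory here is just illustrative):

```sh
# Install a service's pinned dependencies into the active virtualenv.
pip install -r run_ingestion/requirements.txt
```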
Manually deploying a Cloud Function may be useful in certain cases.
Although a function can be created via the gcloud functions deploy
command, there are some options you need to configure the first time it is deployed. It is much easier to create the function from the cloud console, and then use the command line to deploy source code updates.
Once a function is created, to deploy it from the command line:
1. `cd` into the directory the function's `main.py` file is in.
2. Run `gcloud functions deploy fn_name`.

Note that this deploys the contents of the current directory to the cloud function specified by `fn_name`. Be careful, as this will overwrite the contents of `fn_name` with the contents of the current directory. You can use this for testing and development by deploying the source code to a test function.
To change configuration details, you have to specify these options in the `deploy` command. For example:

- To change the function's entry point, use the `--entry-point` option.
- To change the Pub/Sub topic that triggers the function, use the `--trigger-topic` option.

A full list of options can be found here. Changing the configuration of the function is usually easier from the Cloud Console UI.
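For instance, a deploy that also updates the entry point and trigger topic might look like the following (the function, entry point, and topic names are placeholders):

```sh
# Redeploy the current directory, overriding the function's entry point and the
# Pub/Sub topic that triggers it.
gcloud functions deploy my_test_fn \
  --entry-point=ingest_data \
  --trigger-topic=my_test_topic
```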
To test a Cloud Function or Cloud Run service triggered by a Pub/Sub topic, run
gcloud pubsub topics publish projects/<project-id>/topics/<your_topic_name> --message "your_message"
The message you publish will be passed in the `'data'` property of the event. Note that this method will work for the upload-to-GCS function or service, which expects to read information from the `'data'` field. The GCS-to-BQ function or service expects to read from the `'attributes'` field, so the `--attribute` flag should be used instead. See the documentation for details.
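For instance, a publish that passes data through attributes instead of the message body might look like this (the topic name and attribute keys are placeholders and should match whatever the GCS-to-BQ service actually expects):

```sh
# Carry the payload in Pub/Sub attributes rather than the 'data' field.
gcloud pubsub topics publish projects/<project-id>/topics/<gcs_to_bq_topic_name> \
  --attribute=id=STATE_NAMES,gcs_bucket=<gcs_landing_bucket>,filename=state_names.json
```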
For example, you can use the following command to trigger ingestion for the list of state names and state codes. Note that on Windows the inner double quotes must be escaped with backslashes or the message won't be serialized correctly; macOS and Linux shells may not require the escaping.
gcloud pubsub topics publish projects/temporary-sandbox-290223/topics/{upload_to_gcs_topic_name} --message "{\"id\":\"STATE_NAMES\", \"url\":\"https://api.census.gov/data/2010/dec/sf1\", \"gcs_bucket\":{gcs_landing_bucket}, \"filename\":\"state_names.json\"}"
where `upload_to_gcs_topic_name` and `gcs_landing_bucket` are the same as the Terraform variables of the same name.
Most python code should go in the /python
directory, which contains packages that can be installed into any service. Each sub-directory of /python
is a package with an __init__.py
file, a setup.py
file, and a requirements.in
file. Shared code should go in one of these packages. If a new sub-package is added:
Create a folder `/python/<new_package>`. Inside, add:

- an `__init__.py` file
- a `setup.py` file with options: `name=<new_package>`, `package_dir={'<new_package>': ''}`, and `packages=['<new_package>']`
- a `requirements.in` file with the necessary dependencies

For each service that depends on `/python/<new_package>`, follow the instructions at Adding an internal dependency (see the sketch below).
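A minimal sketch of creating such a package, assuming a hypothetical package named `my_util` and a placeholder dependency:

```sh
# Create the package skeleton under /python (names are illustrative only).
mkdir python/my_util
touch python/my_util/__init__.py
cat > python/my_util/setup.py <<'EOF'
from setuptools import setup

setup(
    name='my_util',
    package_dir={'my_util': ''},
    packages=['my_util'],
)
EOF
echo "requests" > python/my_util/requirements.in
```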
To work with the code locally, run pip install ./python/<package>
from the root project directory. If your IDE complains about imports after changing code in /python
, re-run pip install ./python/<package>
.
Note: the /python
directory has three root-level files that aren't necessary: main.py
, requirements.in
, and requirements.txt
. These exist purely so the whole /python
directory can be deployed as a cloud function, in case people are relying on that for development/quick iteration. Due to limitations with cloud functions, these files have to exist directly in the root folder. We should eventually remove these.
To add a new dependency, add it to the appropriate `requirements.in` file:

- If the dependency is used by `/python/<package>`, add it to the `/python/<package>/requirements.in` file.
- If the dependency is used directly by a service, add it to the `<service_directory>/requirements.in` file.

Then, for each service that needs the dependency (for a change to `/python/<package>` this means every service that depends on `/python/<package>`):

1. Run `cd <service_directory>`, then `pip-compile requirements.in`, where `<service_directory>` is the root-level directory for the service. This will generate a `requirements.txt` file.
2. Run `pip install -r requirements.txt` to ensure your local environment has the dependencies, or run `pip install <new_dep>` directly. Note: you'll first need to have followed the Python environment setup described above.

If a service adds a dependency on `/python/<some_package>`:

- Add `-r ../python/<some_package>/requirements.in` to the `<service_directory>/requirements.in` file. This will ensure that any deps needed for the package get installed for the service.
- Follow the steps above to regenerate the relevant `requirements.txt` files.
- Add `RUN pip install ./python/<some_package>` to `<service_directory>/Dockerfile`.

A sketch of the full flow is shown below.
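As an illustrative end-to-end sketch (the package, service, and dependency names are all placeholders):

```sh
# Add an external dependency to a shared package, point a service at that package,
# and regenerate the service's pinned requirements.
echo "pandas" >> python/my_util/requirements.in
echo "-r ../python/my_util/requirements.in" >> run_ingestion/requirements.in
cd run_ingestion
pip-compile requirements.in        # regenerates requirements.txt
pip install -r requirements.txt    # sync your local virtualenv
```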
The Cloud Code plugin for VS Code and JetBrains IDEs lets you locally run and debug your container image in a Cloud Run emulator within your IDE. The emulator allows you to configure an environment that is representative of your service running on Cloud Run.
1. After installing the plugin, a `Cloud Code` entry should be added to the bottom toolbar of your editor.
2. Selecting the `Run on Cloud Run emulator` option will begin the process of setting up the configuration for your Cloud Run service.
3. Set the image URL to `gcr.io/<PROJECT_ID>/<NAME>`.
4. Make sure the builder is set to `Docker` and the correct Dockerfile path is selected, e.g. `prototype/run_ingestion/Dockerfile`.
5. Make sure the `Automatically re-build and re-run on changes` checkbox is selected for hot reloading.

After your Docker container successfully builds and is running locally you can start sending requests.
Send curl requests in the following format:
DATA=$(printf '{"id":<INGESTION_ID>,"url":<INGESTION_URL>,"gcs_bucket":<BUCKET_NAME>,"filename":<FILE_NAME>}' |base64) && curl --header "Content-Type: application/json" -d '{"message":{"data":"'$DATA'"}}' http://localhost:8080
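For example, filling in the placeholders with the state-names ingestion values used earlier (the bucket name here is just a placeholder):

```sh
# Base64-encode the ingestion payload and POST it to the locally running container.
DATA=$(printf '{"id":"STATE_NAMES","url":"https://api.census.gov/data/2010/dec/sf1","gcs_bucket":"my-landing-bucket","filename":"state_names.json"}' | base64) && \
curl --header "Content-Type: application/json" -d '{"message":{"data":"'$DATA'"}}' http://localhost:8080
```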
In your `launch.json` file, set the `configuration->service->serviceAccountName` attribute to the service account email you just created.

Before deploying, make sure you have installed Terraform and a Docker client (e.g. Docker Desktop). See One time setup above.
Create your own terraform.tfvars
file in the same directory as the other terraform files. For each variable declared in prototype_variables.tf
that doesn't have a default, add your own for testing. Typically your own variables should be unique and can just be prefixed with your name or ldap. There are some that have specific requirements like project ids, code paths, and image paths.
Configure docker to use credentials through gcloud.
gcloud auth configure-docker
On the command line, navigate to your project directory and initialize terraform.
cd path/to/your/project
terraform init
Build and push your Docker images to Google Container Registry. Select any unique identifier for `your-[ingestion|gcs-to-bq]-image-name`.
# Build the images locally
docker build -t gcr.io/<project-id>/<your-ingestion-image-name> -f run_ingestion/Dockerfile .
docker build -t gcr.io/<project-id>/<your-gcs-to-bq-image-name> -f run_gcs_to_bq/Dockerfile .
# Upload the image to Google Container Registry
docker push gcr.io/<project-id>/<your-ingestion-image-name>
docker push gcr.io/<project-id>/<your-gcs-to-bq-image-name>
Deploy via Terraform.
# Get the latest image digests
export TF_VAR_ingestion_image_name=$(gcloud container images describe gcr.io/<project-id>/<your-ingestion-image-name> \
--format="value(image_summary.digest)")
export TF_VAR_gcs_to_bq_image_name=$(gcloud container images describe gcr.io/<project-id>/<your-gcs-to-bq-image-name> \
--format="value(image_summary.digest)")
# Deploy via terraform, providing the paths to the latest images so it knows to redeploy
terraform apply -var="ingestion_image_name=<your-ingestion-image-name>@$TF_VAR_ingestion_image_name" \
-var="gcs_to_bq_image_name=<your-gcs-to-bq-image-name>@$TF_VAR_gcs_to_bq_image_name"
Alternatively, if you aren't familiar with bash or are on Windows, you can run the above gcloud container images describe
commands manually and copy/paste the output into your tfvars file for the ingestion_image_name
and gcs_to_bq_image_name
variables.
To redeploy, e.g. after making changes to a Cloud Run service, repeat steps 4-5. Make sure you run the commands from your base project dir.
Currently the setup deploys both a cloud function and a Cloud Run instance for each pipeline. These are duplicates of each other. Eventually, we will delete the cloud functions, but for now you can just comment out the setup for whichever one you don't want to use in prototype.tf.
Terraform doesn't automatically diff the contents of the functions/cloud run service, so simply calling terraform apply
after making code changes won't upload your new changes. This is why Steps 4 and 5 are needed above. Here are several alternatives:
- Use `terraform taint` to mark a resource as requiring redeploy, e.g. `terraform taint google_cloud_run_service.ingestion_service`.
- Set the `run_ingestion_image_path` variable in your tfvars file to `gcr.io/<project-id>/<your-ingestion-image-name>` and `run_gcs_to_bq_image_path` to `gcr.io/<project-id>/<your-gcs-to-bq-image-name>`. Then replace Step 5 above with just `terraform apply`. Step 4 is still required.
- Run `terraform taint` and then `terraform apply`.
- Run `terraform destroy` every time before `terraform apply`. This is slow but a good way to start from a clean slate. Note that this doesn't remove old container images, so it doesn't help for Cloud Run services.
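A short sketch of the taint-then-apply option, using the resource name from the example above (any required variables are assumed to already be set in your tfvars file):

```sh
# Force Terraform to treat the Cloud Run service as changed, then redeploy it.
terraform taint google_cloud_run_service.ingestion_service
terraform apply
```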