SatcherInstitute / prototype

MIT License

Health Equity Tracker Prototype

Contributing

To contribute to this project:

  1. Fork the repository on GitHub
  2. On your development machine, clone your forked repo and add the official repo as a remote.
    • Tip: by convention, the official repo is added with the name upstream. This can be done with the command git remote add upstream git@github.com:SatcherInstitute/<repo>.git
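
The steps above can be sketched as the following command sequence (`<your-username>` and `<repo>` are placeholders):

```shell
# Clone your fork and register the official repo as the 'upstream' remote
git clone git@github.com:<your-username>/<repo>.git
cd <repo>
git remote add upstream git@github.com:SatcherInstitute/<repo>.git
git remote -v   # should now list both 'origin' (your fork) and 'upstream'
```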

When you're ready to make changes:

  1. Pull the latest changes from the official repo.
    • Tip: If your official remote is named upstream, run git pull upstream master
  2. Create a local branch, make changes, and commit to your local branch. Repeat until changes are ready for review.
  3. [Optional] Rebase your commits so you have few commits with clear commit messages.
  4. Push your branch to your remote fork, use the GitHub UI to open a pull request (PR), and add reviewer(s).
  5. Push new commits to your remote branch as you respond to reviewer comments.
    • Note: once a PR is under review, don't rebase changes you've already pushed to the PR. This can confuse reviewers.
  6. When ready to submit, use the "Squash and merge" option. This maintains linear history and ensures your entire PR is merged as a single commit, while being simple to use in most cases. If there are conflicts, pull the latest changes from master, merge them into your PR, and try again.

Note that there are a few downsides to "Squash and merge".

Read more about the forking workflow here. For details on "Squash and merge", see here.
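
The effect of "Squash and merge" can be sketched locally with `git merge --squash` in a throwaway repository (this is a local approximation of what the GitHub button does; it assumes git >= 2.28 for `init -b`):

```shell
# Demo: several WIP commits on a feature branch land on main as one commit
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q -b main
git config user.email "dev@example.com"
git config user.name "Dev"
echo base > file.txt && git add file.txt && git commit -qm "initial commit"

git checkout -qb my-feature
echo one >> file.txt && git commit -qam "wip: first pass"
echo two >> file.txt && git commit -qam "wip: address review comments"

git checkout -q main
git merge --squash my-feature      # stages the combined diff, does not commit
git commit -qm "Add my feature (squashed)"
git log --oneline                  # main has linear history: initial + one squashed commit
```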

One-time setup

Install Cloud SDK (Quickstart)
Install Terraform (Getting started)
Install Docker Desktop (Get Docker)

gcloud config set project <project-id>

Python environment setup

  1. Create a virtual environment in your project directory, for example: python3 -m venv .venv
  2. Activate the venv: source .venv/bin/activate
  3. Install pip-tools and other packages as needed: pip install pip-tools

Deploying functions manually

This may be useful in a few situations, such as testing and development.

Creating a function

Although a function can be created via the gcloud functions deploy command, there are some options you need to configure the first time it is deployed. It is much easier to create the function from the cloud console, and then use the command line to deploy source code updates.

Deploying a function

Once a function is created, to deploy it from the command line:

  1. Navigate to the directory containing the function's main.py file
  2. Run gcloud functions deploy fn_name

Note that this deploys the contents of the current directory to the cloud function specified by fn_name. Be careful as this will overwrite the contents of fn_name with the contents of the current directory. You can use this for testing and development by deploying the source code to a test function.

Changing function configuration

To change configuration details, you have to specify these options in the deploy command.
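
For instance, memory and timeout can be passed as flags on redeploy (the values here are hypothetical examples):

```shell
# Redeploy fn_name with updated memory and timeout settings
gcloud functions deploy fn_name --memory=512MB --timeout=300s
```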

A full list of options can be found here. Changing configuration of the function is usually easier from the cloud console UI.

Testing Pub/Sub triggers

To test a Cloud Function or Cloud Run service triggered by a Pub/Sub topic, run gcloud pubsub topics publish projects/<project-id>/topics/<your_topic_name> --message "your_message"

Note that this method will work for the upload to GCS function or service, which expects to read information from the 'data' field. The GCS-to-BQ function or service expects to read from the 'attributes' field, so the --attribute flag should be used instead. See Documentation for details.
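
The two trigger styles can be sketched as follows (topic and key/value names are placeholders):

```shell
# Message-body trigger: the upload-to-GCS service reads the 'data' field
gcloud pubsub topics publish projects/<project-id>/topics/<your_topic_name> \
  --message "your_message"

# Attribute trigger: the GCS-to-BQ service reads the 'attributes' field
gcloud pubsub topics publish projects/<project-id>/topics/<your_topic_name> \
  --attribute key1=value1,key2=value2
```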

Testing example

For example, you can use the following command to trigger ingestion for the list of state names and state codes. Note that the backslash escapes are required on Windows, which otherwise mangles the JSON quoting during serialization; macOS and Linux shells may not require them.

gcloud pubsub topics publish projects/temporary-sandbox-290223/topics/{upload_to_gcs_topic_name} --message "{\"id\":\"STATE_NAMES\", \"url\":\"https://api.census.gov/data/2010/dec/sf1\", \"gcs_bucket\":{gcs_landing_bucket}, \"filename\":\"state_names.json\"}"

where upload_to_gcs_topic_name and gcs_landing_bucket match the Terraform variables of the same name.

Shared python code

Most python code should go in the /python directory, which contains packages that can be installed into any service. Each sub-directory of /python is a package with an __init__.py file, a setup.py file, and a requirements.in file. Shared code should go in one of these packages. If a new sub-package is added:

  1. Create a folder /python/<new_package>. Inside, add:

    • An empty __init__.py file
    • A setup.py file with options: name=<new_package>, package_dir={'<new_package>': ''}, and packages=['<new_package>']
    • A requirements.in file with the necessary dependencies
  2. For each service that depends on /python/<new_package>, follow instructions at Adding an internal dependency
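
A minimal setup.py matching those options might look like the following, where `shared_utils` stands in for your hypothetical package name:

```python
# /python/shared_utils/setup.py -- 'shared_utils' is a placeholder package name
from setuptools import setup

setup(
    name='shared_utils',
    package_dir={'shared_utils': ''},
    packages=['shared_utils'],
)
```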

To work with the code locally, run pip install ./python/<package> from the root project directory. If your IDE complains about imports after changing code in /python, re-run pip install ./python/<package>.

Note: the /python directory has three root-level files that aren't necessary: main.py, requirements.in, and requirements.txt. These exist purely so the whole /python directory can be deployed as a cloud function, in case people are relying on that for development/quick iteration. Due to limitations with cloud functions, these files have to exist directly in the root folder. We should eventually remove these.

Adding python dependencies

Adding an external dependency

  1. Add the dependency to the appropriate requirements.in file.
    • If the dependency is used by /python/<package>, add it to the /python/<package>/requirements.in file.
    • If the dependency is used directly by a service, add it to the <service_directory>/requirements.in file.
  2. For each service that needs the dependency (for deps in /python/<package> this means every service that depends on /python/<package>):
    • Run cd <service_directory>, then pip-compile requirements.in where <service_directory> is the root-level directory for the service. This will generate a requirements.txt file.
    • Run pip install -r requirements.txt to ensure your local environment has the dependencies, or run pip install <new_dep> directly. Note: you'll first need to have completed the Python environment setup described above.
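
As a concrete sketch of those two steps (the service directory name is an example):

```shell
cd run_ingestion                 # example service directory
pip-compile requirements.in      # regenerates requirements.txt with pinned versions
pip install -r requirements.txt  # sync your local environment with the pins
```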

Adding an internal dependency

If a service adds a dependency on /python/<some_package>:

Cloud Run local testing with an emulator

The Cloud Code plugin for VS Code and JetBrains IDEs lets you locally run and debug your container image in a Cloud Run emulator within your IDE. The emulator allows you to configure an environment that is representative of your service running on Cloud Run.

Installation

  1. Install the Cloud Code plugin for VS Code or a JetBrains IDE.
  2. Follow the instructions for locally developing and debugging within your IDE.

Running the emulator

  1. After installing the VS Code plugin, a Cloud Code entry should be added to the bottom toolbar of your editor.
  2. Clicking on this and selecting the Run on Cloud Run emulator option will begin the process of setting up the configuration for your Cloud Run service.
  3. Give your service a name.
  4. Set the service container image URL in the following format: gcr.io/<PROJECT_ID>/<NAME>
  5. Make sure the builder is set to Docker and the correct Dockerfile path is selected: prototype/run_ingestion/Dockerfile
  6. Ensure the Automatically re-build and re-run on changes checkbox is selected for hot reloading.
  7. Click Run.

Sending requests

After your Docker container successfully builds and is running locally, you can start sending requests.

  1. Open a terminal
  2. Send curl requests in the following format:

    DATA=$(printf '{"id":<INGESTION_ID>,"url":<INGESTION_URL>,"gcs_bucket":<BUCKET_NAME>,"filename":<FILE_NAME>}' |base64) && curl --header "Content-Type: application/json" -d '{"message":{"data":"'$DATA'"}}' http://localhost:8080
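
The encoding that curl command performs can be illustrated end to end in Python: the push request body wraps a base64-encoded JSON payload in message.data, and the service reverses that to recover the ingestion parameters. The field values below are hypothetical examples.

```python
import base64
import json

# Client side: build the payload and wrap it the way the curl command does.
payload = {
    "id": "STATE_NAMES",
    "url": "https://api.census.gov/data/2010/dec/sf1",
    "gcs_bucket": "my-landing-bucket",   # example value
    "filename": "state_names.json",
}
body = {"message": {"data": base64.b64encode(json.dumps(payload).encode()).decode()}}

# Server side: decode message.data and parse the JSON back out.
decoded = json.loads(base64.b64decode(body["message"]["data"]))
print(decoded["id"])        # STATE_NAMES
print(decoded["filename"])  # state_names.json
```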

Accessing Google Cloud Services

  1. Create a service account in the Cloud Console (Pantheon).
  2. Using IAM, grant the appropriate permissions to the service account.
  3. Inside the launch.json file, set the configuration->service->serviceAccountName attribute to the email of the service account you just created.

Deploying your own instance with terraform

Before deploying, make sure you have installed Terraform and a Docker client (e.g. Docker Desktop). See One-time setup above.

  1. Create your own terraform.tfvars file in the same directory as the other terraform files. For each variable declared in prototype_variables.tf that doesn't have a default, add your own value for testing. Typically your values should be unique and can just be prefixed with your name or username. Some variables have specific requirements, such as project IDs, code paths, and image paths.

  2. Configure docker to use credentials through gcloud. gcloud auth configure-docker

  3. On the command line, navigate to your project directory and initialize terraform.

    cd path/to/your/project
    terraform init
  4. Build and push your Docker images to Google Container Registry. Select any unique identifier for your-[ingestion|gcs-to-bq]-image-name.

    # Build the images locally
    docker build -t gcr.io/<project-id>/<your-ingestion-image-name> -f run_ingestion/Dockerfile .
    docker build -t gcr.io/<project-id>/<your-gcs-to-bq-image-name> -f run_gcs_to_bq/Dockerfile .
    
    # Upload the image to Google Container Registry
    docker push gcr.io/<project-id>/<your-ingestion-image-name>
    docker push gcr.io/<project-id>/<your-gcs-to-bq-image-name>
  5. Deploy via Terraform.

    # Get the latest image digests
    export TF_VAR_ingestion_image_name=$(gcloud container images describe gcr.io/<project-id>/<your-ingestion-image-name> \
    --format="value(image_summary.digest)")
    export TF_VAR_gcs_to_bq_image_name=$(gcloud container images describe gcr.io/<project-id>/<your-gcs-to-bq-image-name> \
    --format="value(image_summary.digest)")
    
    # Deploy via terraform, providing the paths to the latest images so it knows to redeploy
    terraform apply -var="ingestion_image_name=<your-ingestion-image-name>@$TF_VAR_ingestion_image_name" \
    -var="gcs_to_bq_image_name=<your-gcs-to-bq-image-name>@$TF_VAR_gcs_to_bq_image_name"

    Alternatively, if you aren't familiar with bash or are on Windows, you can run the above gcloud container images describe commands manually and copy/paste the output into your tfvars file for the ingestion_image_name and gcs_to_bq_image_name variables.

  6. To redeploy, e.g. after making changes to a Cloud Run service, repeat steps 4-5. Make sure you run the commands from your base project dir.

Terraform deployment notes

Currently the setup deploys both a cloud function and a Cloud Run instance for each pipeline. These are duplicates of each other. Eventually we will delete the cloud functions, but for now you can comment out the setup for whichever one you don't want to use in prototype.tf.

Terraform doesn't automatically diff the contents of the functions/cloud run service, so simply calling terraform apply after making code changes won't upload your new changes. This is why Steps 4 and 5 are needed above. Here are several alternatives: