cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

data-infra

Welcome to the codebase for the Cal-ITP data warehouse and ETL pipeline.

Documentation for this codebase lives at docs.calitp.org/data-infra

Repository Structure

Contributing

Pre-commit

This repository uses pre-commit hooks to format code, including Black. This ensures baseline consistency in code formatting.

[!IMPORTANT]
Before contributing to this project, please install pre-commit locally: run pip install pre-commit, then pre-commit install, in the root of the repo.

Once installed, pre-commit checks run every time you commit locally, and a failing check blocks the commit until it is addressed. Many formatting issues are fixed automatically by the hooks themselves, so when a check fails, review the changes pre-commit made. If they resolved the problem, re-add the files and retry the commit; the checks should then succeed.

Installing pre-commit locally saves time dealing with formatting issues on pull requests. There is a GitHub Action that runs pre-commit on all files, not just changed ones, as part of our continuous integration.

[!NOTE]
SQLFluff is currently disabled in the CI run due to flakiness, but it will still lint any SQL files you attempt to commit locally. You will need to correct SQLFluff errors manually, because we found that SQLFluff's automated fixes could be too aggressive and could change the meaning and function of the affected code.

Pull requests

mypy

We encourage mypy compliance for Python code where possible, though we do not currently run mypy on Airflow DAGs. All service and job images pass mypy, which runs in the GitHub Actions that build the individual images. If you are unfamiliar with Python type hints or mypy, the following documentation links will prove useful.

In general, it should be relatively easy to make most of our code pass mypy, since we make heavy use of Pydantic types. Some imported modules that ship without type stubs, such as gcsfs and shapely, need to be ignored with # type: ignore on the import line (until stubs are available, if ever). Where extra asserts or other unusual-looking code exist only to satisfy mypy, we recommend adding a comment that says so.
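As a minimal sketch of the two patterns described above (an ignored import and a commented assert that exists for mypy's benefit), consider the following. The function names and the feed dictionary are hypothetical and are not taken from this codebase; the gcsfs import is left commented out so the snippet runs without that dependency installed.

```python
from typing import Dict, Optional

# An untyped third-party import would be silenced like this:
# import gcsfs  # type: ignore


def find_feed_name(feeds: Dict[str, str], key: str) -> Optional[str]:
    """Hypothetical lookup that returns None when the key is absent."""
    return feeds.get(key)


def feed_name_length(feeds: Dict[str, str], key: str) -> int:
    name = find_feed_name(feeds, key)
    # This assert narrows Optional[str] to str so mypy accepts len(name).
    # Callers are expected to pass a key that is present.
    assert name is not None, f"no feed registered for {key!r}"
    return len(name)
```

Without the assert, mypy would be expected to reject len(name) because name may be None; the comment tells future readers why the check is there.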

Configuration via Environment Variables

Generally we try to configure things via environment variables. In Kubernetes, these are set via Kustomize overlays (example). For Airflow jobs, we currently use hosted Google Cloud Composer, which has a user interface for editing environment variables. These variables must also be injected into pod operators as needed, via Gusty YAML or similar. If you are running Airflow locally, the docker compose file must define the required environment variables with appropriate values.
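On the consuming side, a job might read its configuration along these lines. This is only an illustrative sketch using the standard library: the variable names EXAMPLE_BUCKET and EXAMPLE_BATCH_SIZE, the JobConfig class, and the default of 100 are all assumptions made for the example, not names used by this repository.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class JobConfig:
    bucket: str
    batch_size: int


def load_config(env: Mapping[str, str] = os.environ) -> JobConfig:
    """Build a config from environment variables (hypothetical names)."""
    try:
        # Required: fail fast with a clear message if it is not set.
        bucket = env["EXAMPLE_BUCKET"]
    except KeyError:
        raise RuntimeError("EXAMPLE_BUCKET must be set") from None
    # Optional: fall back to an assumed default when unset.
    batch_size = int(env.get("EXAMPLE_BATCH_SIZE", "100"))
    return JobConfig(bucket=bucket, batch_size=batch_size)
```

Accepting the environment as a parameter keeps the function easy to test with a plain dictionary, while the default still reads the real process environment regardless of whether it was set by a Kustomize overlay, Composer, or docker compose.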