Welcome to the codebase for the Cal-ITP data warehouse and ETL pipeline.
Documentation for this codebase lives at docs.calitp.org/data-infra.
This repository uses pre-commit hooks to format code, including Black. This ensures baseline consistency in code formatting.
[!IMPORTANT]
Before contributing to this project, please install pre-commit locally by running `pip install pre-commit` and `pre-commit install` in the root of the repo.
Once installed, pre-commit checks will run before you can make commits locally. If a pre-commit check fails, it will need to be addressed before you can make your commit. Many formatting issues are fixed automatically within the pre-commit actions, so check the changes made by pre-commit on failure -- they may have automatically addressed the issues that caused the failure, in which case you can simply re-add the files, re-attempt the commit, and the checks will then succeed.
Installing pre-commit locally saves time dealing with formatting issues on pull requests. There is a GitHub Action that runs pre-commit on all files, not just changed ones, as part of our continuous integration.
[!NOTE]
SQLFluff is currently disabled in the CI run due to flakiness, but it will still lint any SQL files you attempt to commit locally. You will need to manually correct SQLFluff errors because we found that SQLFluff's automated fixes could be too aggressive and could change the meaning and function of affected code.
Do not merge the `main` branch back into a PR branch to update it. Instead, rebase PR branches to update them and resolve any merge conflicts.
We encourage mypy compliance for Python when possible, though we do not currently run mypy on Airflow DAGs. All service and job images do pass mypy, which runs in the GitHub Actions that build the individual images. If you are unfamiliar with Python type hints or mypy, the following documentation links will prove useful.
In general, it should be relatively easy to make most of our code pass mypy since we make heavy use of Pydantic types. Some of our imported modules, such as `gcsfs` and `shapely`, will need to be ignored with `# type: ignore` on import (until stubs are available, if ever). We recommend including comments where additional asserts or other weird-looking code exist to make mypy happy.
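As a minimal sketch of the kind of commented assert described above, the following shows an `assert` that narrows an `Optional` value so that mypy accepts the code after it. The function names and the config shape are hypothetical and are not taken from this repository.

```python
from typing import Dict, Optional

# An untyped third-party import would be marked like this (kept as a comment
# so the sketch runs without the package installed):
#   import gcsfs  # type: ignore


def find_bucket(config: Dict[str, str]) -> Optional[str]:
    """Hypothetical helper: return the configured bucket name, if any."""
    return config.get("bucket")


def bucket_uri(config: Dict[str, str]) -> str:
    """Build a gs:// URI from a config that is expected to name a bucket."""
    bucket = find_bucket(config)
    # This assert exists for mypy: it narrows Optional[str] to str so the
    # string operations below type-check. It also fails fast at runtime if
    # the config is missing the bucket.
    assert bucket is not None, "bucket must be configured"
    return "gs://" + bucket.strip("/")
```

Without the assert, mypy would reject `bucket.strip` because `bucket` could be `None`; the comment tells future readers why the otherwise redundant-looking line is there.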
Generally we try to configure things via environment variables. In the Kubernetes world, these get configured via Kustomize overlays. For Airflow jobs, we currently use hosted Google Cloud Composer, which has a user interface for editing environment variables. These environment variables also have to be injected into pod operators as needed via Gusty YAML or similar. If you are running Airflow locally, the docker compose file needs to contain appropriately set environment variables.
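On the code side, environment-based configuration might be read roughly as follows. This is an illustrative sketch only: the `get_setting` helper and the `EXAMPLE_BUCKET` variable name are assumptions for demonstration, not names used by this project.

```python
import os
from typing import Mapping, Optional


def get_setting(
    name: str,
    default: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Hypothetical helper: read a setting from the environment.

    Raises a clear error when a required variable is missing, so a
    misconfigured overlay, Composer environment, or docker compose file
    fails fast instead of producing a confusing error later.
    """
    value = env.get(name, default)
    if value is None:
        raise RuntimeError(f"Required environment variable {name} is not set")
    return value


if __name__ == "__main__":
    # EXAMPLE_BUCKET is a made-up variable name used only for illustration.
    print(get_setting("EXAMPLE_BUCKET", default="gs://example-bucket"))
```

Passing `env` explicitly keeps the helper easy to test without modifying the real process environment.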