.. readme-intro

.. image:: https://github.com/catalyst-cooperative/mozilla-sec-eia/workflows/tox-pytest/badge.svg
   :target: https://github.com/catalyst-cooperative/mozilla-sec-eia/actions?query=workflow%3Atox-pytest
   :alt: Tox-PyTest Status

.. image:: https://img.shields.io/codecov/c/github/catalyst-cooperative/mozilla-sec-eia?style=flat&logo=codecov
   :target: https://codecov.io/gh/catalyst-cooperative/mozilla-sec-eia
   :alt: Codecov Test Coverage

.. image:: https://img.shields.io/readthedocs/catalystcoop-mozilla-sec-eia?style=flat&logo=readthedocs
   :target: https://catalystcoop-mozilla-sec-eia.readthedocs.io/en/latest/
   :alt: Read the Docs Build Status

.. image:: https://img.shields.io/pypi/v/catalystcoop.mozilla-sec-eia?style=flat&logo=python
   :target: https://pypi.org/project/catalystcoop.mozilla-sec-eia/
   :alt: PyPI Latest Version

.. image:: https://img.shields.io/conda/vn/conda-forge/catalystcoop.mozilla-sec-eia?style=flat&logo=condaforge
   :target: https://anaconda.org/conda-forge/catalystcoop.mozilla-sec-eia
   :alt: conda-forge Version

.. image:: https://img.shields.io/badge/code%20style-black-000000.svg
   :target: https://github.com/psf/black
   :alt: Any color you want, so long as it's black.


The `PUDL <https://github.com/catalyst-cooperative/pudl>`__ project makes US
energy data free and open for all. For more information, see the PUDL repo and
`website <https://catalyst.coop/pudl/>`__.


This repo implements machine learning models which support PUDL. The types of
modelling performed here include record linkage between datasets and extraction
of structured data from unstructured documents. The outputs of these models
then feed into PUDL tables and are distributed in the PUDL data warehouse.


This repo is split into two main sections, with shared tooling implemented in
``src/mozilla_sec_eia/library`` and actual models implemented in
``src/mozilla_sec_eia/models``.

Models
^^^^^^
Each model is contained in its own Dagster
`code location <https://docs.dagster.io/concepts/code-locations>`__. This keeps
models isolated from each other, allowing fine-tuned dependency management, and
provides useful organization in the Dagster UI. To add a new model, you must
create a new Python module in the ``src/mozilla_sec_eia/models/`` directory.
This module should define a single Dagster ``Definitions`` object which can be
imported from the top-level of the module. For an example of how to structure a
code location, see ``src/mozilla_sec_eia/models/sec10k/``. After creating a new
model, it must be added to
`workspace.yaml <https://docs.dagster.io/concepts/code-locations/workspace-files>`__.

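
As a rough illustration of the layout described above, a model module might
expose its ``Definitions`` object as in the sketch below. The file path, asset
name, and returned data are hypothetical placeholders rather than code from
this repo; only ``asset`` and ``Definitions`` come from the ``dagster``
package, which is assumed to be installed.

.. code-block:: python

   # Hypothetical contents of
   # src/mozilla_sec_eia/models/your_cool_model/__init__.py
   from dagster import Definitions, asset


   @asset
   def example_model_output() -> list[str]:
       """Placeholder asset standing in for a real model output."""
       return ["record_a", "record_b"]


   # The single top-level Definitions object for this code location.
   defs = Definitions(assets=[example_model_output])

With a module shaped like this, Dagster should be able to find ``defs`` once
the module is listed in ``workspace.yaml``.
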

There are three types of
`dagster jobs <https://docs.dagster.io/concepts/assets/asset-jobs>`__ expected
in a model code location:

`mlflow <https://mlflow.org/docs/latest/tracking.html>`__ run backing them to
allow logging results to a tracking server.

There are helper functions in ``src/mozilla_sec_eia/library/model_jobs.py`` for
constructing each of these jobs. These functions help to ensure each job will
use the appropriate executor and supply the job with necessary resources.

Library
^^^^^^^
Generic shared tooling for ``pudl-models`` is defined in
``src/mozilla_sec_eia/library/``. This includes the helper functions for
constructing dagster jobs discussed above, as well as useful methods for
computing validation metrics, and an interface to our mlflow tracking server.

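
To give a sense of what a validation metric for record linkage could look
like, here is a minimal sketch in plain Python. The function name
``linkage_metrics`` and the example record IDs are hypothetical and are not
taken from the library; the actual helpers in this repo may be organized quite
differently.

.. code-block:: python

   def linkage_metrics(
       predicted: set[tuple[str, str]], actual: set[tuple[str, str]]
   ) -> dict[str, float]:
       """Compute precision, recall and F1 for predicted record links."""
       true_positives = len(predicted & actual)
       precision = true_positives / len(predicted) if predicted else 0.0
       recall = true_positives / len(actual) if actual else 0.0
       total = precision + recall
       f1 = 2 * precision * recall / total if total else 0.0
       return {"precision": precision, "recall": recall, "f1": f1}


   # Hypothetical (left ID, right ID) pairs from a model and a labeled set.
   predicted = {("sec_1", "eia_10"), ("sec_2", "eia_20"), ("sec_3", "eia_99")}
   actual = {("sec_1", "eia_10"), ("sec_2", "eia_20"), ("sec_4", "eia_40")}
   metrics = linkage_metrics(predicted, actual)

Treating links as a set of ID pairs keeps the comparison independent of record
order, which is one simple way to score a linkage model against labeled data.
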
MlFlow
""""""
We use a remote `mlflow tracking <https://mlflow.org/docs/latest/tracking.html>`__
server to aid in the development and management of ``pudl-models``. In the
``mlflow`` module, there are several dagster resources and IO-managers that can
be used in any model to provide a simple, seamless interface to the server.

.. TODO: Add mlflow resource/io-manager examples

To launch the dagster UI with all ``pudl-models`` loaded, run the command
``dagster dev`` in the top-level of this repo. This will load the file
``workspace.yaml``, which points to each model. You can also work on a single
model in isolation by running the command
``dagster dev -m mozilla_sec_eia.models.{your_cool_model}``.


`Catalyst Cooperative <https://catalyst.coop>`__ is a small group of data
wranglers and policy wonks organized as a worker-owned cooperative consultancy.
Our goal is a more just, livable, and sustainable world. We integrate public
data and perform custom analyses to inform public policy
(`Hire us! <https://catalyst.coop/hire-catalyst>`__). Our focus is primarily on
mitigating climate change and improving electric utility regulation in the
United States.


Contact Us
^^^^^^^^^^

- `GitHub Discussions <https://github.com/catalyst-cooperative/pudl/discussions>`__
- `Sign up for our email list <https://catalyst.coop/updates/>`__
- `Office Hours <https://calend.ly/catalyst-cooperative/pudl-office-hours>`__
- `@CatalystCoop <https://twitter.com/CatalystCoop>`__
- `pudl@catalyst.coop <mailto:pudl@catalyst.coop>`__