catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Simplify bundle and datapackage organization #900

Closed rousik closed 2 years ago

rousik commented 3 years ago

Right now, the ETL is configured to run a bundle that consists of one or more datapackages, each of which contains one or more datasets. This allows cross-dataset dependencies (e.g. epacems needs to load a specific table from eia) and could potentially allow us to run a series of different configurations within a single bundle.

While this gives us a lot of flexibility, it is unclear whether that flexibility is needed, and it also increases code complexity.

Examples of problems with the current setup:

I propose a simpler layout where each dataset can be configured at most once in the settings file, and the pipeline will ensure that all dependencies are satisfied (e.g. when running epacems, we also need to run eia). Presumably we will generate a single datapackage.json per dataset (with the exception of epacems, which does not generate datapackages), and we should load all datapackages into the pudl database as part of the ETL run.
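To make the proposal concrete, here is a minimal sketch of what such a settings file might look like; the dataset names are real, but the key layout is an illustrative assumption rather than the current PUDL settings schema:

datasets:
  eia860:
    years: [2018, 2019]
  eia923:
    years: [2018, 2019]
  epacems:
    years: [2019]
    states: [CO]
# configuring epacems implies running the eia datasets it depends on;
# each dataset would emit at most one datapackage.json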

rousik commented 3 years ago

Currently we support custom naming of datapackages, e.g. we store them under datapkg/${bundle_name}/${datapkg_bundle_settings.name}. Perhaps this can be simplified so that we generate datapkg/${bundle_name}/${dataset}.
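For concreteness, assuming a hypothetical bundle named foo-bar that includes ferc1, the two layouts would look roughly like this:

# current:  datapkg/foo-bar/<name chosen in datapkg_bundle_settings>
# proposed: datapkg/foo-bar/ferc1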

rousik commented 3 years ago

And, from a configuration layout standpoint, it might also make sense to move the ferc1_to_sqlite settings under the ferc1 dataset configuration. While this does not strictly tell us how to transform the data, it does control the extraction aspect in the same way years/tables control other datasets.
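A rough sketch of what that nesting could look like; the sub-keys under ferc1_to_sqlite are placeholders for illustration rather than the exact existing option names:

datasets:
  - ferc1:
      years: [2019]
      tables: [fuel_ferc1]
      ferc1_to_sqlite:    # extraction settings kept alongside the dataset they control
        refyear: 2019     # placeholder keys, for illustration only
        tables: [f1_fuel]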

rousik commented 3 years ago

A couple more thoughts. While it makes perfect sense to configure each dataset once per ETL run, there's still an open question about how to construct datapackages, i.e. which datasets should be combined into a single datapackage and how that datapackage should be named (e.g. right now ferc1 is standalone, while eia923 and eia860 (plus perhaps some others) are processed with entity extraction and merged into a single datapackage).

rousik commented 3 years ago

While we are at it, perhaps it might make sense to indicate which of the datasets/tables should also be loaded into the pudl database (e.g. epacems currently should not be loaded into the pudl database because it is published to parquet instead).
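A minimal sketch of how such a flag might look, using a hypothetical load_into_pudl_db key (the same idea appears in the Option 1 example further down):

datasets:
  - epacems:
      load_into_pudl_db: false  # hypothetical flag; epacems is written out as parquet instead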

zaneselvans commented 3 years ago

If we are going to have additional datapackage-based releases, I think it would totally make sense to specify each dataset only once, with a single set of settings, and then separately (in the same settings file) describe how those are supposed to be organized into different datapackages.

But since we're talking about skipping the datapackage output entirely, and having the ETL directly populate a database and a collection of parquet files, it might make more sense to talk about how we want to specify an ETL run with those outputs in mind.

rousik commented 3 years ago

The way I like to think about these things is that there are two distinct parts that come together:

  1. what should be processed and how (dataset configuration)
  2. where the outputs should be placed (gcs, local disk)

A single data configuration (say, the standard data release) should specify the first part, while the environment in which we run the ETL (on a local machine, as part of a GitHub action, or in the cloud as part of a data release) should determine how and where the results of the pipeline are stored.

There are certain gray areas, such as defining which datasets should be dumped into sqlite and which should be turned into parquet files. Right now this is more or less hard-coded in the codebase: epacems gets stored as parquet, and the rest (but not the eia part of epacems) are put into datapackages (the eiaNNN datasets are combined into a single big datapackage) that are then merged and dumped into sqlite.
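One way to picture that separation, sketched with assumed key names (outputs, sqlite_path, parquet_dir are illustrative, not existing PUDL settings); the first block would live in the data release configuration, while the second would vary by environment:

# part 1: what should be processed and how (shared across environments)
datasets:
  - ferc1:
      years: [2019]
  - epacems:
      states: [CO]

# part 2: where the outputs land (differs between local, GitHub action, cloud)
outputs:
  sqlite_path: ~/pudl-work/pudl.sqlite      # or a cloud destination for a data release
  parquet_dir: ~/pudl-work/parquet/epacems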

rousik commented 3 years ago

If we want to abandon datapackages and load things directly into the database, that might be quite a bit simpler, assuming that the dataframes can be uploaded to SQL directly (I'm assuming we will need to combine the raw data with a schema of some sort).

rousik commented 3 years ago

Right now I'm trying to come up with an interim solution that would retain the current datapackage creation behavior. There are two options:

  1. each dataset can optionally emit a datapackage containing its tables
  2. datapackages are a separate top-level entity, and each one lists which datasets it should contain

(2) is more flexible as it allows combining more than one dataset into a package, but apart from epacems this doesn't seem to actually be in use (and epacems no longer does this either, because it emits parquet files directly), so I'm leaning towards (1).

Here's a sample settings.yml fragment:

Option 1:

datapkg_bundle_name: foo-bar
datasets:
  - ferc1:
      outputs:
        load_into_pudl_db: true  # should these tables go into pudl.sqlite?
        datapackage:
          name: foo-bar-ferc1
          title: ... 
          description: ...
          version: ...

Option 2:

datapkg_bundle_name: foo-bar
datasets:
  - ferc1:
      ...
datapackages:
  - name: foo-bar-ferc1
    title: ...
    description: ...
    version: ...
    datasets:
      - ferc1

Option 2 is more complex and also complicates CSV writing: when a single dataset is part of more than one datapackage, do we write it to disk twice?

zaneselvans commented 2 years ago

Given that we are moving to direct output of SQLite/DB + Parquet datasets, I think this specific issue no longer applies, though we'll still need to talk about how to specify an ETL run.