catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Simplify bundle and datapackage organization #900

Closed rousik closed 2 years ago

rousik commented 3 years ago

Right now, the ETL is configured to run a bundle that consists of one or more datapackages, each of which contains one or more datasets. This allows cross-dataset dependencies (e.g. epacems needs to load a specific table from eia) and could potentially allow us to run a series of different configurations within a single bundle.

While this gives us a lot of flexibility, it is unclear whether that flexibility is needed, and it also increases code complexity.

Examples of problems with the current setup:

I propose a simpler layout where each dataset can be configured at most once in the settings file, and the pipeline will ensure that all dependencies are satisfied (e.g. when running epacems, we also need to run eia). Presumably we will generate a single datapackage.json per dataset (with the exception of epacems, which does not generate datapackages), and we should load all datapackages into the pudl database as part of the ETL run.
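To make the proposal concrete, here is a minimal sketch of what such a settings file might look like; the dataset names are real, but the key layout is an illustrative assumption rather than the current PUDL settings schema:

datasets:
  eia860:
    years: [2018, 2019]
  eia923:
    years: [2018, 2019]
  epacems:
    years: [2019]
    states: [CO]
# configuring epacems implies running the eia datasets it depends on;
# each dataset would emit at most one datapackage.json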

rousik commented 3 years ago

Currently we support custom naming of datapackages, e.g. we store them under datapkg/${bundle_name}/${datapkg_bundle_settings.name}. Perhaps this can be simplified so that we generate datapkg/${bundle_name}/${dataset}.
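For concreteness, assuming a hypothetical bundle named foo-bar that includes ferc1, the two layouts would look roughly like this:

# current:  datapkg/foo-bar/<name chosen in datapkg_bundle_settings>
# proposed: datapkg/foo-bar/ferc1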

rousik commented 3 years ago

And, from a configuration layout standpoint, it might also make sense to move the ferc1_to_sqlite settings under the ferc1 dataset configuration. While this does not strictly tell us how to transform the data, it does control the extraction aspect in the same way years/tables control other datasets.
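A rough sketch of what that nesting could look like; the sub-keys under ferc1_to_sqlite are placeholders for illustration rather than the exact existing option names:

datasets:
  - ferc1:
      years: [2019]
      tables: [fuel_ferc1]
      ferc1_to_sqlite:    # extraction settings kept alongside the dataset they control
        refyear: 2019     # placeholder keys, for illustration only
        tables: [f1_fuel]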

rousik commented 3 years ago

A couple more thoughts. While it makes perfect sense to configure each dataset once per ETL run, there's still an open question about how to construct datapackages, i.e. which datasets should be combined into a single datapackage and how that datapackage should be named (e.g. right now ferc1 is standalone, while eia923 and eia860 (plus perhaps some others) are processed with entity extraction and merged into a single datapackage).

rousik commented 3 years ago

While we are at it, perhaps it might make sense to indicate which of the datasets/tables should also be loaded into the pudl database (e.g. epacems currently should not be loaded into the pudl database because it is published to parquet instead).
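A minimal sketch of how such a flag might look, using a hypothetical load_into_pudl_db key (the same idea appears in the Option 1 example further down):

datasets:
  - epacems:
      load_into_pudl_db: false  # hypothetical flag; epacems is written out as parquet instead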

zaneselvans commented 3 years ago

If we are going to have additional datapackage-based releases, I think it would totally make sense to specify each dataset only once, with a single set of settings, and then separately (in the same settings file) describe how those are supposed to be organized into different datapackages.

But since we're talking about skipping the datapackage output entirely, and having the ETL directly populate a database and a collection of parquet files, it might make more sense to talk about how we want to specify an ETL run with those outputs in mind.

rousik commented 3 years ago

The way I like to think about these things is that there are two distinct parts that come together:

  1. what should be processed and how (dataset configuration)
  2. where the outputs should be placed (gcs, local disk)

A single data configuration (say, the standard data release) should specify the first part, while the environment in which we run the ETL (on a local machine, as part of a GitHub action, or in the cloud as part of a data release) should determine how and where the results of the pipeline are stored.

There are certain gray areas, such as defining which datasets should be dumped into sqlite and which should be turned into parquet files. Right now this is more or less hard-coded in the codebase: epacems gets stored as parquet, and the rest (but not the eia part of epacems) are put into datapackages (the eiaNNN datasets are combined into a single big datapackage) that are then merged and dumped into sqlite.
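One way to picture that separation, sketched with assumed key names (outputs, sqlite_path, parquet_dir are illustrative, not existing PUDL settings); the first block would live in the data release configuration, while the second would vary by environment:

# part 1: what should be processed and how (shared across environments)
datasets:
  - ferc1:
      years: [2019]
  - epacems:
      states: [CO]

# part 2: where the outputs land (differs between local, GitHub action, cloud)
outputs:
  sqlite_path: ~/pudl-work/pudl.sqlite      # or a cloud destination for a data release
  parquet_dir: ~/pudl-work/parquet/epacems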

rousik commented 3 years ago

If we want to abandon datapackages and load things directly into the database, that might be quite a bit simpler, assuming that the dataframes can be uploaded to SQL directly (I'm assuming we will need to combine the raw data with a schema of some sort).

rousik commented 3 years ago

Right now I'm trying to come up with an interim solution that would retain the current datapackage creation behavior. There are two options:

  1. each dataset can optionally emit a datapackage containing its tables
  2. datapackages are a separate top-level entity, and each one lists which datasets it should contain

(2) is more flexible as it allows combining more than one dataset into a package, but apart from epacems this doesn't seem to actually be in use (and epacems no longer does this either, because it emits parquet files directly), so I'm leaning towards (1).

Here's a sample settings.yml fragment:

Option 1:

datapkg_bundle_name: foo-bar
datasets:
  - ferc1:
      outputs:
        load_into_pudl_db: true  # should these tables go into pudl.sqlite?
        datapackage:
          name: foo-bar-ferc1
          title: ... 
          description: ...
          version: ...

Option 2:

datapkg_bundle_name: foo-bar
datasets:
  - ferc1:
      ...
datapackages:
  - name: foo-bar-ferc1
    title: ...
    description: ...
    version: ...
    datasets:
      - ferc1

Option 2 is more complex and also complicates CSV writing: when a single dataset is part of more than one datapackage, do we write it to disk twice?

zaneselvans commented 2 years ago

Given that we are moving to direct output of SQLite/DB + Parquet datasets, I think this specific issue no longer applies, though we'll still need to talk about how to specify an ETL run.