catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl

Update PUDL to use Pydantic 2 #2842

Closed · zaneselvans closed this issue 7 months ago

zaneselvans commented 9 months ago

We can't switch over to using Pydantic 2 until Dagster supports it, but I wanted a place to collect tasks for when it's supported.

- [x] Run the [`bump-pydantic`](https://github.com/pydantic/bump-pydantic) tool.
- [x] Fix doctest failure in `pudl.metadata.classes._validator` or obviate the need for this custom validator getter.
- [x] `pytest test/unit/transform/classes_test.py::test_multicol_transform_factory` gets dtype mismatches
- [x] Fix or work around failing `PudlPaths()` validations when input/output dirs are None
- [x] Replace all instances of `model.dict()` with `model.model_dump()`
- [x] Replace all instances of the within-model `Config` classes with the `model_config = ConfigDict(...)` attribute (see the sketch at the end of this comment).
- [x] Remove unnecessary `fields_` and `schema_` aliases b/c v2 uses `model_fields` and `model_json_schema()`.
- [x] Use pydantic strict types directly instead of aliasing in `pudl.metadata.classes`
- [x] Replace all instances of `update_forward_refs` with `model_rebuild()`
- [x] Use `re.Pattern` and `Annotated` types for the simplest models in `pudl.metadata.classes`
- [x] Replace use of `__fields_set__` with `model_fields_set`
- [x] Replace use of `__fields__` with `model_fields`
- [x] Replace all instances of `@validator` with `@field_validator`
- [x] Suppress warning about shadowing deprecated `BaseModel.schema` attribute
- [x] Fix `calc_cols` + `parent_cols` validation errors in exploded tables
- [x] Replace all instances of `@root_validator` with `@model_validator`
- [x] Add `extra`, `validate_all`, and `validate_assignment` back to metadata class configs.
- [x] Update release notes.
### Issues to Create
- [ ] Use simple `Annotated` types to validate single-element FERC 1 `TransformParams`.
- [x] Update `ferc-xbrl-extractor` to use Pydantic v2
- [x] Update `pudl-archiver` to use Pydantic v2
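
For reference, a minimal before/after sketch of the most common v1 → v2 patterns from the checklist above (an illustrative model, not actual PUDL code):

```python
from pydantic import BaseModel, ConfigDict, field_validator

class Source(BaseModel):
    # v1: a nested `class Config:` with `extra = "forbid"` etc.
    model_config = ConfigDict(extra="forbid", validate_assignment=True)

    name: str

    # v1: @validator("name")
    @field_validator("name")
    @classmethod
    def name_not_empty(cls, name: str) -> str:
        if not name:
            raise ValueError("name must be non-empty")
        return name

src = Source(name="eia860")
src.model_dump()      # v1: src.dict()
Source.model_fields   # v1: Source.__fields__
```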
zaneselvans commented 7 months ago

I'm currently getting an error in the conversion of our settings classes into Dagster configs, I think because the taxonomy URL is part of our settings. But that element is getting a type of `Any` in the conversion right now, so I'm not sure why Dagster still finds it to be a URL:

```
TypeError: Object of type Url is not JSON serializable
```

I was trying to look at converting our settings classes over to being Dagster Configs directly, but it seems like there are only two high-level Dagster Configs that we're constructing: one for all of the FERC to SQLite conversions, and one for all the rest of the datasets. Is that what we want? Would that mean we need to consolidate all of our existing ETL settings into a single class?

zaneselvans commented 7 months ago

It seems like we just have to make all of the `AnyHttpUrl` attributes into strings to be compatible with the Dagster configs.
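
Something like this pattern seems to work (a sketch; the class and field names here are hypothetical, just to show the idea of validating the URL but storing a plain string):

```python
from pydantic import AnyHttpUrl, BaseModel, TypeAdapter, field_validator

_URL_ADAPTER = TypeAdapter(AnyHttpUrl)

class XbrlSettings(BaseModel):  # hypothetical stand-in for our settings class
    # was `taxonomy: AnyHttpUrl`, but pydantic 2 stores a Url object,
    # which Dagster's config conversion can't serialize to JSON
    taxonomy: str

    @field_validator("taxonomy")
    @classmethod
    def check_taxonomy_url(cls, taxonomy: str) -> str:
        _URL_ADAPTER.validate_python(taxonomy)  # raises if not a valid URL
        return taxonomy  # keep the plain string for Dagster's sake
```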

bendnorman commented 7 months ago

The current setup I created definitely feels awkward and isn't really usable with Pydantic 2.0 and the new dagster `Config` based on pydantic. I made a `DatasetsSettings` object as a Resource so any asset can access it. I also wrote the `create_dagster_config()` function to convert the `DatasetsSettings` class into a legacy dagster run configuration so we could configure the `dataset_settings` resource at run time. While a bit janky, it felt like the only way to translate our existing ETL configuration into dagster world.

I think it makes sense to convert our pydantic settings classes directly into dagster configs now that dagster configs can be pydantic classes. There's no need to pass legacy configuration to a pydantic resource to do validation that dagster's legacy config system didn't support.

I think we can do this using nested dagster configurations. Something like this:


```python
from dagster import asset, Config

# EiaSettings / Eia860Settings sketched in here so the example resolves;
# the real settings classes have more fields than shown.
class Eia860Settings(Config):
    years: list[int]

class EiaSettings(Config):
    eia860: Eia860Settings

class EpaCemsSettings(Config):
    years: list[int]
    states: list[str]

class DatasetsSettings(Config):
    eia: EiaSettings
    epacems: EpaCemsSettings
    ...

@asset
def extract_eia860(config: DatasetsSettings):
    years = config.eia.eia860.years
    ...
```
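
If we went this route, run-time configuration could presumably be passed in via `RunConfig`, keyed by asset name (hypothetical values, using the classes sketched above):

```python
from dagster import RunConfig, materialize

result = materialize(
    [extract_eia860],
    run_config=RunConfig(
        ops={
            "extract_eia860": DatasetsSettings(
                eia=EiaSettings(eia860=Eia860Settings(years=[2020, 2021])),
                epacems=EpaCemsSettings(years=[2020, 2021], states=["ID", "CO"]),
            )
        }
    ),
)
```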

I wonder if it still makes sense to have all of our dataset configs live in one `DatasetsSettings` class so that assets can access all dataset configurations. I think we did this originally because there was some existing functionality that expected all of the data source config. I'll take a look.

bendnorman commented 7 months ago

Ah, but if we don't keep a configurable Resource around, the syntax for configuring the assets gets weird:

```yaml
assets:
  extract_eia860:
    config:
      years: [2020, 2021]
  glue:
    config:
      years: [2020, 2021]
  ...
  # every asset that depends on dataset configuration
```
bendnorman commented 7 months ago

Ok, I finally got through all of the new `ConfigurableResource` and new `Config` documentation. I also went through all of our assets that depend on the `dataset_settings` resource. Most of the assets depend on config specific to the data source they're processing. For example, `extract_ferc714` uses `dataset_settings.ferc714` to figure out which years to process. However, there is configuration that needs to be shared among multiple assets: for example, `dataset_settings.eia.eia860.eia860m` is referenced in multiple assets, so asset-specific configuration doesn't make sense for it.

I think we should keep all of our dataset configurations in a single resource. It seems like we can create nested `ConfigurableResource`s. I was able to create a toy example of what this might look like in a dagster question I posted, which I quickly answered myself 🤦
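
Roughly, the nested resources look something like this (a quick sketch; class names are illustrative and the real settings have many more fields):

```python
from dagster import ConfigurableResource, Definitions, asset

class Eia860Settings(ConfigurableResource):
    years: list[int]

class EiaSettings(ConfigurableResource):
    eia860: Eia860Settings

class DatasetsSettings(ConfigurableResource):
    eia: EiaSettings

@asset
def extract_eia860(dataset_settings: DatasetsSettings) -> list[int]:
    # any asset can pull shared settings off the resource
    return dataset_settings.eia.eia860.years

defs = Definitions(
    assets=[extract_eia860],
    resources={
        "dataset_settings": DatasetsSettings(
            eia=EiaSettings(eia860=Eia860Settings(years=[2020, 2021]))
        )
    },
)
```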

I need to play around with this a bit more but I must eat dinner now :)

zaneselvans commented 7 months ago

Okay, so... a `Config` is a per-asset setting, while a Resource or `ConfigurableResource` is globally accessible through the context? And because we have a few cases in which there's ETL settings information that pertains to one dataset but needs to be accessed by multiple assets, we need to use a `ConfigurableResource`?

Do we actually want these to be configurable at run time in Launchpad? Or should they always be read from the settings YAML files?

I've found that when I touch configurations in launchpad, the changes stick around and get messed up, with extra copies of individual years and other weirdness. Don't we pretty much always want to run the DAG based on either the fast or the full ETL settings? Having a GUI that each of us could have twiddled and thus end up running totally different configurations that aren't linked back to what's in the repo seems like a recipe for confusion.