catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Update from `frictionless>=4.4,<5` to `frictionless>=5,<6` across the board #3293

Open jdangerx opened 5 months ago

jdangerx commented 5 months ago

Once #1420 is closed, we will be using one version of frictionless across the board.

Unfortunately, that version will be 4.40.8, which is missing critical SQL-describing functionality.

We should upgrade to version 5! Migration guide is here.

Fortunately, our existing usage of the frictionless libraries is very limited - just checking if it can parse a datapackage.json and then checking to see if the metadata is valid:

ferc-xbrl-extractor/src/ferc-xbrl-extractor/xbrl.py

    if datapackage_path:
        # Verify that datapackage descriptor is valid before outputting
        frictionless_package = Package(descriptor=datapackage.dict(by_alias=True))
        if not frictionless_package.metadata_valid:
            raise RuntimeError(
                f"Generated datapackage is invalid - {frictionless_package.metadata_errors}"
            )

pudl/src/pudl/workspace/datastore.py

        dp = frictionless.Package(datapackage_json)
        if not dp.metadata_validate():

Which is pretty easy to change to the frictionless 5 Package.validate_descriptor() method.

Unfortunately, this is partly an artifact of us making mirror classes that look sort of like the frictionless classes, in pudl-archiver, ferc-xbrl-extractor, and pudl. These classes sort of replicate the frictionless classes, but also include lots of custom logic for our own purposes.

My proposal is:

  1. rename our Fake Frictionless classes to PudlResource, XbrlResource, and ArchiverResource, respectively (plus the Package/Schema/etc. equivalents)
  2. add to_frictionless methods that allow us to convert these to their frictionless counterparts
  3. Use real frictionless classes to serialize datapackage.jsons that are valid, in ferc-xbrl-extractor/pudl-archiver
  4. Create new, valid archives if necessary
  5. Use real frictionless classes to validate the datapackages in pudl/workspace/datastore.py

My guess for this work is ~20h.

Note - since our classes have required functionality that frictionless classes don't have, we should not implement from_frictionless methods. For example, every XbrlResource has to respond to get_fact_tables(), but frictionless has no such notion. So then from_frictionless can only ever make some half-functional XbrlResource - so we shouldn't do that at all.

bendnorman commented 1 week ago

I'm new to this issue but why not have our Resources be subclasses of the frictionless framework classes? I think we'll have 3 types of extensions we'd like to make:

  1. extensions to frictionless classes that make sense to add directly to the frictionless-py package
  2. extensions that don't fit into ^ but we'd like to use in all of our projects. For example, we want to store additional metadata for all of our projects that don't fit into the frictionless specification. Another example would be if we want a method to convert a frictionless schema to a pandera schema but frictionless isn't interested in providing that functionality. To handle these types of extensions we'd probably need a Catalyst specific package that extends the frictionless tools.
  3. extensions that are project specific (XBRL, PUDL, client projects...) For this we can just extend the catalyst or frictionless package.
jdangerx commented 1 week ago

I think it's probably better to have our Resource contain references to frictionless classes - roughly corresponding to the idea of preferring "composition over inheritance."

The benefit of this, to me, is that this reduces the coupling between our PudlResource interface and the implementation of the frictionless classes it depends on. You'll only need to know the public frictionless APIs instead of the implementation details. But we would still be able to use the frictionless code to reduce duplication as necessary.