catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Generalize metadata classes for use in other Catalyst projects #1522

Open bendnorman opened 2 years ago

bendnorman commented 2 years ago

The classes in pudl.metadata.classes are great for controlling all things metadata in PUDL. I'd like to be able to use classes like Resource, Field, and Package in our other projects like DBCP, which currently uses pandera to store metadata. Unfortunately, pudl.metadata.classes imports a lot of PUDL-specific constants like FIELD_METADATA and RESOURCE_METADATA. This makes it difficult for DBCP to use the classes with new metadata.

This feels like a decently high priority so DBCP can leverage PUDL's metadata tools prior to adding new datasets and improving existing data.

There are a few submodules under pudl.metadata that contain PUDL metadata constants:

How can we redesign pudl.metadata.classes so other projects can use these classes?

bendnorman commented 2 years ago

Constants Usage

I think all of the constants are being referred to in dict_from_id() methods. Here is a list of where all of the constants are being used:

What data stored in these constants should be shared across projects? Contributors? Licenses? What about field and code metadata? This kind of gets at the type of client work we take on. If a client data integration is related to PUDL, it should just be included in PUDL. If the data sources are completely separate from PUDL (like in DBCP), it makes less sense to share field and resource metadata between the projects.

Ideas for generalizing

Class Variables

I'm still leaning towards making these constants class variables for the metadata classes. Other projects can import the classes, subclass them and add new constants as class variables to the subclasses. Something like this:

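import abc
import copy
from typing import List

# Base, String, and HttpUrl are assumed to be the ones already available in pudl.metadata.classes.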

class License(Base, abc.ABC):
    """
    Data license (`package|resource.licenses[...]`).

    See https://specs.frictionlessdata.io/data-package/#licenses.
    """

    name: String
    title: String
    path: HttpUrl

    @property
    @abc.abstractmethod
    def LICENSES(cls) -> List:  # noqa: N805
        """Abstract LICENSES class variable."""
        return cls.LICENSES

    @classmethod
    def dict_from_id(cls, x: str) -> dict:
        """Construct dictionary from PUDL identifier."""
        return copy.deepcopy(cls.LICENSES[x])

    @classmethod
    def from_id(cls, x: str) -> "License":
        """Construct from PUDL identifier."""
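        return cls(**cls.dict_from_id(x))

# In PUDL (a separate module), subclass and bind the PUDL-specific constants:
from typing import ClassVar

import pudl
from pudl.metadata.constants import LICENSES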

class PUDLLicense(License):
    LICENSES: ClassVar = LICENSES

PUDLLicense.from_id("cc-by-4.0")

Storing the constants as class variables makes sense to me. The LICENSES class variable defines all of the licenses PUDLLicense.from_id() can access.

How do we ensure all of the metadata.classes subclasses are referring to the same constants? For example, there wouldn't be anything stopping you from subclassing DataSource and Resource with different SOURCE constants.
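One possible answer, as a rough sketch only: bundle each project's constants into a single object and create all of the subclasses in one call, so they cannot end up pointing at different constants. Everything here is hypothetical and simplified: MetadataConstants, MetadataBase, bind(), and the example_source entry are made-up names, and plain classes stand in for the real pydantic-based classes in pudl.metadata.classes.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, ClassVar, Dict, Optional, Tuple, Type


@dataclass(frozen=True)
class MetadataConstants:
    """Hypothetical bundle holding all of one project's metadata constants."""

    licenses: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    sources: Dict[str, Dict[str, Any]] = field(default_factory=dict)


class MetadataBase:
    """Simplified stand-in for Base (the real class is a pydantic model)."""

    # Unbound by default; project-specific subclasses set this.
    CONSTANTS: ClassVar[Optional[MetadataConstants]] = None

    def __init__(self, **kwargs: Any) -> None:
        self.__dict__.update(kwargs)

    @classmethod
    def constants(cls) -> MetadataConstants:
        """Return the bound constants, or fail loudly if none are bound."""
        if cls.CONSTANTS is None:
            raise NotImplementedError(f"{cls.__name__} has no constants bound.")
        return cls.CONSTANTS


class License(MetadataBase):
    @classmethod
    def from_id(cls, x: str) -> "License":
        return cls(**copy.deepcopy(cls.constants().licenses[x]))


class DataSource(MetadataBase):
    @classmethod
    def from_id(cls, x: str) -> "DataSource":
        return cls(**copy.deepcopy(cls.constants().sources[x]))


def bind(
    constants: MetadataConstants, *classes: Type[MetadataBase]
) -> Tuple[Type[MetadataBase], ...]:
    """Subclass every class in one call so they all share one bundle."""
    return tuple(
        type(f"Bound{c.__name__}", (c,), {"CONSTANTS": constants}) for c in classes
    )


# Illustrative constants for another project such as DBCP.
DBCP_CONSTANTS = MetadataConstants(
    licenses={
        "cc-by-4.0": {
            "name": "CC-BY-4.0",
            "title": "Creative Commons Attribution 4.0",
            "path": "https://creativecommons.org/licenses/by/4.0",
        }
    },
    sources={"example_source": {"title": "Example source", "license_id": "cc-by-4.0"}},
)

DBCPLicense, DBCPDataSource = bind(DBCP_CONSTANTS, License, DataSource)
print(DBCPLicense.from_id("cc-by-4.0").name)
```

This doesn't stop anyone from subclassing by hand with mismatched constants, but it makes the consistent path the easy one, and the unbound base classes raise instead of silently using PUDL's data.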

Move all of the data to Package

All of our metadata classes are supposed to work together. For example, the Schema class has a foreign_key attribute but the foreign key relationships are enforced in the Package class.

Maybe the Package class should store the constants, since it is the highest-level abstraction? I don't know exactly what this would look like, but the constants could be stored in the Package class and passed down to the lower-level abstractions like Resource and Field.
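To make the idea a bit more concrete, here is a very rough sketch of a Package that owns the metadata dictionaries and builds Resource and Field objects from them. This is not how pudl.metadata.classes works today: the dataclasses are simplified stand-ins for the real pydantic models, field_from_id() and resource_from_id() are hypothetical method names, and the flat {"fields": [...]} layout and the example field names are assumptions for illustration.

```python
import copy
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Field:
    name: str
    type: str
    description: str = ""


@dataclass
class Resource:
    name: str
    fields: List[Field]


@dataclass
class Package:
    """Owns the constants and hands them down to Resource and Field."""

    name: str
    field_metadata: Dict[str, Dict[str, Any]]
    resource_metadata: Dict[str, Dict[str, Any]]

    def field_from_id(self, x: str) -> Field:
        """Build a Field from this package's own field metadata."""
        return Field(name=x, **copy.deepcopy(self.field_metadata[x]))

    def resource_from_id(self, x: str) -> Resource:
        """Build a Resource whose fields come from the same package."""
        spec = self.resource_metadata[x]
        return Resource(
            name=x, fields=[self.field_from_id(f) for f in spec["fields"]]
        )


# Illustrative metadata for a non-PUDL project.
dbcp = Package(
    name="dbcp",
    field_metadata={
        "project_id": {"type": "integer"},
        "capacity_mw": {"type": "number", "description": "Capacity in MW."},
    },
    resource_metadata={"projects": {"fields": ["project_id", "capacity_mw"]}},
)

projects = dbcp.resource_from_id("projects")
print([f.name for f in projects.fields])
```

Because every lookup goes through one Package instance, a Resource and its Fields can't disagree about which constants they came from. The tradeoff is that you can no longer build a Field or Resource from an ID without having a Package in hand.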

IDK, am I overthinking this? Do you have any ideas @zaneselvans?