Generalize metadata classes for use in other Catalyst projects

The classes in pudl.metadata.classes are great for controlling all things metadata in PUDL. I'd like to be able to use classes like Resource, Field, and Package in our other projects like DBCP, which currently uses pandera to store metadata. Unfortunately, pudl.metadata.classes imports a lot of PUDL-specific constants like FIELD_METADATA and RESOURCE_METADATA, which makes it difficult for DBCP to use the classes with its own metadata.

This feels like a decently high priority so DBCP can leverage PUDL's metadata tools prior to adding new datasets and improving existing data.

There are a few submodules under pudl.metadata that contain PUDL metadata constants:

constants (these are less PUDL specific)
codes
fields
resources
sources

How can we redesign pudl.metadata.classes so other projects can use these classes?

We could make constants like RESOURCE_METADATA an instance variable for classes like Resource and DataSource. This would be very flexible but would require folks to specify the resource_metadata attribute for every Resource.
We could make the constants class variables. Folks could subclass the pudl.metadata.classes classes and specify class variables like RESOURCE_METADATA or FIELD_METADATA. This way folks wouldn't need to specify the metadata constants for every instance. Either way, treating these constants as instance or class variables would not enforce the existing metadata project structure we use.
... still a WIP
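
To make the trade-off between the first two options concrete, here is a minimal sketch. It does not use the real pudl.metadata.classes; InstanceResource, ClassResource, DBCPResource, and the toy DBCP_RESOURCES dict are all hypothetical stand-ins.

```python
import copy
from typing import ClassVar


class InstanceResource:
    """Option 1 (hypothetical): the metadata dict is passed to every instance."""

    def __init__(self, name: str, resource_metadata: dict):
        self.name = name
        self.resource_metadata = resource_metadata

    def to_dict(self) -> dict:
        return copy.deepcopy(self.resource_metadata[self.name])


class ClassResource:
    """Option 2 (hypothetical): the metadata dict lives on the class."""

    RESOURCE_METADATA: ClassVar[dict] = {}

    def __init__(self, name: str):
        self.name = name

    def to_dict(self) -> dict:
        return copy.deepcopy(self.RESOURCE_METADATA[self.name])


# Toy stand-in for another project's metadata, not real DBCP data.
DBCP_RESOURCES = {"projects": {"description": "Toy resource."}}


class DBCPResource(ClassResource):
    # Bound once per project instead of once per instance.
    RESOURCE_METADATA = DBCP_RESOURCES


option_1 = InstanceResource("projects", resource_metadata=DBCP_RESOURCES).to_dict()
option_2 = DBCPResource("projects").to_dict()
```

Both calls return the same dictionary; the difference is only where the caller has to supply the constants.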

Constants Usage

I think all of the constants are being referred to in dict_from_id() methods. Here is a list of where all of the constants are being used:

Resource.dict_from_id()
- RESOURCE_METADATA
- FIELD_METADATA_BY_GROUP
- FIELD_METADATA_BY_RESOURCE
- SOURCES
- FOREIGN_KEYS
Encoder.dict_from_id()
- RESOURCE_METADATA
Encoder.from_code_id()
- CODE_METADATA
Encoder.to_rst()
- JINJA_ENVIRONMENT
- RESOURCE_METADATA
Field.dict_from_id()
- FIELD_METADATA
License.dict_from_id()
- LICENSES: Will licenses be shared across projects?
Contributor.dict_from_id()
- CONTRIBUTORS: Will contributors be shared across projects?
DataSource.dict_from_id()
- SOURCES
DataSource.from_field_namespace()
- SOURCES

What data stored in these constants should be shared across projects? Contributors? Licenses? What about field and code metadata? This kind of gets at the type of client work we take on. If a client data integration is related to PUDL, it should just be included in PUDL. If the data sources are completely separate from PUDL (like in DBCP), it makes less sense to share field and resource metadata between the projects.

Ideas for generalizing

Class Variables

I'm still leaning towards making these constants class variables for the metadata classes. Other projects can import the classes, subclass them and add new constants as class variables to the subclasses. Something like this:

import abc
import copy
from typing import ClassVar, List

class License(Base, abc.ABC):
    """
    Data license (`package|resource.licenses[...]`).

    See https://specs.frictionlessdata.io/data-package/#licenses.
    """

    name: String
    title: String
    path: HttpUrl

    @property
    @abc.abstractmethod
    def LICENSES(cls) -> List:  # noqa: N805
        """Abstract LICENSES class variable."""
        return cls.LICENSES

    @classmethod
    def dict_from_id(cls, x: str) -> dict:
        """Construct dictionary from PUDL identifier."""
        return copy.deepcopy(cls.LICENSES[x])

    @classmethod
    def from_id(cls, x: str) -> "License":
        """Construct from PUDL identifier."""
        return cls(**cls.dict_from_id(x))

# In PUDL (or any other project), bind the project's own constants:
from pudl.metadata.constants import LICENSES

class PUDLLicense(License):
    LICENSES: ClassVar = LICENSES

PUDLLicense.from_id("cc-by-4.0")

Storing the constants as class variables makes sense to me. The LICENSES class variable defines all of the licenses PUDLLicense.from_id() can access.

How do we ensure all of the metadata.classes subclasses are referring to the same constants? For example, there wouldn't be anything stopping you from subclassing DataSource and Resource with different SOURCES constants.
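
One possible guard, sketched here with plain stand-in classes rather than the real pydantic models: compare the constants by identity across a project's subclasses. check_shared_constant and all of the class and dict names below are hypothetical.

```python
from typing import ClassVar


class DataSource:
    SOURCES: ClassVar[dict] = {}


class Resource:
    SOURCES: ClassVar[dict] = {}


def check_shared_constant(name: str, *classes: type) -> None:
    """Raise if the classes do not all point at the same object for `name`."""
    first = getattr(classes[0], name)
    mismatched = [c.__name__ for c in classes[1:] if getattr(c, name) is not first]
    if mismatched:
        raise ValueError(f"{name} differs from {classes[0].__name__} in: {mismatched}")


# Toy constants, not real PUDL metadata.
PROJECT_SOURCES = {"eia860": {"title": "Toy EIA-860 entry"}}
OTHER_SOURCES = {"ferc1": {"title": "Toy FERC-1 entry"}}


class ProjectDataSource(DataSource):
    SOURCES = PROJECT_SOURCES


class ProjectResource(Resource):
    SOURCES = PROJECT_SOURCES


class StrayResource(Resource):
    SOURCES = OTHER_SOURCES


# Consistent subclasses pass silently.
check_shared_constant("SOURCES", ProjectDataSource, ProjectResource)

# A subclass bound to a different dict is caught.
try:
    check_shared_constant("SOURCES", ProjectDataSource, StrayResource)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

This only catches the problem if someone remembers to call the check (for example in a test), so it is a convention rather than real enforcement.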

Move all of the data to Package

All of our metadata classes are supposed to work together. For example, the Schema class has a foreign_key attribute but the foreign key relationships are enforced in the Package class.

Maybe the Package class should store the constants given it is the highest level abstraction? I don't know exactly what this would look like but the constants could be stored in the Package class and passed down to the lower abstractions like Resource and Field?
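
A rough sketch of what that might look like, using plain dataclasses instead of the real pydantic models. The build_field and build_resource methods, the attribute names, and the toy metadata are all assumptions made for illustration.

```python
import copy
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    description: str = ""


@dataclass
class Resource:
    name: str
    fields: list


@dataclass
class Package:
    """Holds the project's constants and hands them to lower-level classes."""

    name: str
    field_metadata: dict
    resource_metadata: dict

    def build_field(self, field_id: str) -> Field:
        # Copy so that callers cannot mutate the shared constants.
        return Field(name=field_id, **copy.deepcopy(self.field_metadata[field_id]))

    def build_resource(self, resource_id: str) -> Resource:
        meta = self.resource_metadata[resource_id]
        return Resource(
            name=resource_id,
            fields=[self.build_field(f) for f in meta["fields"]],
        )


# Toy metadata, not real PUDL or DBCP constants.
package = Package(
    name="toy",
    field_metadata={"plant_id": {"description": "Toy plant identifier."}},
    resource_metadata={"plants": {"fields": ["plant_id"]}},
)
plants = package.build_resource("plants")
```

Because every Resource and Field is built through one Package, they all draw on the same constants by construction, which sidesteps the mismatch problem from the class variable approach.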

IDK, am I overthinking this? Do you have any ideas @zaneselvans?
