Open bendnorman opened 2 years ago
I think all of the constants are being referred to in dict_from_id()
methods. Here is a list of where all of the constants are being used:
Resource.dict_from_id(): RESOURCE_METADATA, FIELD_METADATA_BY_GROUP, FIELD_METADATA_BY_RESOURCE, SOURCES, FOREIGN_KEYS
Encoder.dict_from_id(): RESOURCE_METADATA
Encoder.from_code_id(): CODE_METADATA
Encoder.to_rst(): JINJA_ENVIRONMENT, RESOURCE_METADATA
Field.dict_from_id(): FIELD_METADATA
License.dict_from_id(): LICENSES (Will licenses be shared across projects?)
Contributor.dict_from_id(): CONTRIBUTORS (Will contributors be shared across projects?)
DataSource.dict_from_id(): SOURCES
DataSource.from_field_namespace(): SOURCES
What data stored in these constants should be shared across projects? Contributors? Licenses? What about field and code metadata? This kind of gets at the type of client work we take on. If client data integration is related to PUDL, it should just be included in PUDL. If the data sources are completely separate from PUDL (like in DBCP), it makes less sense to share field and resource metadata between the projects.
I'm still leaning towards making these constants class variables for the metadata classes. Other projects can import the classes, subclass them and add new constants as class variables to the subclasses. Something like this:
import abc
import copy
from typing import ClassVar, Dict

# Base, String, and HttpUrl are assumed to be the names already defined in
# pudl.metadata.classes, where this class would live.


class License(Base, abc.ABC):
    """
    Data license (`package|resource.licenses[...]`).

    See https://specs.frictionlessdata.io/data-package/#licenses.
    """

    name: String
    title: String
    path: HttpUrl

    @property
    @abc.abstractmethod
    def LICENSES(cls) -> Dict[str, dict]:  # noqa: N805
        """Abstract LICENSES class variable; subclasses override it with a dict."""
        raise NotImplementedError

    @classmethod
    def dict_from_id(cls, x: str) -> dict:
        """Construct dictionary from PUDL identifier."""
        return copy.deepcopy(cls.LICENSES[x])

    @classmethod
    def from_id(cls, x: str) -> "License":
        """Construct from PUDL identifier."""
        return cls(**cls.dict_from_id(x))


# In the project that uses the class (here, PUDL itself):
from pudl.metadata.constants import LICENSES


class PUDLLicense(License):
    LICENSES: ClassVar = LICENSES


PUDLLicense.from_id("cc-by-4.0")
Storing the constants as class variables makes sense to me. The LICENSES class variable defines all of the licenses that PUDLLicense.from_id() can access.
How do we ensure all of the subclasses of the pudl.metadata.classes classes are referring to the same constants? For example, there wouldn't be anything stopping you from subclassing DataSource and Resource with different SOURCES constants.
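One possible guard, sketched here with plain Python stand-ins rather than the real pudl.metadata.classes classes: compare the constant on each subclass by identity and fail loudly on a mismatch. The helper name check_shared_constant, the stand-in classes, and the example dictionaries are all hypothetical.

```python
from typing import ClassVar, Dict


class DataSource:
    """Stand-in for a metadata class that reads from SOURCES."""

    SOURCES: ClassVar[Dict[str, dict]] = {}


class Resource:
    """Stand-in for another metadata class that also reads from SOURCES."""

    SOURCES: ClassVar[Dict[str, dict]] = {}


def check_shared_constant(name: str, *classes: type) -> None:
    """Raise ValueError unless every class points at the same constant object."""
    first = getattr(classes[0], name)
    for cls in classes[1:]:
        if getattr(cls, name) is not first:
            raise ValueError(
                f"{cls.__name__}.{name} is not the same object as "
                f"{classes[0].__name__}.{name}"
            )


# Hypothetical project constants, made up for illustration.
PROJECT_SOURCES = {"example_source": {"title": "Example source"}}
OTHER_SOURCES = {"other_source": {"title": "Some other source"}}


class ProjectDataSource(DataSource):
    SOURCES = PROJECT_SOURCES


class ProjectResource(Resource):
    SOURCES = PROJECT_SOURCES


class MismatchedResource(Resource):
    SOURCES = OTHER_SOURCES


# Passes silently: both subclasses share PROJECT_SOURCES.
check_shared_constant("SOURCES", ProjectDataSource, ProjectResource)
```

Calling check_shared_constant("SOURCES", ProjectDataSource, MismatchedResource) would raise a ValueError. A check like this only catches the problem if someone remembers to run it, for example in a test, so it does not make the mismatch impossible.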
All of our metadata classes are supposed to work together. For example, the Schema class has a foreign_key attribute but the foreign key relationships are enforced in the Package class.
Maybe the Package class should store the constants given it is the highest level abstraction? I don't know exactly what this would look like but the constants could be stored in the Package class and passed down to the lower abstractions like Resource and Field?
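A rough idea of what that could look like, using dataclass stand-ins instead of the real pydantic classes: the Package holds the project's constants once and is the only place lower-level objects get built from them. The method names field_from_id and resource_from_id on Package, and the example metadata, are hypothetical and not part of the current PUDL API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Field:
    """Stand-in for the Field metadata class."""

    name: str
    description: str


@dataclass
class Resource:
    """Stand-in for the Resource metadata class."""

    name: str
    fields: List[Field]


@dataclass
class Package:
    """Holds the project-wide constants and builds lower-level objects from them."""

    resource_metadata: Dict[str, dict]
    field_metadata: Dict[str, dict]

    def field_from_id(self, x: str) -> Field:
        """Build a Field from this package's field metadata."""
        return Field(name=x, **self.field_metadata[x])

    def resource_from_id(self, x: str) -> Resource:
        """Build a Resource whose fields come from the same package."""
        meta = self.resource_metadata[x]
        return Resource(
            name=x, fields=[self.field_from_id(f) for f in meta["fields"]]
        )


# Hypothetical metadata for a non-PUDL project.
package = Package(
    resource_metadata={"plants": {"fields": ["plant_id", "plant_name"]}},
    field_metadata={
        "plant_id": {"description": "Hypothetical plant identifier."},
        "plant_name": {"description": "Hypothetical plant name."},
    },
)
plants = package.resource_from_id("plants")
```

Because every Resource and Field is built through one Package, they cannot end up reading from different constants. The tradeoff is that Resource and Field can no longer be constructed from an ID on their own.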
IDK, am I overthinking this? Do you have any ideas @zaneselvans?
The classes in pudl.metadata.classes are great for controlling all things metadata in PUDL. I'd like to be able to use classes like Resource, Field, and Package in our other projects like DBCP. DBCP currently uses pandera to store metadata. Unfortunately, pudl.metadata.classes imports a lot of PUDL-specific constants like FIELD_METADATA and RESOURCE_METADATA. This makes it difficult for DBCP to use the classes with new metadata.
This feels like a decently high priority so DBCP can leverage PUDL's metadata tools prior to adding new datasets and improving existing data.
There are a few submodules under pudl.metadata that contain PUDL metadata constants:
constants (these are less PUDL specific)
codes
fields
resources
sources
How can we redesign pudl.metadata.classes so other projects can use these classes?
Make RESOURCE_METADATA an instance variable for classes like Resource and DataSource. This would be very flexible but would require folks to specify the resource_metadata attribute for every Resource.
Subclass the pudl.metadata.classes classes and specify class variables like RESOURCE_METADATA or FIELD_METADATA. This way folks wouldn't need to specify the metadata constants for every instance.
Treating these constants as instance or class variables would not enforce the existing metadata project structure we use.
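The difference between the two options can be seen in a small sketch. These are plain Python stand-ins, not the real pudl.metadata.classes classes; the class names, the to_dict method, and the example metadata are hypothetical.

```python
from typing import ClassVar, Dict


class InstanceResource:
    """Option 1: every instance carries its own metadata lookup."""

    def __init__(self, name: str, resource_metadata: Dict[str, dict]) -> None:
        self.name = name
        self.resource_metadata = resource_metadata

    def to_dict(self) -> dict:
        return self.resource_metadata[self.name]


class ClassResource:
    """Option 2: the lookup lives on the class and subclasses override it."""

    RESOURCE_METADATA: ClassVar[Dict[str, dict]] = {}

    def __init__(self, name: str) -> None:
        self.name = name

    def to_dict(self) -> dict:
        return self.RESOURCE_METADATA[self.name]


# Hypothetical metadata for a non-PUDL project.
EXAMPLE_METADATA = {"plants": {"description": "Hypothetical plants table."}}


class ProjectResource(ClassResource):
    RESOURCE_METADATA = EXAMPLE_METADATA


# Option 1: the metadata has to be passed on every construction.
a = InstanceResource("plants", resource_metadata=EXAMPLE_METADATA)

# Option 2: the metadata is set once, when the subclass is defined.
b = ProjectResource("plants")
```

Both objects resolve to the same metadata. Option 1 repeats the argument at every call site, while option 2 costs one subclass per project and metadata class.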