Right now, there are multiple fields named e.g. `description`, which leads to some annoyance when joining dataframes of different metadata, because the fields have to be renamed: https://github.com/Urban-Analytics-Technology-Platform/popgetter-cli/pull/38#discussion_r1630832625

It seems sensible to fix this upstream so that we have consistency at all stages of the data pipeline. The serialisation of metadata structs (classes, whatever) to dataframes is at:

https://github.com/Urban-Analytics-Technology-Platform/popgetter/blob/577e269faee652544ef1d38e7fe167e549ad6e48/python/popgetter/metadata.py#L32-L39

I think the cleanest way we can do this is to add aliases to the fields:

```python
from pydantic import BaseModel, Field


class DataPublisher(BaseModel):
    ...
    description: str = Field(
        description="A brief description of the organisation publishing the data, including its mandate.",
        # Only used on output, and only when dumping with by_alias=True
        serialization_alias="data_publisher_description",
    )
    ...
```

And then use `md.model_dump(by_alias=True)` when serialising.

This way, when instantiating the classes in Python, we don't need to type `data_publisher_description`.

The one small downside of this is that the generation of IDs (hashing) will not use the field aliases, just the original names. This could be confusing if we ever want to verify the IDs by re-hashing a class, but I think for the most part this is unlikely and the IDs can just be opaque. We should still add a comment to point this out anyway.
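
To make the alias behaviour and the hashing caveat concrete, here is a minimal sketch, assuming Pydantic v2. The trimmed-down `DataPublisher`, its `name` field, the example values, and the `toy_id` helper are all made up for illustration. In particular, `toy_id` is not how popgetter generates IDs; it only shows that a hash over the original field names will not match one over the aliased names.

```python
import hashlib

from pydantic import BaseModel, Field


class DataPublisher(BaseModel):
    # Hypothetical, trimmed-down stand-in for the real metadata class.
    name: str
    description: str = Field(
        description="A brief description of the organisation publishing the data.",
        serialization_alias="data_publisher_description",
    )


def toy_id(payload: str) -> str:
    """Hypothetical ID function: SHA-256 of a JSON string (not popgetter's real hashing)."""
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Instantiation still uses the short field name, not the alias.
md = DataPublisher(name="Example Office", description="Publishes example data.")

# Without by_alias the original field names are used as keys.
plain = md.model_dump()
# With by_alias=True the serialization alias replaces the field name.
aliased = md.model_dump(by_alias=True)

# Hashing the two serialisations gives different IDs, which is the caveat above.
id_from_names = toy_id(md.model_dump_json())
id_from_aliases = toy_id(md.model_dump_json(by_alias=True))

print(sorted(plain))    # ['description', 'name']
print(sorted(aliased))  # ['data_publisher_description', 'name']
print(id_from_names == id_from_aliases)  # False
```

So anyone trying to re-derive an ID from the aliased dataframe columns would first have to map the aliases back to the original field names, which is what the comment in the code should say.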