Urban-Analytics-Technology-Platform / poppusher

https://popgetter.readthedocs.io/en/latest/
Apache License 2.0

Disambiguate field names in metadata structs #109

Closed by penelopeysm 4 months ago

penelopeysm commented 5 months ago

Right now, several metadata structs have fields with the same name, e.g. description. This is a nuisance when joining dataframes of different metadata, because the clashing fields have to be renamed first: https://github.com/Urban-Analytics-Technology-Platform/popgetter-cli/pull/38#discussion_r1630832625

It seems sensible to fix this upstream so that we have consistency at all stages of the data pipeline. The serialisation of metadata structs (classes, whatever) to dataframes is at:

https://github.com/Urban-Analytics-Technology-Platform/popgetter/blob/577e269faee652544ef1d38e7fe167e549ad6e48/python/popgetter/metadata.py#L32-L39

I think the cleanest way we can do this is to add aliases to the fields:

class DataPublisher(BaseModel):
    ...
    description: str = Field(
        description="A brief description of the organisation publishing the data, including its mandate.",
        serialization_alias="data_publisher_description"
    )
    ...

We would then serialise with md.model_dump(by_alias=True).

This way, when instantiating the classes in Python, we can keep typing description rather than data_publisher_description.
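To make the proposal concrete, here is a minimal sketch of the alias approach, assuming Pydantic v2. The class is a cut-down, hypothetical stand-in for the real DataPublisher: the name field, its alias data_publisher_name, and the example values are assumptions for illustration only.

```python
from pydantic import BaseModel, Field


class DataPublisher(BaseModel):
    # Hypothetical, cut-down stand-in for the real class.
    name: str = Field(
        description="The name of the organisation publishing the data.",
        serialization_alias="data_publisher_name",  # assumed alias
    )
    description: str = Field(
        description="A brief description of the organisation publishing the data.",
        serialization_alias="data_publisher_description",
    )


# Instantiation still uses the short field names, not the aliases.
md = DataPublisher(name="Example Office", description="Publishes example data.")

# Without by_alias, the keys are the original field names.
plain = md.model_dump()

# With by_alias=True, the keys are the disambiguated aliases.
aliased = md.model_dump(by_alias=True)
```

Because serialization_alias only affects output, any code that builds these objects would stay unchanged. Only the call that dumps them to a dataframe would need by_alias=True.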

The one small downside of this is that the generation of IDs (hashing) will not use the field aliases, just the original names. This could be confusing if we ever want to verify the IDs by re-hashing a class, but I think for the most part this is unlikely and the IDs can just be opaque. We should still add a comment to point this out anyway.
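As a rough illustration of why re-hashing could be confusing, the sketch below hashes the same value under the original key and under the aliased key. The hash_record function is hypothetical: it assumes an ID is a SHA-256 digest of the JSON-serialised record, which may not match how the IDs are actually generated.

```python
import hashlib
import json


def hash_record(record: dict) -> str:
    # Hypothetical ID scheme: SHA-256 of the record as JSON with sorted keys.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Same value, keyed by the original field name and by the alias.
original = {"description": "Publishes example data."}
aliased = {"data_publisher_description": "Publishes example data."}

id_from_original = hash_record(original)
id_from_aliased = hash_record(aliased)

# The digests differ because the key names are part of the hashed payload.
ids_match = id_from_original == id_from_aliased
```

Under this assumed scheme, anyone re-hashing the serialised (aliased) output would get a different ID from the one stored, which is the case the comment should warn about.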

penelopeysm commented 4 months ago

Testing that this works will, unfortunately, require regenerating all the data!