Deltares / hydromt

HydroMT: Automated and reproducible model building and analysis
https://deltares.github.io/hydromt/
MIT License

replace placeholders with extended variant concept in `DataCatalog` #965

Open DirkEilander opened 1 month ago

DirkEilander commented 1 month ago

Kind of request

Changing existing functionality

Enhancement Description

Use case

We should discuss whether we want to merge these concepts (placeholders and variants) to give users a simpler interface.

Additional Context

No response

Jaapel commented 3 weeks ago

Suggested YAML:

my_nice_source:
  data_type: DataFrame  # not editable by variants
  ...
  variant_keys:
    - metadata.provider
    - metadata.crs
    - driver.filesystem
  variants:
    - uri: s3://bucket/key1/key2.json  # required for all variants.
      metadata:
        crs: 4326
        provider: organisation1
      driver:
        filesystem: s3
    - uri: /mnt/p/cooldata.json
      metadata:
        crs: 90002
        provider: organization2
      driver:
        filesystem: local
      default_variant: True

Here, variant_keys are the keys that uniquely identify a variant; each of them must be present in every variant definition. Dots in variant_keys denote nested fields. Other fields, such as uri, overwrite the source definition. If no variant is requested, the default variant is used, which is flagged by the default_variant key. All variants must be of the same data type, so data_type cannot be overwritten; all other fields can be.
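
For illustration, a minimal Python sketch of how a source entry following the rules above could be expanded into one fully resolved definition per variant. The helper names (get_nested, deep_merge, expand_variants) are hypothetical and not part of HydroMT, and the sketch assumes that default_variant sits at the top level of a variant.

```python
from copy import deepcopy
from typing import Any


def get_nested(mapping: dict, dotted_key: str) -> Any:
    """Resolve a dotted key such as 'metadata.crs' to mapping['metadata']['crs']."""
    value: Any = mapping
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            raise KeyError(dotted_key)
        value = value[part]
    return value


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base in which override wins; nested dicts are merged."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def expand_variants(source: dict) -> list[dict]:
    """Turn one source entry into a flat list of resolved variants (hypothetical helper)."""
    variant_keys = source.get("variant_keys", [])
    # Everything except the variant bookkeeping is shared by all variants.
    shared = {k: v for k, v in source.items() if k not in ("variant_keys", "variants")}
    resolved = []
    for variant in source.get("variants", []):
        if "data_type" in variant:
            raise ValueError("data_type cannot be overwritten by a variant")
        if "uri" not in variant:
            raise ValueError("every variant needs a uri")
        for key in variant_keys:
            get_nested(variant, key)  # raises KeyError if a variant key is missing
        resolved.append(deep_merge(shared, variant))
    # Assumption: at most one variant is flagged as the default.
    if sum(1 for v in resolved if v.get("default_variant")) > 1:
        raise ValueError("at most one variant may set default_variant")
    return resolved
```

Checking the variant keys on the variant itself, rather than on the merged result, follows the requirement that they are present in each variant definition.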

DirkEilander commented 3 weeks ago

Also discussed: DataCatalog._sources should become a dictionary of lists holding all variants (instead of the current nested dict), in which we find the requested variant by filtering. To request a specific variant, a dictionary with the source name plus the variant keys and their values is passed to the data_like argument of DataCatalog.get_rasterdataset (and similar methods), see below. If no unique variant is found, an error is raised.

da = data_catalog.get_rasterdataset(
    data_like={"source": "my_nice_source", "metadata.crs": 4326},
    ...
)

In addition to the YAML format above, in which variant_keys refer to existing keys of the data source, it should also be possible to define new keys. In the current setup these can already be added to metadata, but we could also create a dedicated variant field in DataSource. I suggest that keys specified in the variant field do not need a section prefix, to keep data requests like the one above short.

my_nice_source:
  variant_keys:
    - name
  variants:
    - uri: s3://bucket/key1/key2.json
      variant:
        name: key2
    - uri: /mnt/p/cooldata.json
      variant:
        name: cooldata
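
One possible way to resolve such unprefixed keys, as a sketch only: keys without a dot are looked up in the variant section first and fall back to the top level of the variant, while dotted keys still address nested fields. The function name variant_value is hypothetical, and the lookup order is an assumption.

```python
from typing import Any


def variant_value(variant_def: dict, key: str) -> Any:
    """Read a variant key; unprefixed keys are taken from the 'variant' section first."""
    if "." not in key and key in variant_def.get("variant", {}):
        return variant_def["variant"][key]
    # Dotted keys (and unprefixed keys not found above) are resolved from the top level.
    value: Any = variant_def
    for part in key.split("."):
        value = value[part]  # raises KeyError if the field does not exist
    return value
```

With this, a request such as {"source": "my_nice_source", "name": "key2"} would not need to spell out variant.name.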

DirkEilander commented 3 weeks ago

@hboisgon We would also like your feedback on this issue. With this new variant concept I think we have a single but flexible way to define multiple variants of the same source (before we had variant, alias and placeholder). For the cmip6 model archive it would require a longer catalog yaml file, but it offers more flexibility to accommodate small differences in format between files.