Deltares / hydromt

HydroMT: Automated and reproducible model building and analysis
https://deltares.github.io/hydromt/
MIT License

Simplify data drivers #720

Closed savente93 closed 4 months ago

savente93 commented 9 months ago

Kind of request

Currently, DataAdapters are responsible for three things: representing the different data sources in the DataCatalog, reading in the data, and transforming the data to a uniform in-memory representation. This gives the class a lot of responsibilities and makes it hard for plugins to modify or extend.

Enhancement Description

We propose that a Driver should be responsible for reading the data and creating a memory representation, while the Adapter should do generic transformations and filtering/slicing. A DataSource should represent items in the DataCatalog, which can check at read time whether all the required fields are present.
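A minimal sketch of how this split could look, to make the proposal concrete. This is not the actual HydroMT API: the class layout, the `read` / `transform` / `read_data` method names and the toy `InMemoryDriver` are all assumptions for illustration, and the field check runs when the source object is created, which is one possible reading of the validation described above.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


class Driver(Protocol):
    """Reads raw data from one or more URIs into memory (hypothetical interface)."""

    def read(self, uris: list[str], **kwargs: Any) -> dict[str, Any]: ...


@dataclass
class Adapter:
    """Generic, driver-independent transformations (here: renaming only)."""

    rename: dict[str, str] = field(default_factory=dict)

    def transform(self, data: dict[str, Any]) -> dict[str, Any]:
        # Rename variables to the uniform names; unknown names pass through.
        return {self.rename.get(key, key): value for key, value in data.items()}


@dataclass
class DataSource:
    """One catalog item: validates its fields and wires a driver to an adapter."""

    name: str
    uri: str
    driver: Driver
    adapter: Adapter = field(default_factory=Adapter)

    def __post_init__(self) -> None:
        # Fail early if a required field is missing.
        if not self.uri:
            raise ValueError(f"source {self.name!r} is missing a uri")

    def read_data(self) -> dict[str, Any]:
        raw = self.driver.read([self.uri])
        return self.adapter.transform(raw)


class InMemoryDriver:
    """Toy driver standing in for a real reader such as netcdf or zarr."""

    def read(self, uris: list[str], **kwargs: Any) -> dict[str, Any]:
        return {"pr": [1.0, 2.0], "tas": [280.0, 281.0]}


source = DataSource(
    name="demo",
    uri="memory://demo",
    driver=InMemoryDriver(),
    adapter=Adapter(rename={"pr": "precip", "tas": "temp"}),
)
print(source.read_data())
```

With this layout a plugin would only need to supply its own driver class; the adapter and the catalog item stay untouched.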

Use case

This should make testing and maintenance easier, while being more flexible to customize for plugins.

Additional Context

No response

Jaapel commented 9 months ago

Posting https://github.com/Deltares/hydromt/issues/432 here for reference, as it contains related discussions.

Jaapel commented 9 months ago

Look at this DataCatalog entry:

gtsm_codec_reanalysis_{freq}_v1:
  crs: 4326
  data_type: GeoDataset
  driver: netcdf
  kwargs:
    chunks:
      stations: 10
      time: -1
  meta:
    category: ocean
    paper_doi: 10.3389/fmars.2020.00263
    paper_ref: Muis at al (2020)
    source_license: https://cds.climate.copernicus.eu/api/v2/terms/static/licence-to-use-copernicus-products.pdf
    source_url: https://doi.org/10.24381/cds.8c59054f
    source_version: v1
  path: p:/11205028-c3s_435/01_data/01_Timeseries/timeseries2/{variable}/reanalysis_{variable}_{freq}_{year}_{month:02d}_v1.nc
  placeholders:
    freq: [10min, hourly, dailymax]
  rename:
    station_x_coordinate: lon
    station_y_coordinate: lat
    stations: index

There is a placeholder in the entry name, which can easily be expanded using the placeholders entry in the yaml document. But what about the path entry? There freq comes back, but there are also year, month and variable. For RasterDataset there is also zoom_level. Is this generic behavior for certain datasets only, or can we just use it in the more generic DataSource classes (e.g. RasterDataSource)? I want to split _resolve_paths between what the driver should be responsible for (remote data access -> filesystem.glob) and what is generic over multiple different data sources (handling naming conventions). @DirkEilander @hboisgon do you think these naming conventions are truly generic? So far the logic seems to be to fill in and capture these "known" placeholders, and to place a * wherever a placeholder is not recognized. Did I miss any behavior?
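The logic described here (fill in the known keys, put a `*` everywhere else) could be sketched roughly as below. This is an illustration under assumptions, not the current `_resolve_paths` code: the `to_glob` helper is hypothetical, the variable name `waterlevel` is made up, and a known key that was not requested is also turned into `*`.

```python
from string import Formatter

# Keys that are filled in from the runtime request when present.
KNOWN_KEYS = ("year", "month", "variable", "zoom_level")


def to_glob(path: str, **values) -> str:
    """Fill in known keys that were requested; replace every other key by '*'."""
    parts = []
    for literal, key, spec, _conversion in Formatter().parse(path):
        parts.append(literal)
        if key is None:  # trailing literal text without a field
            continue
        if key in KNOWN_KEYS and key in values:
            # Respect format specs such as {month:02d}.
            parts.append(format(values[key], spec or ""))
        else:
            parts.append("*")
    return "".join(parts)


pattern = to_glob(
    "{variable}/reanalysis_{variable}_{freq}_{year}_{month:02d}_v1.nc",
    variable="waterlevel",
    year=2010,
)
print(pattern)
```

The resulting pattern could then be handed to something like `filesystem.glob`, which is the part that would stay with the driver.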

hboisgon commented 9 months ago

Let me try to answer: placeholder is really different from the others because it helps to define multiple data sources (from the same dataset) that have exactly the same reading attributes apart from the path. The best example where we use this is cmip6 data where we can define in one data catalog entry 23*2 data sources:

cmip6_{model}_historical_{member}_{timestep}:
  crs: 4326
  data_type: RasterDataset
  driver: zarr
  filesystem: gcs
  kwargs:
    drop_variables: [time_bnds, lat_bnds, lon_bnds, bnds]
    decode_times: true
    preprocess: harmonise_dims
    consolidated: true
  meta:
    category: climate
    paper_doi: 10.1175/BAMS-D-11-00094.1
    paper_ref: Taylor et al. 2012
    source_license: CC BY 4.0
    source_url: https://console.cloud.google.com/marketplace/details/noaa-public/cmip6?_ga=2.136097265.-1784288694.1541379221&pli=1
    source_version: 1.3.1
  placeholders:
    model: [IPSL/IPSL-CM6A-LR, SNU/SAM0-UNICON, NCAR/CESM2, NCAR/CESM2-WACCM, INM/INM-CM4-8, INM/INM-CM5-0, NOAA-GFDL/GFDL-ESM4, NCC/NorESM2-LM, NIMS-KMA/KACE-1-0-G,
      CAS/FGOALS-f3-L, CSIRO-ARCCSS/ACCESS-CM2, NCC/NorESM2-MM, CSIRO/ACCESS-ESM1-5, NCAR/CESM2-WACCM-FV2, NCAR/CESM2-FV2, CMCC/CMCC-CM2-SR5, AS-RCEC/TaiESM1,
      NCC/NorCPM1, IPSL/IPSL-CM5A2-INCA, CMCC/CMCC-CM2-HR4, CMCC/CMCC-ESM2, IPSL/IPSL-CM6A-LR-INCA, E3SM-Project/E3SM-1-0]
    member: [r1i1p1f1]
    timestep: [day, Amon]
  path: gs://cmip6/CMIP6/CMIP/{model}/historical/{member}/{timestep}/{variable}/*/*
  rename:
    pr: precip
    tas: temp
    rsds: kin
    psl: press_msl
  unit_add:
    temp: -273.15
  unit_mult:
    precip: 86400
    press_msl: 0.01

So placeholders really apply to all DataSource types, and every placeholder keyword should be findable in the path.
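As a rough sketch of what this expansion amounts to (an assumption about the behavior, not the actual catalog parser; `expand_placeholders` is a hypothetical helper): every combination of placeholder values becomes its own source, and keys such as `{variable}` that are not listed under placeholders are left in the path for later.

```python
from itertools import product


def expand_placeholders(name: str, entry: dict) -> dict:
    """Turn one catalog entry with placeholders into one entry per combination."""
    placeholders = entry.get("placeholders", {})
    keys = list(placeholders)
    sources = {}
    for combo in product(*(placeholders[key] for key in keys)):
        new_name, path = name, entry["path"]
        for key, value in zip(keys, combo):
            # Plain replacement, so other keys such as {variable} stay untouched.
            new_name = new_name.replace("{" + key + "}", str(value))
            path = path.replace("{" + key + "}", str(value))
        new_entry = {k: v for k, v in entry.items() if k != "placeholders"}
        new_entry["path"] = path
        sources[new_name] = new_entry
    return sources


entry = {
    "driver": "zarr",
    "path": "gs://cmip6/CMIP6/CMIP/{model}/historical/{member}/{timestep}/{variable}/*/*",
    "placeholders": {
        "model": ["NCAR/CESM2", "INM/INM-CM5-0"],
        "member": ["r1i1p1f1"],
        "timestep": ["day", "Amon"],
    },
}
sources = expand_placeholders("cmip6_{model}_historical_{member}_{timestep}", entry)
for source_name in sources:
    print(source_name)
```

With the two models used here this yields 2*1*2 = 4 sources; the full list of 23 models gives the 23*2 mentioned above.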

The rest are "known" keywords in the path that hydromt can use to directly slice data when reading a data source. For example in some get_data methods you can pass time_tuple (then uses year, month keywords if present) or variables list (then uses the variable keyword if present). In the case of your example or ERA5, if you request in get_data to only get precipitation for a year, this allows hydromt to read only one file precip_2001.nc instead of all netcdf files for all years and all variables before slicing (so faster and potentially less memory consumption).
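To illustrate why this saves reading, here is a small sketch that turns a request into the list of files to open. The helpers `months_between` and `paths_for` and the variable name `waterlevel` are hypothetical, and the template is a shortened form of the path in the first example; the real implementation may work differently.

```python
from datetime import date


def months_between(start: str, end: str) -> list[tuple[int, int]]:
    """All (year, month) pairs touched by the inclusive range start..end."""
    first, last = date.fromisoformat(start), date.fromisoformat(end)
    pairs = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        pairs.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return pairs


def paths_for(template: str, variables: list[str], time_range: tuple[str, str]) -> list[str]:
    """Only the files needed for the requested variables and time range."""
    return [
        template.format(variable=variable, year=year, month=month)
        for variable in variables
        for year, month in months_between(*time_range)
    ]


template = "{variable}/reanalysis_{variable}_hourly_{year}_{month:02d}_v1.nc"
paths = paths_for(template, ["waterlevel"], ("2010-01-01", "2010-03-31"))
for path in paths:
    print(path)
```

Three files are opened for this request instead of every year and every variable in the dataset.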

But, as with zoom_level, not all of these keywords may be applicable to all types of DataSource. I am not sure offhand which applies to which, but basically you can check the driver's arguments and see whether you can pass it time_tuple, variables and/or zoom_level.

Maybe one final example to try and understand the difference between placeholder and known keywords:

# Placeholders have to be replaced in the data source name to get the data, and keywords can be passed in the get_data method arguments
data_catalog.get_geodataset("gtsm_codec_reanalysis_hourly_v1", variables = ["precip"], time_tuple=("2010-01-01", "2010-03-31"))
# Get the 10min version of the dataset instead for all times and variables
data_catalog.get_geodataset("gtsm_codec_reanalysis_10min_v1")

DirkEilander commented 9 months ago

In addition to what @hboisgon said: the placeholders are resolved when parsing the data catalog. The path format arguments are checked in the resolve path step, which should be part of the new Driver class because some drivers will need a driver-specific resolve path method (e.g. tiled datasets without vrt, such as the copernicus dem on s3 example).

We can discuss whether the placeholder architecture can be replaced by an extended implementation of the variants, which might be clearer to users. It would result in slightly longer data catalog files but more flexibility (e.g. some driver kwargs can be specific to one variant). Currently the variants only support version and provider, and I'm not sure how easy it is to generalize this. @savente93 @hboisgon Is this worth exploring? Anyway, this is another topic.

hboisgon commented 9 months ago

I was wondering the same: could we replace placeholders with variants? Maybe worth exploring in a new issue (for v1)? If we do it well, data catalogs would be longer but easier for the user to understand. So worth exploring, I think.

Jaapel commented 9 months ago

So far I intend to place a generic solution with hydromt keywords year, month, variable, zoom_level, at the DataSource level. Drivers can then fill in any {{key}} themselves, as they will get 1 or more URIs. Is the fact that we add an extra { to the key because of some windows (driver specific) reason, or does it have another reason?

savente93 commented 9 months ago

I think there is definitely something to this idea, but it would be good to have a (short) design session around it. One thing we definitely want is a distinction between the kinds of placeholders, since they need to be handled at different times, if I understand correctly. I'm not sure what the correct terminology is, but for now I'll call them data-slice placeholders (var=precip) and file path placeholders (year/month/freq). One thing I personally find annoying about the current placeholder implementation is that it doesn't communicate possible values, such as year or variable in the first example. Additionally, especially when dealing with cloud file systems, any processing we can do up front without having to ask the fs for information is going to speed up the process, so if possible I'm in favour of that. So I'm definitely in favour of looking further into using the variants.

DirkEilander commented 9 months ago

So far I intend to place a generic solution with hydromt keywords year, month, variable, zoom_level, at the DataSource level. Drivers can then fill in any {{key}} themselves, as they will get 1 or more URIs.

My reason for implementing the generic resolve path solution at the DataDriver level is that it can then easily be extended or skipped by custom drivers. For instance, the zoom_level key is only relevant for some RasterDataDrivers, and the file check with fsspec won't work for many custom Drivers that target specific APIs like gww. Just putting this here to keep these use cases in mind; it could well be that they are covered in your approach too.

Is the fact that we add an extra { to the key because of some windows (driver specific) reason, or does it have another reason?

The double { are only used to escape unknown keys. For instance, if your path looks like C:/{long-vm-ware-uuid}/merit/{variable}.tif, in _resolve_path we first convert it to C:/{{long-vm-ware-uuid}}/merit/{variable}.tif so that we can then format the string with variable="my_variable" without getting errors that "long-vm-ware-uuid" is unknown or similar.
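A small sketch of that escaping step using only the standard library. The `escape_unknown` helper is hypothetical and not the actual `_resolve_path` code; it only shows why doubling the braces makes `str.format` leave the unknown key alone.

```python
from string import Formatter

KNOWN_KEYS = ("year", "month", "zoom_level", "variable")


def escape_unknown(path: str, known: tuple[str, ...] = KNOWN_KEYS) -> str:
    """Double the braces around every key that is not a known path keyword."""
    parts = []
    for literal, key, spec, conversion in Formatter().parse(path):
        # Literal braces were un-escaped by parse(); escape them again.
        parts.append(literal.replace("{", "{{").replace("}", "}}"))
        if key is None:
            continue
        field = key
        if conversion:
            field += "!" + conversion
        if spec:
            field += ":" + spec
        if key in known:
            parts.append("{" + field + "}")
        else:
            parts.append("{{" + field + "}}")
    return "".join(parts)


raw = "C:/{long-vm-ware-uuid}/merit/{variable}.tif"
escaped = escape_unknown(raw)
resolved = escaped.format(variable="my_variable")
print(escaped)
print(resolved)
```

Calling `raw.format(variable="my_variable")` directly would raise a `KeyError` for `long-vm-ware-uuid`; after escaping, the unknown key comes out of `format` as a literal `{long-vm-ware-uuid}`.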

DirkEilander commented 9 months ago

One thing we definitely want is a distinction between the kinds of placeholders, since they need to be handled at different times. I'm not sure what the correct terminology is, but for now I'll call them data-slice placeholders (var=precip) and file path placeholders (year/month/freq).

Just to clarify the discussion: we have HydroMT path keywords. These are resolved based on the runtime request, so that only a slice of the data is read (currently in DataAdapter, but this will be moved to DataSource/Driver). Currently these keywords are ["year", "month", "zoom_level", "variable"].

Next to this we have placeholders and variants, which are concepts to define multiple variants of the same source more easily in the data catalog. These are resolved when reading the data catalog and result in unique source items. Placeholders can be anything defined by the user. Variants can only be specified based on version and provider.

The concepts of placeholders and variants could perhaps be merged (to be discussed) and might help to solve the confusion between placeholders and path keywords.

Jaapel commented 7 months ago

[Image: HydroMT_v1-Driver + Resolver (drawio diagram)] Proposed for coming refinement

Jaapel commented 7 months ago

[Image: HydroMT_v1-Driver + Resolver (drawio diagram)] New version based on discussions

savente93 commented 4 months ago

I think this is resolved with the current driver implementation in v1