Add caravan dataset

Daafip commented 6 months ago

'Caravan is an open community dataset of meteorological forcing data, catchment attributes, and discharge data for catchments around the world'. There is also a notable publication describing it.

It would be nice to be able to access this easily as a standardised dataset in eWaterCycle. Currently you need to download a 12 GB zip file and split the files yourself. Is there a way to integrate this nicely?

'The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.' As far as I understand we are allowed to redistribute it?

BSchilperoort commented 6 months ago

Nice that the dataset is available on Zenodo now. When I chatted with Frederik (the PI) at the last EGU, you basically had to go through Google Earth Engine to get the data.

It is a bit frustrating that everything is in a single zip file though. It would have been nice if the parts were more split up.

For integration with eWaterCycle I see one main hurdle:

> dataset of meteorological forcing data, catchment attributes, and discharge data

This combines eWaterCycle's separate forcing, parameter sets and observations in one block. So if you'd want to integrate this it would have to be split up into a CaravanForcing, a caravan ParameterSet definition, and a caravan.py module in observations.

Of course this is still completely viable.

We do not have a ZenodoDownloader implemented, but there is a placeholder here.

> As far as I understand we are allowed to redistribute it?

Yes. But we don't need to if we would implement a downloader. The users themselves would be downloading it.
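
A minimal sketch of what such a user-side download step could look like, using only the Python standard library. This is not the eWaterCycle placeholder mentioned above: the function names `download_file` and `extract_matching` are hypothetical, and the URL of the archive is deliberately left to the caller.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def download_file(url: str, target: Path) -> Path:
    """Stream a (large) file to disk, skipping the download if it already exists."""
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
    return target


def extract_matching(archive: Path, dest: Path, suffix: str = ".nc") -> list[Path]:
    """Extract only the archive members whose name ends with `suffix`."""
    with zipfile.ZipFile(archive) as zf:
        members = [name for name in zf.namelist() if name.endswith(suffix)]
        zf.extractall(dest, members=members)
    return [dest / name for name in members]
```

Extracting only the members that match a suffix avoids unpacking the whole archive when a user needs just the netCDF time series.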

Daafip commented 6 months ago

The naming scheme is also different (as expected). Here is what worked for me to use the CAMELS files locally.

date is used as the dimension name rather than time:

ds = ds.rename_dims({'date': 'time'})
ds = ds.rename({'date': 'time'})

pr, pev, etc. all have different names. To get HBV to run:

RENAME_CAMELS = {'total_precipitation_sum':'pr',
                'potential_evaporation_sum':'pev',
                'streamflow':'Q'} 
ds = ds.rename(RENAME_CAMELS)
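
The two renaming steps above can be folded into one helper. The sketch below assumes xarray's `Dataset.rename`, which renames dimensions and variables alike; the helper name `to_ewatercycle_names` and the synthetic demo dataset are illustrative only. It filters the mapping first, because `rename` raises an error for names that are not in the dataset.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Caravan name -> name expected by the model (taken from the snippets above).
CARAVAN_TO_EWC = {
    "date": "time",
    "total_precipitation_sum": "pr",
    "potential_evaporation_sum": "pev",
    "streamflow": "Q",
}


def to_ewatercycle_names(ds: xr.Dataset) -> xr.Dataset:
    """Rename only those Caravan names that are actually present in `ds`."""
    present = {
        old: new
        for old, new in CARAVAN_TO_EWC.items()
        if old in ds.variables or old in ds.dims
    }
    return ds.rename(present)


# Tiny synthetic dataset standing in for one Caravan basin file.
demo = xr.Dataset(
    {
        "total_precipitation_sum": ("date", np.array([1.0, 0.0, 2.5])),
        "streamflow": ("date", np.array([0.3, 0.2, 0.9])),
    },
    coords={"date": pd.date_range("2000-01-01", periods=3)},
)
renamed = to_ewatercycle_names(demo)
```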

BSchilperoort commented 6 months ago

I had a look at the Caravan data. There is one file per catchment, and the catchment attributes are in separate files. The netCDF files do not use variable attributes; the units are instead defined in the global attrs... 😵‍💫

It would be possible to reorganize this: move the variable attributes to the proper locations, and merge the separate basin files into a single netCDF file (one per CAMELS dataset). On a new "basin" dimension, you can then add the basin IDs as coordinates, as well as the metadata as additional variables.
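
A rough sketch of the reorganization described above, assuming xarray and small in-memory datasets in place of the real per-basin files. The helper name `merge_basins`, the basin IDs, and the `units` mapping (including the "mm/d" value) are placeholders; how the units are parsed out of the global attributes is not shown here.

```python
import numpy as np
import pandas as pd
import xarray as xr


def merge_basins(per_basin: dict, units: dict) -> xr.Dataset:
    """Stack per-basin datasets on a new 'basin' dimension and set variable units."""
    ids = list(per_basin)
    merged = xr.concat(
        [per_basin[basin] for basin in ids],
        dim=pd.Index(ids, name="basin"),  # basin IDs become the coordinate
    )
    for name, unit in units.items():
        if name in merged.data_vars:
            merged[name].attrs["units"] = unit  # units now live on the variable
    return merged


time = pd.date_range("2000-01-01", periods=3)
per_basin = {
    "basin_a": xr.Dataset(
        {"streamflow": ("time", np.array([0.3, 0.2, 0.9]))}, coords={"time": time}
    ),
    "basin_b": xr.Dataset(
        {"streamflow": ("time", np.array([1.1, 1.0, 1.4]))}, coords={"time": time}
    ),
}
merged = merge_basins(per_basin, units={"streamflow": "mm/d"})
one_basin = merged.sel(basin="basin_b")
```

With the IDs as coordinates on the new dimension, selecting a single basin is a plain `.sel` call.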

The netCDF files are also not optimally compressed. I was able to compress it to 38% of the original netCDF size. So all Caravan netCDF files would only be about 6.3 GB.
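
The thread does not say which settings produced the 38% figure. As an illustration only, xarray lets you request zlib compression per variable through the `encoding` argument of `to_netcdf` (with the netCDF4 backend). The helper name and the compression level below are assumptions.

```python
def compression_encoding(var_names, complevel: int = 4) -> dict:
    """Build a per-variable `encoding` dict that enables zlib compression."""
    return {name: {"zlib": True, "complevel": complevel} for name in var_names}


encoding = compression_encoding(["pr", "pev", "Q"])
# With an xarray Dataset `ds` this would be passed as:
# ds.to_netcdf("caravan_compressed.nc", encoding=encoding)
```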

I think before adding the caravan dataset to eWaterCycle, we'd have to go through the following steps:

1. Properly format the netCDF files (attributes, add metadata).
2. Merge them per CAMELS dataset (e.g. CAMELS-GB becomes one netCDF).
3. Upload these netCDF files to data.4tu.nl, so that they can be accessed using OPeNDAP.
4. Write the code for eWaterCycle (which should then be very simple and straightforward).

To then get the data for a CAMELS basin, all that's needed is:

import xarray as xr

def get_camels(dataset: str, basin_id: str):
    # Open the merged per-dataset file remotely, then select one basin.
    ds = xr.open_dataset(f"https://data.4tu.nl/.../{dataset}")
    return ds.sel(basin=basin_id)

BSchilperoort commented 6 months ago

It was faster to just write the conversion notebook than to discuss this/think of when to do this.

Here's the notebook: https://gist.github.com/BSchilperoort/256751fe2ea060c50b103f72026590a2

Now we'd just need to upload it to https://data.4tu.nl (perhaps along with the shapefiles...). However, the free limit for non-associated Dutch researchers is 5 GB/year, and the data will be just over 6 GB. Once I have the TU Delft guest account I can upload it.

Daafip commented 6 months ago

> Now we'd just need to upload it to https://data.4tu.nl (along with the shapefiles as well perhaps...). However the free limit for non-associated dutch researchers is 5 GB/year,

I got around this as a student: https://data.4tu.nl/collections/bf0eaf7c-f2fa-46f6-b8cd-77ad939dd350

Daafip commented 6 months ago

Waiting now for my submission of the data to be approved. Once that goes through I'll:

[x] Link the dataset to the collection
[x] Add another dataset with the shape files
[x] Link to the notebook mentioned above correctly

BSchilperoort commented 6 months ago

By the way, there's also a GRDC extension to Caravan now: https://zenodo.org/records/8425587

Daafip commented 6 months ago

Now on the OPeNDAP server: Kratzert, Frederik; Schilperoort, Bart; Haasnoot, David; Hut, R.W. (Rolf) (2024): Caravan - A global community dataset for large-sample hydrology. Version 1. 4TU.ResearchData. dataset. https://doi.org/10.4121/ca13056c-c347-4a27-b320-930c2a4dd207

Daafip commented 6 months ago

> To then get the data for a CAMELS basin, all that's needed is:
>
> def get_camels(dataset: str, basin_id: str):
>     ds = xr.open_dataset(f"https://data.4tu.nl/.../{dataset}")
>     return ds.sel(basin=basin_id)

Works for me using:

import xarray as xr

def get_camels(dataset: str, basin_id: str):
    ds = xr.open_dataset(f"https://opendap.4tu.nl/thredds/dodsC/data2/djht/ca13056c-c347-4a27-b320-930c2a4dd207/1/{dataset}.nc")
    # The lookup on basin_id uses bytes here, so encode the string ID first.
    return ds.sel(basin_id=basin_id.encode())

Daafip commented 5 months ago

Now available in the main branch. One downside (of the whole dataset) is that it can be difficult for a user to find which basin_id they actually want. Right now that means looking through the dataset and/or downloading the shapefile manually. We could easily incorporate something like a web map or folium map. See the example I made previously for KNMI weather stations.
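
Until there is a map, a simple text search over the IDs can already help. The sketch below is hypothetical: the helper name `find_basin_ids` and the example IDs are made up for illustration. It accepts IDs as bytes or as strings and returns the ones containing a search pattern.

```python
def find_basin_ids(basin_ids, pattern: str) -> list[str]:
    """Return the decoded basin IDs that contain `pattern` (case-insensitive)."""
    decoded = [b.decode() if isinstance(b, bytes) else str(b) for b in basin_ids]
    return sorted(i for i in decoded if pattern.lower() in i.lower())


# Made-up IDs standing in for the values of the basin_id coordinate.
example_ids = [b"camels_01013500", b"camels_01022500", b"camelsgb_10002"]
matches = find_basin_ids(example_ids, "0135")
```

With an opened dataset, the first argument could be the values of the basin_id coordinate, for example `ds["basin_id"].values`.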

Daafip commented 5 months ago

Can be found here: https://github.com/Daafip/caravan-map

BSchilperoort commented 5 months ago

> Can be found here: https://github.com/Daafip/caravan-map

Nice! Can't you host it on github pages?

Daafip commented 5 months ago

> Can't you host it on github pages?

Hadn't thought of that.

Daafip commented 5 months ago

> Nice! Can't you host it on github pages?

https://daafip.github.io/caravan-map/

BSchilperoort commented 5 months ago

Thanks for adding this, David.

Peter and Stefan liked the map view a lot. They see potential in using it to build a simpler "click on the basin, get an ewatercycle notebook with forcing generation + model(s)" interface (something eWaterCycle used to have), but that's for some other time.

eWaterCycle / ewatercycle

Add caravan dataset #398