esgf2-us / intake-esgf

Programmatic access to the ESGF holdings
https://intake-esgf.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Proposal: Add Pangeo/ESGF Cloud Zarr data as search-index. #44

Open jbusecke opened 2 months ago

jbusecke commented 2 months ago

I loved discovering this tool at the ESGF meeting.

I would like to follow up on @nocollier's suggestion to add the data ingested by the Pangeo/ESGF Cloud working group as a search index.

What we have.

Requirements:

nocollier commented 2 months ago

At the moment I am only downloading, but that was not my long-term intention. In my mind we would have an interface such as to_dataset_dict(prefer_stream=True) or something similar. Then, as we look at the responses for the various access methods, instead of triggering the download we would just pass the appropriate handle to xarray. I say "prefer" because you may have datasets in your catalog without stream options and I wouldn't want things to fail. If you wanted to be sure you only ever streamed data, you could add access='kerchunk' or similar to your search to get only records with a particular access method. What do you think?
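A minimal sketch of the "prefer" semantics described above, not the package's actual implementation. The `Record` class, the `choose_access` helper, the method names in `STREAM_METHODS`, and the example URLs are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

# Assumed names for streamable access methods, in order of preference.
STREAM_METHODS = ("kerchunk", "zarr", "opendap")


@dataclass
class Record:
    """Hypothetical search result: a dataset id plus its access options."""
    dataset_id: str
    access: dict  # maps access method name -> URL


def choose_access(record, prefer_stream=False):
    """Return (method, url), streaming if preferred and available.

    Falls back to plain download so that records without a stream
    option do not make the whole request fail.
    """
    if prefer_stream:
        for method in STREAM_METHODS:
            if method in record.access:
                return method, record.access[method]
    if "https" in record.access:
        return "https", record.access["https"]
    raise ValueError(f"no usable access method for {record.dataset_id}")


both = Record("ds1", {"https": "https://example.org/ds1.nc",
                      "kerchunk": "https://example.org/ds1.json"})
download_only = Record("ds2", {"https": "https://example.org/ds2.nc"})

print(choose_access(both, prefer_stream=True)[0])
print(choose_access(download_only, prefer_stream=True)[0])
```

With this shape, a stricter filter such as access='kerchunk' would belong in the search step, so that download-only records never reach the selection logic.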

jbusecke commented 2 months ago

Could we just selectively bypass the _move_data step? This could be paired with passing some kwargs to the xarray open logic.

nocollier commented 2 months ago

Sorry we are dragging our heels a bit. The problem is that we need to refactor to_dataset_dict() and don't have time just now. I just need time with some other projects I have been ignoring prior to the ESGF meeting and then this is top priority.

I wrote the package without looking too deeply at intake-esm and also mainly thinking about download. So to_dataset_dict() grew into something that is messy and harder to refactor. In my mind a rework is coming that would:

That is the plan in my mind. I just need to get clear of some other things to work on this, and I need Max to help me reconcile what intake-esm is doing. If you have suggestions or comments, particularly on the interface, I would welcome them.

nocollier commented 2 months ago

It also strikes me that, while I see no compelling reason not to include all indices as options, ESGF project managers may push back because data hosted elsewhere is not quality controlled. Of course, even our own holdings have lots of issues, but I think they are more concerned with version updates. The messaging queue in the works will help with this. However, I would counter that by including these community-built indices as options, we give their maintainers a way to compare with other data in the ESGF index and stay better up to date. Just another concern to think about.

jbusecke commented 2 months ago

This all sounds great! And for my part this is not immediately time sensitive! I would also be happy with providing the index as a plug-in (and warning the user accordingly that this is not 'official' ESGF data).

In pseudocode:

from intake_esgf.index import CSVIndex  # proposed class
from intake_esgf import ESGFCatalog

csv_index = CSVIndex('s3://path/to/csv_file', ...)
cat = ESGFCatalog(custom_index=csv_index, ...)  # custom_index is a proposed argument
# this will always issue a warning when using non-official indices
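One way the "always warn for non-official indices" behaviour could look is sketched below. This is a standalone toy, not the intake-esgf API: `UnofficialIndexWarning`, the `official` attribute, this `CSVIndex`, and the `Catalog` class are all hypothetical names chosen for illustration.

```python
import warnings


class UnofficialIndexWarning(UserWarning):
    """Raised as a warning when a catalog uses a non-ESGF index."""


class CSVIndex:
    """Hypothetical plug-in index backed by a CSV file."""
    official = False  # community-built, not quality controlled by ESGF

    def __init__(self, path):
        self.path = path


class Catalog:
    """Toy stand-in for a catalog that accepts a plug-in index."""

    def __init__(self, custom_index=None):
        self.indices = []
        if custom_index is not None:
            # Warn every time an index does not declare itself official.
            if not getattr(custom_index, "official", False):
                warnings.warn(
                    f"{type(custom_index).__name__} is not an official "
                    "ESGF index; its data is not quality controlled by ESGF.",
                    UnofficialIndexWarning,
                    stacklevel=2,
                )
            self.indices.append(custom_index)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cat = Catalog(custom_index=CSVIndex("s3://path/to/csv_file"))

print(len(caught), caught[0].category.__name__)
```

Using a dedicated warning category would let users who accept the caveat silence it with a standard `warnings.filterwarnings` rule, without hiding other warnings.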

Separately from that, it would be nice to get our data 'checked and approved' so that we can fix all the things necessary and then maybe become an official index, but that is a secondary goal IMO.

Please feel free to ping me if you find some cycles, and I'd be happy to help with this if that would be useful.