Proposal: Add Pangeo/ESGF Cloud Zarr data as search-index.

jbusecke commented 2 months ago

I loved discovering this tool at the ESGF meeting.

I would like to follow up on the suggestion of @nocollier to add the data ingested as part of the Pangeo/ESGF Cloud working group.

What we have.

Basically an intake-esm CSV file that has all the facets and a store (pointing to a GCS URL). The file is fully public.
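For readers unfamiliar with the layout, here is a minimal sketch of what such a catalog could look like and how records might be selected from it. The column names (`source_id`, `experiment_id`, `variable_id`, `zstore`) and the `gs://` paths are illustrative assumptions, not rows from the actual file, and the sketch uses only the standard library rather than intake-esm itself.

```python
import csv
import io

# Hypothetical catalog content: facet columns plus a store column holding
# the location of each Zarr store. Real column names may differ.
CATALOG_CSV = """source_id,experiment_id,variable_id,zstore
MODEL-A,historical,tas,gs://example-bucket/MODEL-A/historical/tas
MODEL-A,ssp585,tas,gs://example-bucket/MODEL-A/ssp585/tas
MODEL-B,historical,pr,gs://example-bucket/MODEL-B/historical/pr
"""


def search(catalog_text, **facets):
    """Return the rows whose facet columns match every keyword given."""
    rows = csv.DictReader(io.StringIO(catalog_text))
    return [
        row for row in rows
        if all(row[key] == value for key, value in facets.items())
    ]


matches = search(CATALOG_CSV, experiment_id="historical", variable_id="tas")
stores = [row["zstore"] for row in matches]
print(stores)
```

Each store URL could then be handed to xarray for lazy opening (for example with `xarray.open_zarr`, given a filesystem backend for the URL scheme), without copying data to local disk.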

Requirements:

Hearing @nocollier speak about this, I realized that you currently (always?) cache to a local disk. Can we make this optional and enable opening a dataset lazily?

nocollier commented 2 months ago

At the moment I am only downloading, but that was not my long-term intention. In my mind we would have an interface such as to_dataset_dict(prefer_stream=True) or something similar. Then, as we look at the responses for the various access methods, instead of triggering the download we would just pass the appropriate handle to xarray. I say prefer because you may have datasets in your catalog without stream options and I wouldn't want things to fail. If you wanted to be sure you only ever streamed data, you could add access='kerchunk' or similar to your search to get only records with a particular access method. What do you think?
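To make the "prefer, but do not fail" idea concrete, here is a minimal sketch of the selection step. It is not intake-esgf code: the record layout (a dict mapping access method to URL), the `choose_access` helper, and the set of methods treated as streamable are all assumptions made for illustration.

```python
# Access methods treated as streamable in this sketch (an assumption).
STREAMING_METHODS = ("zarr", "kerchunk", "opendap")


def choose_access(record, prefer_stream=True):
    """Pick one (method, url) pair for a record.

    `record` is a hypothetical dict mapping access method -> URL and is
    assumed to always contain a "download" entry.
    """
    if prefer_stream:
        for method in STREAMING_METHODS:
            if method in record:
                return method, record[method]
    # No streaming option (or streaming not requested): fall back to
    # downloading instead of raising an error.
    return "download", record["download"]


with_stream = {"download": "https://example.org/file.nc",
               "kerchunk": "https://example.org/file.json"}
download_only = {"download": "https://example.org/other.nc"}

print(choose_access(with_stream))
print(choose_access(download_only))
```

A record without any streaming entry still resolves to its download link, which is the behavior the word "prefer" is meant to convey.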

jbusecke commented 2 months ago

Could we just selectively bypass the _move_data step? This could be paired with passing some kwargs to the xarray open logic.

nocollier commented 2 months ago

Sorry we are dragging our heels a bit. The problem is that we need to refactor to_dataset_dict() and don't have time just now. I just need time with some other projects I have been ignoring prior to the ESGF meeting and then this is top priority.

I wrote the package without looking too deeply at intake-esm and also mainly thinking about download. So to_dataset_dict() grew into something that is messy and harder to refactor. In my mind a rework is coming that would:

Allow for the intake-esm indices to be configured as any other index currently supported. You may wish to include them as options, or you may wish to have those be the only index you use.
Provide options to allow streaming. Maybe to_dataset_dict(prefer_streaming=True) with a configuration intake_esgf.conf.set(streaming_preference=['zarr','kerchunk','opendap']). I say prefer because some datasets in your index records may not have streaming access. Perhaps there is another option to set if you only ever want to stream and error if no streaming link is available.
Provide a cat.to_local_file_list() interface for those who need to download but either don't want to use xarray or need the local file locations for another reason. Our IPSL collaborators have no internet access on their HPC and so they must download and then use a special transfer method to get files where they need to be. This would let them use this and push data to their resources.
Provide a cat.to_link_list() for users like yourself, who don't like whatever default we embed in to_dataset_dict() and may just want the link to the kerchunk file (or whatever, I am still green with these technologies) so they can handle the call to xarray as they wish.
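The streaming-related items in this plan can be sketched as a single function. This is only an illustration of the intended behavior, not the planned implementation: the record layout, the `stream_only` flag, the `NoStreamingLinkError` class, and this standalone `to_link_list` are hypothetical names introduced here.

```python
class NoStreamingLinkError(LookupError):
    """Raised in strict mode when a record offers no streaming link."""


def to_link_list(records,
                 streaming_preference=("zarr", "kerchunk", "opendap"),
                 stream_only=False):
    """Return one link per record, honoring the preference order.

    Each record is a hypothetical dict with an "id" and a "links" dict
    mapping access method -> URL.
    """
    links = []
    for record in records:
        for method in streaming_preference:
            if method in record["links"]:
                links.append(record["links"][method])
                break
        else:
            # The loop found no streaming method for this record.
            if stream_only:
                raise NoStreamingLinkError(record["id"])
            links.append(record["links"]["download"])
    return links


records = [
    {"id": "a", "links": {"download": "d1", "opendap": "o1", "zarr": "z1"}},
    {"id": "b", "links": {"download": "d2"}},
]
print(to_link_list(records))
```

With `stream_only=True` the second record would raise instead of falling back, which covers the "only ever want to stream" case; the caller remains free to open the returned links with xarray however they wish.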

That is the plan in my mind. I just need to get clear of some other things to work on this, and I need Max to help me reconcile what intake-esm is doing. If you have suggestions or comments, particularly on the interface, I would welcome them.

nocollier commented 2 months ago

It also strikes me that, while I see no compelling reason not to include all indices as options, ESGF project managers may push back because data hosted elsewhere is not quality controlled. Of course, even our own holdings have lots of issues, but I think they are more concerned with version updates. The messaging queue in the works will help with this. However, I would counter that by including these community-built indices as options, we give their maintainers a way to compare with other data in the ESGF index and stay better up to date. Just another concern to think about.

jbusecke commented 2 months ago

This all sounds great! And for my part this is not immediately time sensitive! I would also be happy with providing the index as a plug-in (and warning the user accordingly that this is not 'official' ESGF data).

In pseudocode:

# proposed interface; CSVIndex and custom_index are hypothetical names
from intake_esgf.index import CSVIndex
from intake_esgf import ESGFCatalog

csv_index = CSVIndex('s3://path/to/csv_file', ...)
cat = ESGFCatalog(custom_index=csv_index, ...)
# this will always issue a warning when using non-official indices
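One way the warning behavior could work is sketched below with a stand-in class. Nothing here is part of intake-esgf: this `CSVIndex`, its `search` method, and `UnofficialIndexWarning` are hypothetical and only show how a plug-in index might flag its results as unofficial.

```python
import warnings


class UnofficialIndexWarning(UserWarning):
    """Signals that results come from a community-maintained index."""


class CSVIndex:
    """Stand-in for the proposed plug-in index; not an existing class."""

    def __init__(self, path):
        self.path = path

    def search(self, **facets):
        # Warn on every search so the user knows the data source.
        warnings.warn(
            f"{self.path} is not an official ESGF index; its data is not "
            "quality controlled by ESGF.",
            UnofficialIndexWarning,
            stacklevel=2,
        )
        return []  # a real index would return matching records here


index = CSVIndex("s3://path/to/csv_file")
results = index.search(variable_id="tas")
```

Python's default warning filter shows a given warning only once per call site, so a user who wants to see it on every search would need a filter such as `warnings.simplefilter("always", UnofficialIndexWarning)`.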

Separately from that, it would be nice to get our data 'checked and approved' so that we can fix all the things necessary and then maybe become an official index, but that is a secondary goal IMO.

Please feel free to ping me if you find some cycles. I'd be happy to help with this if that would be useful.

esgf2-us / intake-esgf

Proposal: Add Pangeo/ESGF Cloud Zarr data as search-index. #44
