fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License

gridftp #242

Open tinaok opened 1 year ago

tinaok commented 1 year ago

Hello,

We are trying to use a small subset of CMIP6 data from an ESGF server. The server exposes its NetCDF files in several ways:

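from pyesgf.search import SearchConnection

# Connect to the DKRZ index node; distrib=True also queries the federated nodes.
server = 'https://esgf-data.dkrz.de/esg-search'
conn = SearchConnection(server, distrib=True)

# Facets describing the dataset of interest.
source_id = 'CMCC-CM2-HR4'
activity_id = 'OMIP'  # defined for reference; not passed to the search below
experiment_id = 'omip2'
variable_id = 'vmo'

ctx = conn.new_context(
    project='CMIP6',
    source_id=source_id,
    experiment_id=experiment_id,
    variable=variable_id,
    frequency='mon',
)

# Take the first matching dataset and list its files.
result = ctx.search()[0]
files = result.file_context().search()

# Access URLs of one file, keyed by access method.
files[35].urls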

Which gives

defaultdict(list,
            {'HTTPServer': [('http://esgf-node2.cmcc.it/thredds/fileServer/esg_dataroot/CMIP6/OMIP/CMCC/CMCC-CM2-HR4/omip2/r1i1p1f1/Omon/vmo/gn/v20200226/vmo_Omon_CMCC-CM2-HR4_omip2_r1i1p1f1_gn_200801-201812.nc',
               'application/netcdf')],
             'GridFTP': [('gsiftp://esgf-node2.cmcc.it:2811//esg_dataroot/CMIP6/OMIP/CMCC/CMCC-CM2-HR4/omip2/r1i1p1f1/Omon/vmo/gn/v20200226/vmo_Omon_CMCC-CM2-HR4_omip2_r1i1p1f1_gn_200801-201812.nc',
               'application/gridftp')],
             'OPENDAP': [('http://esgf-node2.cmcc.it/thredds/dodsC/esg_dataroot/CMIP6/OMIP/CMCC/CMCC-CM2-HR4/omip2/r1i1p1f1/Omon/vmo/gn/v20200226/vmo_Omon_CMCC-CM2-HR4_omip2_r1i1p1f1_gn_200801-201812.nc.html',
               'application/opendap-html')],
             'Globus': [('globus:4101e3a0-b7df-11eb-a16a-5fad80e6400b/esg_dataroot/CMIP6/OMIP/CMCC/CMCC-CM2-HR4/omip2/r1i1p1f1/Omon/vmo/gn/v20200226/vmo_Omon_CMCC-CM2-HR4_omip2_r1i1p1f1_gn_200801-201812.nc',
               'Globus')]})

We only need a small subset of the NetCDF file, and I would like to build a kerchunk catalogue for it. I can use the HTTPServer link to generate the kerchunk catalogue, but out of curiosity: can kerchunk also handle FTP, OPeNDAP, or GridFTP?
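To make the HTTPServer route concrete, here is a minimal sketch of picking usable links out of a `urls` mapping shaped like the output above. The helper names `pick_http_url` and `opendap_data_url` and the `example.org` URLs are made up for illustration and are not part of pyesgf or kerchunk. The sketch assumes that only http(s) links are readable through a ready-made fsspec backend, and that the OPeNDAP entry is the browser form page whose data endpoint is the same URL without the trailing `.html`.

```python
from urllib.parse import urlsplit

# Assumption: only these schemes can be read without writing a new backend.
HTTP_SCHEMES = {"http", "https"}


def pick_http_url(urls):
    """Return the first HTTPServer URL with an http(s) scheme, or None."""
    for url, _mime in urls.get("HTTPServer", []):
        if urlsplit(url).scheme in HTTP_SCHEMES:
            return url
    return None


def opendap_data_url(urls):
    """Return the first OPeNDAP URL with a trailing '.html' removed, or None."""
    for url, _mime in urls.get("OPENDAP", []):
        return url[: -len(".html")] if url.endswith(".html") else url
    return None


# Shortened, made-up stand-in for the `files[35].urls` mapping shown above.
example_urls = {
    "HTTPServer": [("http://example.org/thredds/fileServer/data/vmo.nc",
                    "application/netcdf")],
    "GridFTP": [("gsiftp://example.org:2811//data/vmo.nc",
                 "application/gridftp")],
    "OPENDAP": [("http://example.org/thredds/dodsC/data/vmo.nc.html",
                 "application/opendap-html")],
}

http_url = pick_http_url(example_urls)
dap_url = opendap_data_url(example_urls)
```

The selected `http_url` is the link that would then be handed to kerchunk. The `gsiftp://` and `globus:` entries are deliberately ignored here, since reading them would need a backend of their own.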

martindurant commented 1 year ago
annefou commented 1 year ago

I agree with Tina that GridFTP support would be very nice. GridFTP is well known in some projects (such as the Large Hadron Collider or ESGF) and is one of the few high-performance data transfer tools.

martindurant commented 1 year ago

I have had a brief look around and found one example of a Python GridFTP client, which is very old. Presumably an fsspec backend could be built on top of it, which would enable kerchunk and other remote access from Python. However, most of what I find refers specifically to Globus rather than to general GridFTP, in which case https://globus-sdk-python.readthedocs.io/en/stable/ presumably provides everything needed (it looks very complicated!). In any case, the fsspec backend would require development, sorry.

Is anyone using IPFS or other similar technologies? IPFS does already have an fsspec implementation (ipfsspec).