fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License

Loading xarray from a kerchunk'ed catalogue may sometimes get NaN from unstable http access? #253

Closed: tinaok closed this issue 1 year ago

tinaok commented 1 year ago

I'm trying to use CMIP6 NetCDF data on the fly, from an HTTP server at ESGF.

They expose data via OPeNDAP, download_url, and GridFTP.
There are some limitations with the OPeNDAP service, so I'm trying to use kerchunk to open the data through download_url.
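
Roughly what I am doing (the URL here is a placeholder for a real ESGF download_url):

```python
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

# Placeholder for a real ESGF download_url
url = "https://esgf.example/thredds/fileServer/CMIP6/tas.nc"

# Scan the remote NetCDF4/HDF5 file once and build a set of references
with fsspec.open(url) as f:
    refs = SingleHdf5ToZarr(f, url).translate()

# Open the references as a zarr store; chunks are fetched over http lazily
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": refs, "remote_protocol": "https"},
    },
)
```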

Here is the notebook showing the problem.

Kerchunk works perfectly, and xarray recognises the file with its chunks. But when multiple dask workers start accessing the data, some data are missing, and on each try it is different data that go missing.

I downloaded the NetCDF files in question to a local S3 bucket and did the same operation; no data were missing.

My guess is that the server is not responding from time to time (but I do not know).

I would like to keep accessing the data through kerchunk directly from the server, and avoid running wget on every file we will analyse.

Is there any way to make this access reliable?

Thank you for your help

martindurant commented 1 year ago

Which exact version of fsspec do you have?

There are a couple of subtle things interacting here. Yes, zarr interprets not-found to mean that the key doesn't exist, and so fills the chunk with the fill value, NaN by default. There is nothing that xarray can do about this; it is handed an already-complete array by zarr.
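
To illustrate the fill behaviour with a plain dict as the store (a minimal sketch, assuming zarr v2 chunk-key layout):

```python
import numpy as np
import zarr

store = {}  # in-memory stand-in for the reference filesystem mapper
z = zarr.open(store, mode="w", shape=(4,), chunks=(2,), dtype="f8",
              fill_value=np.nan)
z[:] = 1.0

del store["0"]  # simulate a chunk the backend failed to return

# zarr treats the missing key as "chunk does not exist" and fills it
print(zarr.open(store)[:])  # [nan nan  1.  1.]
```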

The referenceFS knows, though, which keys ought to exist. Recently, we made a change to ensure that, if the final backend (http here) reports an exception, it gets back to zarr and is then raised. If you have the latest version and this isn't happening, then either the change doesn't work, or the error from http is already FileNotFound-like. It would be useful to create the filesystem instance directly (fsspec.filesystem("reference", fo=info_http)) and call fs.cat on a bunch of keys, to see the kinds of errors that sometimes come back.
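
Something like this (a sketch; info_http stands for your reference JSON, and on_error="return" collects exceptions instead of raising them):

```python
import fsspec

info_http = "references.json"  # placeholder: your kerchunk reference file or dict

fs = fsspec.filesystem("reference", fo=info_http, remote_protocol="https")

# Probe a sample of chunk keys; failed fetches come back as exception objects
keys = fs.find("")[:50]
results = fs.cat(keys, on_error="return")
for key, value in results.items():
    if isinstance(value, BaseException):
        print(key, "->", type(value).__name__, value)
```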

We should maybe add an exception in referenceFS which explicitly says "this key is known in the set of references, but failed to load", wrapping whatever exception the backend returned.
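
Something along these lines, say (the name and details are only a sketch, not existing code):

```python
class ReferenceNotReachable(RuntimeError):
    """Sketch: the key exists in the reference set, but fetching it failed."""

    def __init__(self, reference, target):
        self.reference = reference  # the key known in the references
        self.target = target        # the original exception from the backend
        super().__init__(
            f"Reference {reference!r} is known but failed to load: {target!r}"
        )
```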

As for retries, the http backend does not have them. The underlying aiohttp client will retry some specific problems, but if the server does respond and returns a 500 code, say, then that is the end of it. Since both s3fs and gcsfs have explicit retry logic, adding it here would make sense. The only method that needs it, from xarray/kerchunk's point of view, is ._cat_file.
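
In the meantime, something similar could be done from user code; a rough sketch, subclassing the http filesystem to retry ._cat_file with backoff (the retry count and the set of retried errors here are arbitrary choices, not fsspec defaults):

```python
import asyncio

import aiohttp
from fsspec.implementations.http import HTTPFileSystem


class RetryingHTTPFileSystem(HTTPFileSystem):
    """Sketch: retry transient http failures in the one method kerchunk needs."""

    retries = 5

    async def _cat_file(self, url, start=None, end=None, **kwargs):
        for attempt in range(self.retries):
            try:
                return await super()._cat_file(url, start=start, end=end, **kwargs)
            except (aiohttp.ClientError, asyncio.TimeoutError):
                if attempt == self.retries - 1:
                    raise
                await asyncio.sleep(2 ** attempt)  # back off before retrying
```

Registering it with fsspec.register_implementation("http", RetryingHTTPFileSystem, clobber=True) (and similarly for "https") before opening the dataset would make the reference filesystem route its http fetches through it.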