fsspec / filesystem_spec

A specification that python filesystems should adhere to.

Opening remote datasets not lazy with fsspec #374


tjcrone commented 4 years ago

I'm finding that the following code reads all of the remote datasets into memory:

import fsspec
import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem(project='ldeo-glaciology')
filelist = fs.ls('gs://ldeo-glaciology/AMPS/wrf_d03_20161222_week-cf')

arrays = []
for filename in filelist:
    with fsspec.open('gs://' + filename, mode='rb') as openfile:
        arrays.append(xr.open_dataset(openfile, chunks={}))

Is this the expected behavior? Is there a way to open these files as a remote dataset without fully reading them into memory? Thanks!

martindurant commented 4 years ago

I believe chunks={} is supposed to avoid reading the data and instead build a graph for Dask, but I don't know exactly how it works internally. Certainly, because of the with block, the file will be closed as soon as the context exits, so the dataset would not be readable after that.
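Roughly, something like the following (an untested sketch, reusing fs, filelist and xr from the snippet above) would keep the files open so that lazy reads issued later still work; whether it also avoids the upfront reads depends on how the metadata is scanned:

# Keep the file handles alive instead of closing them in a `with` block,
# so that later lazy reads by xarray/Dask still have an open file to use.
openfiles = [fs.open(name, mode='rb') for name in filelist]
datasets = [xr.open_dataset(f, chunks={}) for f in openfiles]

# ... work with the datasets, then close the handles explicitly:
# for f in openfiles:
#     f.close()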

I guess these are netCDF/HDF5 files? Opening one may require reading a large fraction of the file just to find all the bits of metadata, which are not necessarily grouped nicely together at the head of the file. gcsfs caches with a read-ahead policy by default, so even reading small pieces of metadata can trigger larger reads (though these usually cost little time compared with the connection-establishment overhead).
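If the read-ahead cache is suspected of inflating those reads, the caching strategy can be changed when opening a file; for example (a sketch only, and the actual effect on bytes transferred would need to be confirmed with logging):

# Disable client-side caching (or pass a smaller block_size) so each
# metadata request fetches only the byte range actually asked for.
f = fs.open('gs://' + filelist[0], mode='rb', cache_type='none')
ds = xr.open_dataset(f, chunks={})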

I hope this helps, but it would be good to ask on the xarray tracker too, since I don't really know how such files work internally. Turning on logging in gcsfs to show the byte ranges that are being requested and actually fetched might paint a clearer picture.
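For reference, something like this enables that logging (assuming the standard 'gcsfs' logger name):

import logging

# Show gcsfs's debug messages, which include the ranged GET requests it issues.
logging.basicConfig()
logging.getLogger('gcsfs').setLevel(logging.DEBUG)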

rsignell-usgs commented 4 years ago

@tjcrone it seems you have a collection of NetCDF files that each contain one time step, and you would like to merge them all together, right? Usually one would use xr.open_mfdataset on a collection like this, but as Martin says, it takes quite a while to read the metadata, and then even longer for xarray to run all the checks to make sure the files are okay to aggregate.
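For comparison, the one-call form looks roughly like this (a sketch; passing open file objects rather than paths should work with an engine such as h5netcdf, but I have not tested it on these files):

# Open the whole collection in one call; xarray does the metadata reads and
# the consistency checks mentioned above, which is where the time goes.
openfiles = [fs.open(name, mode='rb') for name in filelist]
ds = xr.open_mfdataset(openfiles, engine='h5netcdf', combine='by_coords', chunks={})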

One solution is to avoid this altogether and use dask.delayed, as in cell [60] of this example notebook by @rabernat.
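A rough sketch of that delayed pattern (not the notebook itself; the 'Time' dimension name is a guess for WRF output, and fs/filelist come from the snippet in the question):

import dask
import xarray as xr

@dask.delayed
def load_one(path):
    # Each file is read in full, but only when the task graph executes.
    with fs.open(path, mode='rb') as f:
        return xr.open_dataset(f).load()

lazy = [load_one(name) for name in filelist]
datasets = dask.compute(*lazy)               # runs the reads in parallel
combined = xr.concat(datasets, dim='Time')   # concatenate along the time dimension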

A better solution, if you want to read the data more than once, would be to recast this collection as a Zarr dataset, appropriately chunked, perhaps using the new rechunker package (by @rabernat and @TomAugspurger).
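As a minimal sketch, assuming the files have been combined into a single dataset named combined (for example via the delayed approach above), writing it out as Zarr on GCS could look like this (bucket path and chunk sizes are illustrative; rechunker is the tool for rechunking a store that has already been written):

import gcsfs

fs = gcsfs.GCSFileSystem(project='ldeo-glaciology')
store = fs.get_mapper('gs://ldeo-glaciology/AMPS/wrf_d03_20161222_week.zarr')

# Pick chunks suited to the expected access pattern before writing.
combined.chunk({'Time': 24}).to_zarr(store, mode='w', consolidated=True)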

This type of conversion is the goal of the new Pangeo Forge project; there is an example here: https://github.com/pangeo-forge/pangeo-forge/blob/master/examples/oisst-avhrr-v02r01.py

tjcrone commented 4 years ago

Thank you for these helpful ideas @martindurant. I will do some logging with gcsfs to see if I can figure out what is going on. And yes, this topic might be a better fit for the Xarray tracker. Thanks for that suggestion.

tjcrone commented 4 years ago

@rsignell-usgs, thank you for these suggestions. Wrapping in delayed() is an interesting approach, but I would have thought that Xarray/Dask would handle a use case like this automagically. I will see if I can figure out why it does not work here. Rebuilding this dataset as Zarr using rechunker(!) is a fantastic idea. Thanks!