fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License

Serialising a File will also serialise the cache which can grow very large #1747

Closed: phofl closed this issue 4 days ago

phofl commented 2 weeks ago

We've run into this when using Xarray together with Dask. The default pattern currently looks like this:

import s3fs
import xarray as xr

# The bucket is public, so anonymous access is enough here
fs = s3fs.S3FileSystem(anon=True)

model = "ACCESS-CM2"
variable = "hurs"
data_dir = f"s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/{model}/historical/r1i1p1f1/{variable}/*.nc"
file_list = [f"s3://{path}" for path in fs.glob(data_dir)]

# Opening the files here
files = [fs.open(f) for f in file_list]

ds = xr.open_mfdataset(
    files,
    engine="h5netcdf",
    combine="nested",
    concat_dim="time",
    data_vars="minimal",
    coords="minimal",
    compat="override",
    parallel=True,
)

The files are accessed up front to read their metadata before they are serialised, and that initial access populates the cache with a lot of data.

This results in a very large cache for 4 of the 130 files, which is pretty bad when serialising things.
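
For illustration, one rough way to see this (after open_mfdataset has run and the caches are populated) is to pickle the opened file objects from the snippet above directly:

import pickle

# Each open file drags its populated cache along when pickled;
# printing the pickled sizes makes the offenders obvious.
for f in files:
    print(f.path, f"{len(pickle.dumps(f)) / 2**20:.2f} MiB")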

Serialising the cache doesn't seem like a great idea in general, and especially not for remote file systems. Was this decision intentional, or is it just something that hasn't been a problem so far?

Ideally, we could purge the cache before things are serialised, either in fsspec or in the inheriting libraries. Is this something you would consider? I'd be happy to put up a PR if there is agreement on the approach.
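
In the meantime, a manual workaround is to drop the cached bytes before anything gets pickled (a sketch, assuming the files are AbstractBufferedFile instances exposing blocksize, size and _fetch_range):

import fsspec.caching

# Replace each file's cache with a fresh BaseCache, which does no caching;
# subsequent reads simply fetch byte ranges from the store again.
for f in files:
    f.cache = fsspec.caching.BaseCache(f.blocksize, f._fetch_range, f.size)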

martindurant commented 2 weeks ago

You are right, AbstractBufferedFile could do with a __reduce__ method to control what gets serialised. Certainly the cache should not be serialised, and the file should probably be re-opened from scratch on each deserialise (perhaps with a seek back to the correct position). Write-mode files should not be serialisable at all.
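
Roughly something like this (a sketch only; the _reopen helper and the exact attributes used are illustrative, not a final implementation):

def _reopen(fs, path, mode, blocksize, loc):
    # Re-open the file from the (serialisable) filesystem and seek back to
    # where the original file was; the cache starts out empty again.
    f = fs.open(path, mode=mode, block_size=blocksize)
    f.seek(loc)
    return f


class AbstractBufferedFile:
    ...

    def __reduce__(self):
        if self.mode != "rb":
            raise ValueError("Only read-mode files can be serialised")
        return _reopen, (self.fs, self.path, self.mode, self.blocksize, self.loc)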

Note that OpenFile instances are designed exactly to encapsulate this kind of information without caches and other state - these are what should really be passed around.
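
For example, the same listing can be expressed as OpenFile instances, which pickle cheaply because the real file object (and its cache) only exists inside the with-block (a sketch; anon=True just reflects that the bucket is public):

import fsspec

ofs = fsspec.open_files(
    "s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/hurs/*.nc",
    mode="rb",
    anon=True,
)
with ofs[0] as f:
    header = f.read(1024)  # the underlying file is opened only here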

Additionally, I don't really understand exactly what you are pickling - the xarray object itself? I don't know that such a use is really anticipated.

phofl commented 2 weeks ago

> Additionally, I don't really understand exactly what you are pickling - the xarray object itself? I don't know that such a use is really anticipated.

No, not xarray objects themselves. Xarray extracts some metadata from the netCDF files using the opened file object, and that object is then put into the graph (this happens in open_mfdataset, fwiw). That's what causes the cache to be populated. I haven't dug through everything in the API yet, so I can't tell you exactly when and where this happens.
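
Roughly, the pattern looks like this (a simplified sketch, not xarray's actual code): each already-opened file object ends up embedded in a delayed task, so it is pickled together with whatever its cache holds when the graph is shipped to workers.

import dask
import xarray as xr

@dask.delayed
def open_one(f):
    return xr.open_dataset(f, engine="h5netcdf")

# 'files' (and their caches) become part of the task graph here
tasks = [open_one(f) for f in files]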

dcherian commented 1 week ago

> OpenFile instances

These don't appear to fit the open_mfdataset pattern in the OP.

martindurant commented 1 week ago

Would someone like to write a reasonable __reduce__ method for AbstractBufferedFile?

phofl commented 1 week ago

Yep, I'll take a look at this.

jrbourbeau commented 4 days ago

Btw, just following up here. If I add a small computation like

ds.hurs.mean(dim=["lon", "lat"]).compute()

to @phofl's original example, the graph is now much smaller with the changes in https://github.com/fsspec/filesystem_spec/pull/1753. With the latest fsspec release the graph is 213.65 MiB, and with the update in the PR it goes down to 12.40 MiB. That's still larger than I'd like, but I'll take the 17x reduction in size.
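
For reference, one rough way to get numbers like these (not necessarily how I measured them) is to pickle the task graph directly:

import cloudpickle

# Serialise the dask graph backing the reduction and report its size
graph = ds.hurs.mean(dim=["lon", "lat"]).__dask_graph__()
print(f"{len(cloudpickle.dumps(graph)) / 2**20:.2f} MiB")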