Closed: phofl closed this issue 4 days ago
You are right, AbstractBufferedFile could do with a `__reduce__` method to control what gets serialised. Certainly the cache should not be serialised, and the file should probably be re-opened from fresh on each deserialisation (perhaps with a seek back to the correct position). Write-mode files should not be serialisable at all.
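To make that concrete, here is a minimal sketch of what such a `__reduce__` could look like (an illustration only, not the change that eventually landed in fsspec; `_reopen` is a hypothetical helper):

```python
from fsspec.spec import AbstractBufferedFile

def _reopen(fs, path, mode, blocksize, loc):
    # Recreate the file on deserialisation and seek back to the old position.
    f = fs.open(path, mode=mode, block_size=blocksize)
    f.seek(loc)
    return f

def _reduce(self):
    # Refuse to pickle files open for writing; their buffered state cannot
    # be reproduced safely on another machine.
    if "r" not in self.mode:
        raise TypeError("Cannot pickle a file opened for writing")
    # Ship only what is needed to re-open the file; the cache is dropped.
    return _reopen, (self.fs, self.path, self.mode, self.blocksize, self.loc)

AbstractBufferedFile.__reduce__ = _reduce  # illustrative monkey-patch only
```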
Note that OpenFile instances are designed exactly to encapsulate this kind of information without caches and other state - these are what should really be passed around.
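For comparison, an OpenFile carries only the target filesystem, path and open options, so it stays small when pickled; a short sketch of that pattern (the bucket and path are made up):

```python
import pickle
import fsspec

# An OpenFile records how to open the file, not any buffered data, so it
# serialises cheaply and can safely be shipped around a graph.
of = fsspec.open("s3://some-bucket/data/file-0001.nc", mode="rb", anon=True)
print(len(pickle.dumps(of)))  # small: no cache or buffers attached

# The real file object only exists while the OpenFile is entered.
with of as f:
    header = f.read(8)
```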
Additionally, I don't really understand exactly what you are pickling - the xarray object itself? I don't know that such a use is really anticipated.
No, not xarray objects themselves. Xarray extracts some metadata from netCDF files via the opened file object, and that object is then put into the graph (this happens in open_mfdataset, fwiw). This causes the cache to be populated. I haven't dug through everything in the API yet, so I can't tell you exactly when and where this happens.
> OpenFile instances
These don't appear to fit the open_mfdataset pattern in the OP.
Would someone like to write a reasonable reduce function for AbstractBufferedFile?
Yep, I would take a look at this
Btw, just following up here. If I add a small computation like `ds.hurs.mean(dim=["lon", "lat"]).compute()` to @phofl's original example, the graph is now much smaller with the changes in https://github.com/fsspec/filesystem_spec/pull/1753. With the latest fsspec release the graph is 213.65 MiB, and with the update in the PR it goes down to 12.40 MiB. That's still larger than I'd like, but I'll take the 17x reduction in size.
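For context, one way to estimate the serialised graph size (an assumption on my part, not necessarily how the numbers above were produced) is to pickle the task graph directly:

```python
import cloudpickle
import dask.utils

# `ds` is the dataset from the original example in this issue.
result = ds.hurs.mean(dim=["lon", "lat"])

# Serialise the underlying task graph and report its size.
graph = dict(result.__dask_graph__())
print(dask.utils.format_bytes(len(cloudpickle.dumps(graph))))
```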
We've run into this when using Xarray together with Dask. The default way of calling this currently looks something like the following:
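(The exact original snippet is not reproduced here; the following is a representative sketch, with a made-up bucket/path and an assumed h5netcdf engine.)

```python
import fsspec
import xarray as xr

# Open remote netCDF files through fsspec and hand the resulting file
# objects to xarray; these objects later end up inside the dask graph.
urls = "s3://some-bucket/cmip6/*.nc"   # hypothetical location
files = [of.open() for of in fsspec.open_files(urls, anon=True)]
ds = xr.open_mfdataset(files, engine="h5netcdf")
```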
The files are accessed initially to extract the metadata before they are serialised, and this initial access populates the cache with a lot of data. This results in a very large cache for 4 of the 130 files, which is pretty bad when serialising things.
Serialising the cache doesn't seem like a great idea in general, and especially not for remote file systems. Was this decision made intentionally, or is it rather something that simply hasn't been a problem so far?
Ideally, we could purge the cache before things are serialised in fsspec / the inheriting libraries. Is this something you would consider? I'd be happy to put up a PR if there is agreement on this.
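For illustration, purging could be as simple as resetting the file's read cache to an empty one before pickling. This is only a sketch under the assumption that dropping cached bytes is acceptable; `purge_cache` is a hypothetical helper, not an existing fsspec API:

```python
from fsspec.caching import BaseCache

def purge_cache(f):
    # f is an AbstractBufferedFile; replace its cache with a fresh, empty
    # one of the base type so pickling does not drag cached bytes along.
    f.cache = BaseCache(blocksize=f.blocksize, fetcher=f._fetch_range, size=f.size)
    return f
```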