fsspec / adlfs

fsspec-compatible Azure Datalake and Azure Blob Storage access
BSD 3-Clause "New" or "Revised" License

Xarray Serialisation Issues reading NetCDF from AzureBlobFile #477

Open alex-rakowski opened 3 months ago

alex-rakowski commented 3 months ago

Trying to read a NetCDF file in xarray and running into serialisation issues.

The AzureBlobFile object contains a SimpleQueue, which is non-trivial to serialise. I suspect that fsspec should be handling the serialisation differently.

Simple Reproducer:

import fsspec
import xarray as xr
from distributed.protocol import serialize, ToPickle

# Credentials redacted; substitute real values.
storage_options = {'connection_string': '***', 'account_key': '***'}
fs = fsspec.filesystem('abfs', **storage_options)
url = "<CONTAINER_NAME>"
files = fs.ls(url)
ds = xr.open_dataset(
    fs.open(files[0], 'rb'),
    chunks={'x': 2000, 'y': 2000},
    engine='h5netcdf',
)
# Serialising the dask graph of the first variable triggers the error.
serialize(ToPickle(list(ds.variables.values())[0]._data.dask))

TomAugspurger commented 3 months ago

Can you post the full traceback? What object has a reference to the queue?

alex-rakowski commented 3 months ago

2024-06-13 12:48:57,917 - distributed.protocol.pickle - ERROR - Failed to serialize <ToPickle: HighLevelGraph with 2 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x31490b130>
 0. original-open_dataset-FSC-2bd87bcfc4ee55630c36125387cfd518
 1. open_dataset-FSC-2bd87bcfc4ee55630c36125387cfd518
>.
Traceback (most recent call last):
  File "/Users/arakowski/miniconda3/envs/pytorch-coiled/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 63, in dumps
    result = pickle.dumps(x, **dump_kwargs)
TypeError: cannot pickle 'weakref.ReferenceType' object

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/arakowski/miniconda3/envs/pytorch-coiled/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 68, in dumps
    pickler.dump(x)
TypeError: cannot pickle 'weakref.ReferenceType' object

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/arakowski/miniconda3/envs/pytorch-coiled/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 81, in dumps
    result = cloudpickle.dumps(x, **dump_kwargs)
  File "/Users/arakowski/miniconda3/envs/pytorch-coiled/lib/python3.10/site-packages/cloudpickle/cloudpickle.py", line 1479, in dumps
    cp.dump(obj)
  File "/Users/arakowski/miniconda3/envs/pytorch-coiled/lib/python3.10/site-packages/cloudpickle/cloudpickle.py", line 1245, in dump
    return super().dump(obj)
TypeError: cannot pickle 'weakref.ReferenceType' object

The unpicklable object is reported as 'weakref.ReferenceType' here, but it sometimes shows up as SimpleQueue instead when doing something more realistic with the dataset than the simple reproducer does.
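
For context, both object types fail with plain pickle regardless of adlfs. This may be why the reported type varies with whichever one the pickler reaches first. A minimal standard-library sketch (the Target class is a hypothetical stand-in, not anything from adlfs):

```python
import pickle
import queue
import weakref


class Target:
    """Hypothetical placeholder; only exists so a weak reference can point at it."""


target = Target()
candidates = {
    "SimpleQueue": queue.SimpleQueue(),
    "weakref": weakref.ref(target),
}

# Try to pickle each object and record the error message if it fails.
errors = {}
for name, obj in candidates.items():
    try:
        pickle.dumps(obj)
    except TypeError as exc:
        errors[name] = str(exc)

for name, message in errors.items():
    print(f"{name}: {message}")
```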

TomAugspurger commented 3 months ago

Thanks. We'll need to figure out which attributes of which objects aren't picklable. Some of these (like things from azure.storage.blob or azure.identity) might need to be pushed upstream. Others might need to be fixed here. Any research you can do here would be helpful.
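
One possible way to start that research is a rough sketch like the one below: it walks an object's `__dict__` and containers recursively, then reports the deepest attributes that fail `pickle.dumps`. The helper name `find_unpicklable` and the `FakeFile` class are hypothetical and not part of adlfs or fsspec. It does not look inside `__slots__`, and it ignores any custom `__getstate__` or `__reduce__`, so treat its output as a hint rather than a diagnosis.

```python
import pickle
import queue
import weakref


def find_unpicklable(obj, path="obj", max_depth=4, _seen=None):
    """Return (path, type name, error) tuples for the deepest unpicklable parts of obj."""
    _seen = set() if _seen is None else _seen
    if id(obj) in _seen or max_depth < 0:
        return []
    _seen.add(id(obj))

    try:
        pickle.dumps(obj)
        return []  # picklable, nothing to report
    except Exception as exc:
        error = f"{type(exc).__name__}: {exc}"

    # Work out which children to inspect: dict values, sequence items, or attributes.
    if isinstance(obj, dict):
        children = [(f"{path}[{key!r}]", value) for key, value in obj.items()]
    elif isinstance(obj, (list, tuple, set, frozenset)):
        children = [(f"{path}[{i}]", value) for i, value in enumerate(obj)]
    else:
        attrs = getattr(obj, "__dict__", None)
        if isinstance(attrs, dict):
            children = [(f"{path}.{name}", value) for name, value in attrs.items()]
        else:
            children = []

    found = []
    for child_path, child in children:
        found.extend(find_unpicklable(child, child_path, max_depth - 1, _seen))

    # If no child explains the failure, report this object itself.
    return found or [(path, type(obj).__name__, error)]


class FakeFile:
    """Hypothetical stand-in for a file object that holds unpicklable state."""

    def __init__(self):
        self.path = "container/file.nc"
        self.cache = {"q": queue.SimpleQueue()}
        self.owner = weakref.ref(self)


report = find_unpicklable(FakeFile(), path="f")
for attr_path, type_name, error in report:
    print(f"{attr_path} ({type_name}): {error}")
```

Running the same helper on the object returned by `fs.open(...)`, or on the filesystem instance itself, should narrow down whether the offending attribute lives in adlfs or in one of the Azure SDK objects it holds.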