fsspec / filesystem_spec

A specification that Python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License

[reference] Stack Overflow when reading referenced file #498


TomAugspurger commented 3 years ago

Probably an issue with fsspec.implementations.reference, but I'm not sure.

I'm trying this out on the Daymet Archive at https://azure.microsoft.com/en-us/services/open-datasets/catalog/daymet/.

I generate the offsets with

# file: gen.py
import adlfs
import fsspec_reference_maker.hdf
import json
from pathlib import Path

fs = adlfs.AzureBlobFileSystem(account_name="daymet")
files = fs.glob("daymetv3-raw/daymet_v3_tmax_*_hawaii.nc4")

blob_name = files[0]
url = 'az://' + blob_name

offsets_root = Path("offsets")
offsets_root.mkdir(exist_ok=True)
p = (offsets_root / Path(blob_name).name).with_suffix(".json")

if not p.exists():
    print(f"Creating reference for {blob_name}")
    # scan the HDF5 file, recording the byte offset and length of every chunk
    transformer = fsspec_reference_maker.hdf.Hdf5ToZarr(fs.open(blob_name), url, xarray=True)
    chunks = transformer.translate()
    with open(p, "w", encoding="utf-8") as f:
        json.dump(chunks, f)
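
(Editor's note: for readers unfamiliar with the output, translate() returns a flat mapping from zarr keys to either inline metadata or [url, offset, length] triples pointing into the original HDF5 file. An illustrative shape follows; the keys and byte offsets here are invented, not taken from the actual Daymet output.)

# Illustration only: made-up keys and offsets, showing the reference structure.
chunks = {
    ".zgroup": '{"zarr_format": 2}',                              # inline metadata
    "tmax/.zarray": '{"chunks": [1, 584, 284], "dtype": "<f4"}',  # abridged
    "tmax/0.0.0": ["az://daymetv3-raw/daymet_v3_tmax_1980_hawaii.nc4",
                   31280, 182016],                                # ranged-read target
}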

Works great. But when I try to read the file, the interpreter crashes with a fatal stack overflow:

# file: test.py
import fsspec
import xarray as xr

if __name__ == "__main__":
    p = "offsets/daymet_v3_tmax_1980_hawaii.json"
    mapper = fsspec.get_mapper(
        "reference://",
        references=str(p),
        target_protocol="az",
        target_options={"account_name": "daymet"},
    )
    dset = xr.open_zarr(mapper)
    print(dset)
$ python test.py
Fatal Python error: Cannot recover from stack overflow.
Python runtime state: initialized
...
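
(Editor's aside: an equivalent spelling of the get_mapper call above, my paraphrase rather than code from the issue, which makes it clearer that every chunk access becomes a cat_file call on the target "az" filesystem.)

import fsspec

# Paraphrase of the get_mapper("reference://", ...) call above: build the
# ReferenceFileSystem explicitly, then take a mapper over its root. Each key
# lookup resolves a [url, offset, length] reference into a ranged read
# against the "az" target filesystem.
fs = fsspec.filesystem(
    "reference",
    references="offsets/daymet_v3_tmax_1980_hawaii.json",
    target_protocol="az",
    target_options={"account_name": "daymet"},
)
mapper = fs.get_mapper("")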

Looking into the implementation, I see that cat_file recursively calls cat_file. Putting in some debug:

diff --git a/fsspec/implementations/reference.py b/fsspec/implementations/reference.py
index 5bfa6ae..8ef8d64 100644
--- a/fsspec/implementations/reference.py
+++ b/fsspec/implementations/reference.py
@@ -76,6 +76,8 @@ class ReferenceFileSystem(AsyncFileSystem):
         self.fs = fs

     async def _cat_file(self, path):
+        import inspect
+        print(path, len(inspect.stack(0)))
         path = self._strip_protocol(path)
         part = self.references[path]
         if isinstance(part, bytes):

I see

...
time/151 774
time/152 787
time/153 800
time/154 800
time/155 813
time/156 826
time/157 839
Fatal Python error: Cannot recover from stack overflow.
Python runtime state: initialized

Haven't looked any further than this.

martindurant commented 3 years ago

Any chance you can recreate this without Azure? I don't have an account to replicate it with.

TomAugspurger commented 3 years ago

Mmm, this dataset should be public. Let me make sure I wasn't accidentally using some keys.

TomAugspurger commented 3 years ago

Yeah, I was able to reproduce on another machine without credentials.

martindurant commented 3 years ago

Sorry, I thought specifying an "account_name" meant I would need an account. I can confirm the crash.

The cause is the lack of an async _cat_file in AzureBlobFileSystem. ReferenceFileSystem only works with async right now, and abfs is mostly async; but we implicitly require an async _cat_file(url, start, end), and this ought not to use open/seek, but should instead make a direct ranged call like https://github.com/dask/s3fs/blob/master/s3fs/core.py#L737 (note that start/end are encoded directly in the request headers).
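
(Editor's note: for illustration, a minimal sketch of the kind of method being asked for, not adlfs's actual code. service_client and split_path are assumed internals, and download_blob(offset=..., length=...) is the azure.storage.blob.aio call that maps a byte range onto a single ranged request.)

# Sketch only: an async _cat_file that fetches exactly the requested byte
# range in one call, instead of open/seek/read. `self.service_client` and
# `self.split_path` are assumptions about the filesystem's internals, not
# confirmed adlfs API.
async def _cat_file(self, path, start=None, end=None, **kwargs):
    container, blob = self.split_path(self._strip_protocol(path))
    # offset/length become a Range header on a single GET; with both None the
    # whole blob is fetched (this sketch ignores the start=None, end=set case)
    length = None if (start is None or end is None) else end - start
    bc = self.service_client.get_blob_client(container, blob)
    stream = await bc.download_blob(offset=start, length=length)
    return await stream.readall()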

martindurant commented 3 years ago

Presumably the hard crash happens because what would ordinarily be a plain RecursionError is raised inside the (C) event loop.

TomAugspurger commented 3 years ago

Thanks for that info. I'll see if I can make an async cat_file for adlfs.

martindurant commented 3 years ago

Note the leading underscore in the method name.
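
(Editor's note: the convention being referred to, paraphrased rather than quoted from fsspec: subclasses of AsyncFileSystem implement coroutines whose names carry a leading underscore, and the blocking versions are generated from them. A minimal sketch:)

from fsspec.asyn import AsyncFileSystem

class MyAsyncFS(AsyncFileSystem):
    # The coroutine carries the leading underscore; fsspec mirrors it to a
    # blocking cat_file automatically, and async callers such as
    # ReferenceFileSystem await _cat_file directly.
    async def _cat_file(self, path, start=None, end=None, **kwargs):
        ...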

martindurant commented 3 years ago

Is this fixed?