fsspec / filesystem_spec

A specification that Python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License

[reference] Stack Overflow when reading referenced file #498


TomAugspurger commented 3 years ago

Probably an issue with fsspec.implementations.reference, but I'm not sure.

I'm trying this out on the Daymet Archive at https://azure.microsoft.com/en-us/services/open-datasets/catalog/daymet/.

I generate the offsets with

# file: gen.py
import adlfs
import fsspec_reference_maker.hdf
import json
from pathlib import Path

fs = adlfs.AzureBlobFileSystem(account_name="daymet")
files = fs.glob("daymetv3-raw/daymet_v3_tmax_*_hawaii.nc4")

blob_name = files[0]
url = 'az://' + blob_name

offsets_root = Path("offsets")
offsets_root.mkdir(exist_ok=True)
p = (offsets_root / Path(blob_name).name).with_suffix(".json")

if not p.exists():
    print(f"Creating reference for {blob_name}")
    # scan the HDF5 file, recording the byte offset and length of every chunk
    transformer = fsspec_reference_maker.hdf.Hdf5ToZarr(fs.open(blob_name), url, xarray=True)
    chunks = transformer.translate()
    with open(p, "w", encoding="utf-8") as f:
        json.dump(chunks, f)
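
(Editor's note: for readers unfamiliar with the output, translate() returns a flat mapping from zarr keys to either inline metadata or [url, offset, length] triples pointing into the original HDF5 file. An illustrative shape follows; the keys and byte offsets here are invented, not taken from the actual Daymet output.)

# Illustration only: made-up keys and offsets, showing the reference structure.
chunks = {
    ".zgroup": '{"zarr_format": 2}',                              # inline metadata
    "tmax/.zarray": '{"chunks": [1, 584, 284], "dtype": "<f4"}',  # abridged
    "tmax/0.0.0": ["az://daymetv3-raw/daymet_v3_tmax_1980_hawaii.nc4",
                   31280, 182016],                                # ranged-read target
}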

Works great. But when I try to read the file, the interpreter crashes with a fatal stack overflow:

# file: test.py
import fsspec
import xarray as xr

if __name__ == "__main__":
    p = "offsets/daymet_v3_tmax_1980_hawaii.json"
    mapper = fsspec.get_mapper(
        "reference://",
        references=str(p),
        target_protocol="az",
        target_options={"account_name": "daymet"},
    )
    dset = xr.open_zarr(mapper)
    print(dset)
$ python test.py
Fatal Python error: Cannot recover from stack overflow.
Python runtime state: initialized
...
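
(Editor's aside: an equivalent spelling of the get_mapper call above, my paraphrase rather than code from the issue, which makes it clearer that every chunk access becomes a cat_file call on the target "az" filesystem.)

import fsspec

# Paraphrase of the get_mapper("reference://", ...) call above: build the
# ReferenceFileSystem explicitly, then take a mapper over its root. Each key
# lookup resolves a [url, offset, length] reference into a ranged read
# against the "az" target filesystem.
fs = fsspec.filesystem(
    "reference",
    references="offsets/daymet_v3_tmax_1980_hawaii.json",
    target_protocol="az",
    target_options={"account_name": "daymet"},
)
mapper = fs.get_mapper("")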

Looking into the implementation, I see that cat_file recursively calls cat_file. Putting in some debug:

diff --git a/fsspec/implementations/reference.py b/fsspec/implementations/reference.py
index 5bfa6ae..8ef8d64 100644
--- a/fsspec/implementations/reference.py
+++ b/fsspec/implementations/reference.py
@@ -76,6 +76,8 @@ class ReferenceFileSystem(AsyncFileSystem):
         self.fs = fs

     async def _cat_file(self, path):
+        import inspect
+        print(path, len(inspect.stack(0)))
         path = self._strip_protocol(path)
         part = self.references[path]
         if isinstance(part, bytes):

I see

...
time/151 774
time/152 787
time/153 800
time/154 800
time/155 813
time/156 826
time/157 839
Fatal Python error: Cannot recover from stack overflow.
Python runtime state: initialized

Haven't looked any further than this.

martindurant commented 3 years ago

Any chance you can recreate this without Azure? I don't have an account to replicate it with.

TomAugspurger commented 3 years ago

Mmm, this dataset should be public. Let me make sure I wasn't accidentally using some keys.

TomAugspurger commented 3 years ago

Yeah, I was able to reproduce on another machine without credentials.

martindurant commented 3 years ago

Sorry, I thought specifying an "account_name" meant I would need an account. I can confirm the crash.

The cause is the lack of an async _cat_file in AzureBlobFileSystem. ReferenceFileSystem only works with async right now, and abfs is mostly async; but we implicitly require an async _cat_file(url, start, end), and this ought not to use open/seek, but should instead make a direct ranged call like https://github.com/dask/s3fs/blob/master/s3fs/core.py#L737 (note that start/end are encoded directly in the request headers).
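
(Editor's note: for illustration, a minimal sketch of the kind of method being asked for, not adlfs's actual code. service_client and split_path are assumed internals, and download_blob(offset=..., length=...) is the azure.storage.blob.aio call that maps a byte range onto a single ranged request.)

# Sketch only: an async _cat_file that fetches exactly the requested byte
# range in one call, instead of open/seek/read. `self.service_client` and
# `self.split_path` are assumptions about the filesystem's internals, not
# confirmed adlfs API.
async def _cat_file(self, path, start=None, end=None, **kwargs):
    container, blob = self.split_path(self._strip_protocol(path))
    # offset/length become a Range header on a single GET; with both None the
    # whole blob is fetched (this sketch ignores the start=None, end=set case)
    length = None if (start is None or end is None) else end - start
    bc = self.service_client.get_blob_client(container, blob)
    stream = await bc.download_blob(offset=start, length=length)
    return await stream.readall()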

martindurant commented 3 years ago

Presumably the hard crash happens because what would ordinarily be a plain RecursionError is raised inside the (C) event loop.

TomAugspurger commented 3 years ago

Thanks for that info. I'll see if I can make an async cat_file for adlfs.

martindurant commented 3 years ago

Note the leading underscore in the method name.
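
(Editor's note: the convention being referred to, paraphrased rather than quoted from fsspec: subclasses of AsyncFileSystem implement coroutines whose names carry a leading underscore, and the blocking versions are generated from them. A minimal sketch:)

from fsspec.asyn import AsyncFileSystem

class MyAsyncFS(AsyncFileSystem):
    # The coroutine carries the leading underscore; fsspec mirrors it to a
    # blocking cat_file automatically, and async callers such as
    # ReferenceFileSystem await _cat_file directly.
    async def _cat_file(self, path, start=None, end=None, **kwargs):
        ...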

martindurant commented 3 years ago

Is this fixed?