fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License

BytesCache use of .fetcher might lead to a stall if original coroutine stalls? #1666

Open yarikoptic opened 3 weeks ago

yarikoptic commented 3 weeks ago

We have an issue

the "fuller" traceback is

```
dandi/pynwb_utils.py:206: in _get_pynwb_metadata
    with open_readable(path) as fp, h5py.File(fp, "r") as h5, NWBHDF5IO(
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/h5py/_hl/files.py:562: in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/h5py/_hl/files.py:235: in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
h5py/_objects.pyx:54: in h5py._objects.with_phil.wrapper
    ???
h5py/_objects.pyx:55: in h5py._objects.with_phil.wrapper
    ???
h5py/h5f.pyx:102: in h5py.h5f.open
    ???
h5py/h5fd.pyx:163: in h5py.h5fd.H5FD_fileobj_read
    ???
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/fsspec/spec.py:1915: in readinto
    data = self.read(out.nbytes)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/fsspec/implementations/http.py:598: in read
    return super().read(length)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/fsspec/spec.py:1897: in read
    out = self.cache._fetch(self.loc, self.loc + length)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/fsspec/caching.py:481: in _fetch
    self.cache = self.fetcher(start, bend)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/fsspec/asyn.py:118: in wrapper
    return sync(self.loop, func, *args, **kwargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/fsspec/asyn.py:91: in sync
    if event.wait(1):
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/threading.py:558: in wait
    signaled = self._cond.wait(timeout)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/threading.py:306: in wait
    gotit = waiter.acquire(True, timeout)
```

The relevant part of the traceback is

/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/fsspec/caching.py:481: in _fetch
    self.cache = self.fetcher(start, bend)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/fsspec/asyn.py:118: in wrapper
    return sync(self.loop, func, *args, **kwargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/fsspec/asyn.py:91: in sync
    if event.wait(1):
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/threading.py:558: in wait
    signaled = self._cond.wait(timeout)

Note that the signature of sync in asyn.py is

def sync(loop, func, *args, timeout=None, **kwargs):

and the code does not set any other timeout, so it keeps waiting until the end of time unless the original coroutine/thread signals the event:

    event = threading.Event()
    asyncio.run_coroutine_threadsafe(_runner(event, coro, result, timeout), loop)
    while True:
        # this loop allows the thread to get interrupted
        if event.wait(1):
            break

but I guess in our case the http layer does not time out promptly upon some odd connectivity issue? Are reasonable timeouts set there?
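To illustrate the behavior described above, here is a minimal stdlib-only sketch (not the fsspec implementation) of the `event.wait(1)` pattern: the caller wakes up every second so the thread stays interruptible, but if the worker never signals the event, the loop spins forever. The hypothetical `wait_for` helper below also shows how an optional deadline would provide an escape hatch.

```python
import threading
import time

def wait_for(event, deadline=None):
    """Wait for `event`; return True if signaled, False once `deadline` expires."""
    start = time.monotonic()
    while True:
        # wake up every second so the waiting thread can still be interrupted,
        # mirroring the event.wait(1) loop in fsspec.asyn.sync
        if event.wait(1):
            return True
        # without a deadline (as in sync() when timeout=None), there is no
        # way out of this loop if the worker stalls and never sets the event
        if deadline is not None and time.monotonic() - start >= deadline:
            return False

# a worker that "stalls" (never sets the event):
stalled = threading.Event()
print(wait_for(stalled, deadline=2))   # False: gave up after ~2 s

# a worker that finishes normally:
done = threading.Event()
threading.Timer(0.5, done.set).start()
print(wait_for(done, deadline=2))      # True: worker signaled in time
```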

martindurant commented 2 weeks ago

Are you suggesting that there should always be a timeout applied to sync() in general? The problem is that operations can genuinely take a very long time, such as uploading whole trees of files (one outer coroutine runs many internal ones). I don't know how, in general, we would be able to detect "hang" situations.

The backends all have various timeouts themselves. In this case, you would have to look into the aiohttp documentation to see which are possibly relevant to your situation, and pass them along with your storage_options.
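As a configuration sketch of that suggestion (the exact timeout values here are placeholders, not recommendations): HTTPFileSystem forwards `client_kwargs` to the underlying `aiohttp.ClientSession`, so an `aiohttp.ClientTimeout` passed that way should make a dead connection raise an error instead of stalling the coroutine that sync() is waiting on.

```python
import aiohttp
import fsspec

# Cap aiohttp's timeouts so a stalled transfer fails instead of hanging:
#   total     - hard cap on the whole request
#   connect   - max time to establish the connection
#   sock_read - max gap between bytes received
timeout = aiohttp.ClientTimeout(total=300, connect=30, sock_read=60)

# client_kwargs is passed through to aiohttp.ClientSession
fs = fsspec.filesystem("http", client_kwargs={"timeout": timeout})

# the same settings can travel as storage_options, e.g. with fsspec.open:
# of = fsspec.open("https://...", "rb", client_kwargs={"timeout": timeout})
```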