NeurodataWithoutBorders / lindi

Linked Data Interface (LINDI) - cloud-friendly access to NWB data
BSD 3-Clause "New" or "Revised" License

Temporary failure in name resolution #81

Closed vncntprvst closed 5 months ago

vncntprvst commented 5 months ago

Hi, I'm testing lindi, following this discussion. This is the code I'm running:

import pynwb
import lindi

# URL of the remote .nwb.lindi.json file
url = "https://lindi.neurosift.org/dandi/dandisets/000363/assets/21c622b7-6d8e-459b-98e8-b968a97a1585/nwb.lindi.json"

# Set up a local cache
local_cache = lindi.LocalCache(cache_dir='lindi_cache')

# Create the h5py-like client with cache
# client = lindi.LindiH5pyFile.from_lindi_file(url)
client = lindi.LindiH5pyFile.from_lindi_file(url, local_cache=local_cache)

# Open using pynwb
with pynwb.NWBHDF5IO(file=client, mode="r") as io:
    nwbfile = io.read()

    print(nwbfile)

    trials_df = nwbfile.trials.to_dataframe()
    units_df = nwbfile.units.to_dataframe()

It worked up to (and including) trials_df = nwbfile.trials.to_dataframe(). However, at units_df = nwbfile.units.to_dataframe(), I got this error:

Traceback (most recent call last):
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/socket.py", line 955, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/urllib3/connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/urllib3/connectionpool.py", line 491, in _make_request
    raise new_e
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/urllib3/connectionpool.py", line 467, in _make_request
    self._validate_conn(conn)
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1099, in _validate_conn
    conn.connect()
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/urllib3/connection.py", line 616, in connect
    self.sock = sock = self._new_conn()
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/urllib3/connection.py", line 205, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x748ecdad1660>: Failed to resolve 'dandiarchive.s3.amazonaws.com' ([Errno -3] Temporary failure in name resolution)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/requests/adapters.py", line 589, in send
    resp = conn.urlopen(
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/urllib3/connectionpool.py", line 847, in urlopen
    retries = retries.increment(
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='dandiarchive.s3.amazonaws.com', port=443): Max retries exceeded with url: /blobs/886/c43/886c4302-846a-4ef5-996a-6f02d6a81a5f?response-content-disposition=attachment%3B%20filename%3D%22sub-440956_ses-20190207T120657_behavior%2Becephys%2Bimage%2Bogen.nwb%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20240611%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20240611T140109Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=b3d9ee6212e1188568d787dfc3ae894dcede3a9fca10cdba28e42b1f8039bde1 (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x748ecdad1660>: Failed to resolve 'dandiarchive.s3.amazonaws.com' ([Errno -3] Temporary failure in name resolution)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/hdmf/utils.py", line 668, in func_call
    return func(args[0], **pargs)
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/hdmf/common/table.py", line 1225, in to_dataframe
    sel = self.__get_selection_as_dict(arg, df=True, **kwargs)
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/hdmf/common/table.py", line 1063, in __get_selection_as_dict
    ret[name] = col.get(arg, df=df, index=index, **kwargs)
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/hdmf/common/table.py", line 203, in get
    ret.append(self.__getitem_helper(i, **kwargs))
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/hdmf/common/table.py", line 172, in __getitem_helper
    end = self.data[arg]
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/lindi/LindiH5pyFile/LindiH5pyDataset.py", line 170, in __getitem__
    return self._get_item_for_zarr(self._zarr_array, args)
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/lindi/LindiH5pyFile/LindiH5pyDataset.py", line 219, in _get_item_for_zarr
    return decode_references(zarr_array[selection])
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/zarr/core.py", line 800, in __getitem__
    result = self.get_basic_selection(pure_selection, fields=fields)
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/zarr/core.py", line 926, in get_basic_selection
    return self._get_basic_selection_nd(selection=selection, out=out, fields=fields)
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/zarr/core.py", line 968, in _get_basic_selection_nd
    return self._get_selection(indexer=indexer, out=out, fields=fields)
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/zarr/core.py", line 1343, in _get_selection
    self._chunk_getitems(
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/zarr/core.py", line 2179, in _chunk_getitems
    cdatas = self.chunk_store.getitems(ckeys, contexts=contexts)
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/zarr/_storage/store.py", line 179, in getitems
    return {k: self[k] for k in keys if k in self}
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/zarr/_storage/store.py", line 179, in <dictcomp>
    return {k: self[k] for k in keys if k in self}
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/lindi/LindiH5pyFile/LindiReferenceFileSystemStore.py", line 147, in __getitem__
    val = _read_bytes_from_url_or_path(url, offset, length)
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/lindi/LindiH5pyFile/LindiReferenceFileSystemStore.py", line 259, in _read_bytes_from_url_or_path
    response = requests.get(url_resolved, headers=headers)
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/wanglab/mambaforge/envs/map_ephys/lib/python3.10/site-packages/requests/adapters.py", line 622, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='dandiarchive.s3.amazonaws.com', port=443): Max retries exceeded with url: /blobs/886/c43/886c4302-846a-4ef5-996a-6f02d6a81a5f?response-content-disposition=attachment%3B%20filename%3D%22sub-440956_ses-20190207T120657_behavior%2Becephys%2Bimage%2Bogen.nwb%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20240611%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20240611T140109Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=b3d9ee6212e1188568d787dfc3ae894dcede3a9fca10cdba28e42b1f8039bde1 (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x748ecdad1660>: Failed to resolve 'dandiarchive.s3.amazonaws.com' ([Errno -3] Temporary failure in name resolution)"))

I haven't tested this on other assets, so I'm not sure whether the issue is specific to this one.

magland commented 5 months ago

It worked for me, although it took a lot longer than I expected to load. I am going to investigate why.

I think your error was just a transient network failure. (I suppose we'll want to implement retries.)
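
Until then, something like this could paper over transient failures (a sketch only; with_retries is a hypothetical helper, not part of lindi's API, and the backoff parameters are arbitrary):

import time
import requests

def with_retries(fn, max_attempts=3, base_delay=1.0):
    # Hypothetical helper: retry a callable on transient network errors,
    # with exponential backoff between attempts.
    for attempt in range(max_attempts):
        try:
            return fn()
        except requests.exceptions.ConnectionError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

units_df = with_retries(lambda: nwbfile.units.to_dataframe())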

magland commented 5 months ago

The .lindi.json file is itself around 80 MB, so it takes a bit of time to do the initial download.
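
If you're running this repeatedly, one workaround is to download the index once and open it from disk (a sketch, assuming from_lindi_file also accepts local paths, as it does for locally written .lindi.json files; url and local_cache are from the script above):

import urllib.request
import lindi

# Download the ~80 MB index file once and reuse it across sessions.
local_path = "nwb.lindi.json"  # arbitrary local filename
urllib.request.urlretrieve(url, local_path)

client = lindi.LindiH5pyFile.from_lindi_file(local_path, local_cache=local_cache)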

Then there are a very large number of objects in the file, but I'm surprised that it takes pynwb so long to load and process them. I'm taking a closer look...

And I would expect the units table to be very fast to load. Looking into it.

rly commented 5 months ago

It worked for me as well. The initial file open & read took about 30 seconds. The trials dataframe was fast. The units dataframe took another ~1.5 min.

When developing PyNWB/HDMF, we did not try to minimize the number of reads, especially when converting DynamicTable objects to pandas DataFrames, so there are likely to be inefficiencies there.
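
Until those read patterns are optimized, one way to cut down on remote reads is to select only the rows you need; DynamicTable slicing returns a DataFrame for just that selection (the bounds here are arbitrary):

# Read only the first 10 units instead of converting the whole table.
units_head = nwbfile.units[0:10]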

magland commented 5 months ago

Regarding the units table... I think to_dataframe() might not make a lot of sense in this context, because it may be trying to put all the spike times in there? Not sure, but I think that may be why it takes so long. The actual loading of data through lindi should be efficient, though.

rly commented 5 months ago

maybe it is trying to put all the spike times in there

Yeah, all data in the table is read immediately (as opposed to lazily) when converting to a pandas DataFrame.
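
A cheaper pattern for large tables is to index the ragged column by row, which reads only that unit's data (unit 0 here is just an example):

# Lazily read the spike times of a single unit.
spike_times_unit0 = nwbfile.units['spike_times'][0]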

oruebel commented 5 months ago

When developing PyNWB/HDMF, we did not try to minimize the number of reads, especially when converting DynamicTable

One specific example is reading spike_times from the units table, or more broadly, reading ragged array columns where values in a VectorData are read via a VectorIndex. Here is the related issue on hdmf-zarr that describes this specific problem in more detail: https://github.com/hdmf-dev/hdmf-zarr/issues/141. There is also a corresponding issue on nwb_benchmarks about adding this to our test suite: https://github.com/NeurodataWithoutBorders/nwb_benchmarks/issues/13
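
For anyone unfamiliar with that layout, here is a minimal numpy sketch of how a ragged column resolves: the VectorData holds all values concatenated, and the VectorIndex holds the end offset of each row, so row i spans index[i-1]:index[i]:

import numpy as np

# VectorData: spike times of all units, concatenated into one flat array.
data = np.array([0.1, 0.5, 0.9, 1.2, 2.0])
# VectorIndex: end offset of each unit's slice into `data`.
index = np.array([3, 5])  # unit 0 -> data[0:3], unit 1 -> data[3:5]

def get_row(i):
    start = 0 if i == 0 else index[i - 1]
    return data[start:index[i]]

print(get_row(0))  # [0.1 0.5 0.9]
print(get_row(1))  # [1.2 2. ]

Reading rows one at a time this way translates into many small reads against remote storage, which is the inefficiency the issues above track.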

vncntprvst commented 5 months ago

Thanks for all the feedback, and for developing this tool. I admittedly did not spend much time trying to debug this; I'm on a deadline... I'll definitely use it in my projects; it's really useful.