NeurodataWithoutBorders / nwbinspector

Tool to help inspect NWB files for compliance with NWB Best Practices
https://nwbinspector.readthedocs.io/

[Bug]: Can't read data (curl cannot perform request) #197

Closed CodyCBakerPhD closed 2 years ago

CodyCBakerPhD commented 2 years ago

What happened?

As I parse the remaining errors from the first attempt to inspect the entire archive, this issue recurs a number of times. One basic example is in the traceback section below, but I've also attached all report files where it occurs; it spans multiple objects.

There also seem to be some related issues such as ConstructErrors and KeyError: 'Unable to open object by token (unable to determine object type)'.

In each of these cases, I am streaming the data via ros3, which could very well be part of the problem. Interestingly, sometimes these curl errors occur outside of the nwbfile = io.read() step, which causes the dandiset to not be inspectable at all (I'm considering that a separate issue).
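
For reference, the ros3 streaming path in question boils down to something like this minimal sketch (the helper name open_streamed is mine, and it assumes an h5py/HDF5 build with the ros3 driver enabled):

```python
import h5py

def open_streamed(s3_url):
    """Open an HDF5 file over HTTP via the ros3 (read-only S3) driver.

    Requires h5py/HDF5 built with ros3 support; the curl OSError in this
    issue surfaces from libcurl through exactly this code path.
    """
    return h5py.File(s3_url, mode="r", driver="ros3")
```

PyNWB's NWBHDF5IO(path=..., mode="r", driver="ros3") ends up calling into the same h5py.File constructor, as the tracebacks below show.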

Anyway here is the full set of tracebacks available in the ERROR sections of each report...

000008.txt 000011.txt 000035.txt 000041.txt 000043.txt 000054.txt 000121.txt 000148.txt 000211.txt 000223.txt

Steps to Reproduce

No response

Traceback

0.0  <class 'OSError'>: Can't read data (curl cannot perform request)
---------------------------------------------------------------------

0.0.0  sub-mouse-GHNSF/sub-mouse-GHNSF_ses-20190828-sample-3_slice-20190828-slice-3_cell-20190828-sample-3_icephys.nwb: 'None' object with name 'None'
         Message: Traceback (most recent call last):
  File "/home/jovyan/GitHub/nwbinspector/nwbinspector/nwbinspector.py", line 476, in inspect_nwb
    nwbfile = io.read()
  File "/home/jovyan/my-conda-envs/nwbct_3_9/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 498, in read
    return call_docval_func(super().read, kwargs)
  File "/home/jovyan/my-conda-envs/nwbct_3_9/lib/python3.9/site-packages/hdmf/utils.py", line 424, in call_docval_func
    return func(*fargs, **fkwargs)
  File "/home/jovyan/my-conda-envs/nwbct_3_9/lib/python3.9/site-packages/hdmf/utils.py", line 583, in func_call
    return func(args[0], **pargs)
  File "/home/jovyan/my-conda-envs/nwbct_3_9/lib/python3.9/site-packages/hdmf/backends/io.py", line 37, in read
    f_builder = self.read_builder()
  File "/home/jovyan/my-conda-envs/nwbct_3_9/lib/python3.9/site-packages/hdmf/utils.py", line 583, in func_call
    return func(args[0], **pargs)
  File "/home/jovyan/my-conda-envs/nwbct_3_9/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 515, in read_builder
    f_builder = self.__read_group(self.__file, ROOT_NAME, ignore=ignore)
  File "/home/jovyan/my-conda-envs/nwbct_3_9/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 656, in __read_group
    builder = read_method(sub_h5obj)
  File "/home/jovyan/my-conda-envs/nwbct_3_9/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 656, in __read_group
    builder = read_method(sub_h5obj)
  File "/home/jovyan/my-conda-envs/nwbct_3_9/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 656, in __read_group
    builder = read_method(sub_h5obj)
  File "/home/jovyan/my-conda-envs/nwbct_3_9/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 683, in __read_dataset
    scalar = h5obj[()]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/jovyan/my-conda-envs/nwbct_3_9/lib/python3.9/site-packages/h5py/_hl/dataset.py", line 775, in __getitem__
    self.id.read(mspace, fspace, arr, mtype)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 192, in h5py.h5d.DatasetID.read
  File "h5py/_proxy.pyx", line 112, in h5py._proxy.dset_rw
OSError: Can't read data (curl cannot perform request)

Operating System

Linux

Python Executable

Python

Python Version

3.9

Package Versions

No response

CodyCBakerPhD commented 2 years ago

@bendichter This may be a general streaming issue present for particular NWBFiles, so maybe we should transfer this issue to PyNWB, what do you think?

bendichter commented 2 years ago

Is this error repeatable for the same file? If so, can you put together a minimal example?

bendichter commented 2 years ago

It's hard to tell at this point if this issue is being caused by a bug in DANDI, S3, h5py, ros3, or PyNWB

CodyCBakerPhD commented 2 years ago

It's hard to tell at this point if this issue is being caused by a bug in DANDI, S3, h5py, ros3, or PyNWB

Exactly, I'll try downloading a local file from there and see

CodyCBakerPhD commented 2 years ago

Yeah IDK then, this was a reproducible error when running the archive-wide script on the Hub, but isolated and local attempts to reproduce behave fine.

Not a problem with the file, PyNWB reads a local copy just fine. Not a problem with streaming, I was able to open that particular file on DANDI just fine. Running a local stream inspection on that dandiset ID and it is so far running fine.

Maybe it's a parallelization issue, like multiple jobs trying to access assets from the same dandiset too quickly? Either way, deprioritizing this thing but keeping it in the back of our minds if we see it happen again.

bendichter commented 2 years ago

maybe bring up with the DANDI team?

CodyCBakerPhD commented 2 years ago

maybe bring up with the DANDI team?

I will when/if I can find a nice concrete way of reproducing it

anhknguyen96 commented 2 years ago

Not sure if this is a similar issue, but I inspected dandiset 000003 using inspect_all (code shown below)

from nwbinspector import inspect_all
from nwbinspector.register_checks import Importance

dandiset_id = "000003"  # for example, 000004
dandiset_type = "draft"  # or "published", if it has an official doi associated

message = list(
    inspect_all(
        path=dandiset_id,
        importance_threshold=Importance.BEST_PRACTICE_VIOLATION,
        progress_bar=True,
        stream=True,
        version_id="draft",
    )
)

After running for more than 2 hours (with 11/101 files inspected, is that normal?), it crashed with an OSError: Unable to open file (curl cannot perform request).

Attached is the report from the unfinished inspection, if it helps. I'd appreciate any thoughts on this. report_000003.csv

OSError                                   Traceback (most recent call last)
Input In [8], in <cell line: 15>()
      6 dandiset_type = "draft"  # or "published", if it has an official doi associated
      8 # messages = []
      9 # with DandiAPIClient() as client:
     10 #     dandiset = client.get_dandiset(dandiset_id, dandiset_type)
     11 #     for asset in dandiset.get_assets():
     12 #         s3_url = asset.get_content_url(follow_redirects=1, strip_query=True)
     13 #         messages.extend(list(inspect_nwb(nwbfile_path=s3_url, driver="ros3",importance_threshold=Importance.BEST_PRACTICE_VIOLATION)))
---> 15 message = list(inspect_all(path=dandiset_id,importance_threshold=Importance.BEST_PRACTICE_VIOLATION,progress_bar=True,stream=True,version_id='draft'))

File ~/.conda/envs/nwb_explorer/lib/python3.9/site-packages/nwbinspector/nwbinspector.py:394, in inspect_all(path, modules, config, ignore, select, importance_threshold, n_jobs, skip_validate, progress_bar, progress_bar_options, stream, version_id)
    392 else:
    393     for nwbfile_path in nwbfiles_iterable:
--> 394         for message in inspect_nwb(nwbfile_path=nwbfile_path, checks=checks, driver=driver):
    395             if stream:
    396                 message.file_path = nwbfiles[message.file_path]

File ~/.conda/envs/nwb_explorer/lib/python3.9/site-packages/nwbinspector/nwbinspector.py:457, in inspect_nwb(nwbfile_path, checks, config, ignore, select, importance_threshold, driver, skip_validate)
    455 filterwarnings(action="ignore", message="No cached namespaces found in .*")
    456 filterwarnings(action="ignore", message="Ignoring cached namespace .*")
--> 457 with pynwb.NWBHDF5IO(path=nwbfile_path, mode="r", load_namespaces=True, driver=driver) as io:
    458     if not skip_validate:
    459         validation_errors = pynwb.validate(io=io)

File ~/.conda/envs/nwb_explorer/lib/python3.9/site-packages/hdmf/utils.py:593, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    591 def func_call(*args, **kwargs):
    592     pargs = _check_args(args, kwargs)
--> 593     return func(args[0], **pargs)

File ~/.conda/envs/nwb_explorer/lib/python3.9/site-packages/pynwb/__init__.py:246, in NWBHDF5IO.__init__(self, **kwargs)
    244     elif manager is None:
    245         manager = get_manager()
--> 246 super(NWBHDF5IO, self).__init__(path, manager=manager, mode=mode, file=file_obj, comm=comm, driver=driver)

File ~/.conda/envs/nwb_explorer/lib/python3.9/site-packages/hdmf/utils.py:593, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    591 def func_call(*args, **kwargs):
    592     pargs = _check_args(args, kwargs)
--> 593     return func(args[0], **pargs)

File ~/.conda/envs/nwb_explorer/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py:81, in HDF5IO.__init__(self, **kwargs)
     79 self.__mode = mode
     80 self.__file = file_obj
---> 81 super().__init__(manager, source=path)
     82 self.__built = dict()       # keep track of each builder for each dataset/group/link for each file
     83 self.__read = dict()        # keep track of which files have been read. Key is the filename value is the builder

File ~/.conda/envs/nwb_explorer/lib/python3.9/site-packages/hdmf/utils.py:593, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    591 def func_call(*args, **kwargs):
    592     pargs = _check_args(args, kwargs)
--> 593     return func(args[0], **pargs)

File ~/.conda/envs/nwb_explorer/lib/python3.9/site-packages/hdmf/backends/io.py:23, in HDMFIO.__init__(self, **kwargs)
     21 self.__built = dict()
     22 self.__source = source
---> 23 self.open()

File ~/.conda/envs/nwb_explorer/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py:725, in HDF5IO.open(self)
    722 if self.driver is not None:
    723     kwargs.update(driver=self.driver)
--> 725 self.__file = File(self.source, open_flag, **kwargs)

File ~/.conda/envs/nwb_explorer/lib/python3.9/site-packages/h5py/_hl/files.py:507, in File.__init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, fs_page_size, page_buf_size, min_meta_keep, min_raw_keep, locking, **kwds)
    502     fapl = make_fapl(driver, libver, rdcc_nslots, rdcc_nbytes, rdcc_w0,
    503                      locking, page_buf_size, min_meta_keep, min_raw_keep, **kwds)
    504     fcpl = make_fcpl(track_order=track_order, fs_strategy=fs_strategy,
    505                      fs_persist=fs_persist, fs_threshold=fs_threshold,
    506                      fs_page_size=fs_page_size)
--> 507     fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
    509 if isinstance(libver, tuple):
    510     self._libver = libver

File ~/.conda/envs/nwb_explorer/lib/python3.9/site-packages/h5py/_hl/files.py:220, in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
    218     if swmr and swmr_support:
    219         flags |= h5f.ACC_SWMR_READ
--> 220     fid = h5f.open(name, flags, fapl=fapl)
    221 elif mode == 'r+':
    222     fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

File h5py/_objects.pyx:54, in h5py._objects.with_phil.wrapper()

File h5py/_objects.pyx:55, in h5py._objects.with_phil.wrapper()

File h5py/h5f.pyx:106, in h5py.h5f.open()

OSError: Unable to open file (curl cannot perform request)
CodyCBakerPhD commented 2 years ago

After running for more than 2 hours (with 11/101 files inspected, is that normal?),

Definitely not normal for that DANDI set (I recall running it through streaming within 10 minutes or so, granted that was on the DANDIHub).

Good to know I'm not the only one who has run into this before. The fact that you made it 11 files in means the streaming and S3 resolution were working for some files, certainly... but as to why it occasionally stops for other files in the same location, I still don't have an answer beyond speculation that it has something to do with the internet connection between the streaming device and the S3 bucket.

anhknguyen96 commented 2 years ago

I still don't have an answer beyond speculation that it has something to do with the internet connection between the streaming device and the S3 bucket

That's what I thought too. I'm not familiar with streaming via S3 so it's a shame if that's the case.

Definitely not normal for that DANDI set (I recall running it through streaming within 10 minutes or so, granted that was on the DANDIHub).

So from your experience, inspecting dandisets via streaming is done relatively quickly for all dandisets?

CodyCBakerPhD commented 2 years ago

So from your experience, inspecting dandisets via streaming is done relatively quickly for all dandisets?

Not all, no, but I'd say most. Depends on how many files are in each set; the slowest tend to be those sets out there with hundreds of files in them. Also depends on how many objects are in each file, and how many checks end up getting triggered in each file (most checks in the NWB Inspector have early-exit logic so they take almost no time to scan if they aren't initially triggered).
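
That early-exit behavior can be illustrated with a hedged sketch (this is the pattern, not the actual nwbinspector check API): a check inspects cheap metadata first and returns nothing as soon as the triggering condition is absent, so untriggered checks cost almost no time.

```python
def check_timestamps_ascending(timestamps):
    """Illustrative check: flag timestamps that are not in ascending order."""
    if timestamps is None or len(timestamps) < 2:
        return None  # early exit: nothing to check
    if all(b >= a for a, b in zip(timestamps, timestamps[1:])):
        return None  # early exit: best practice satisfied, no message emitted
    return "Timestamps are not ascending."
```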

That's what I thought too. I'm not familiar with streaming via S3 so it's a shame if that's the case.

Would be worth trying again tomorrow or something.

In general I've never had problems with streaming a single NWB File via ros3, which I think is the more general use case. The NWB Inspector scanning of an entire DANDISet might be an edge case they didn't really expect to have to handle in terms of the number/frequency of curl requests to so many assets.

anhknguyen96 commented 2 years ago

Would be worth trying again tomorrow or something.

I've tried 3 times now, at different times and on different days. The first time it crashed after some time but I didn't document it, the second time I had to terminate the process after 1h, and this is the third instance.

The NWB Inspector scanning of an entire DANDISet might be an edge case they didn't really expect to have to handle in terms of the number/frequency of curl requests to so many assets.

By they you mean the NWB Inspector? I suppose it works by accessing files sequentially and not in parallel?

anhknguyen96 commented 2 years ago

I inspected another dandiset (000059), which has 54 files and the process took more than 3 hours. The last file in the set encountered the <class 'OSError'>: Can't read data (curl cannot perform request) error (report attached). This dandiset isn't in the list of the dandisets that @CodyCBakerPhD inspected which had the same error. Did you run this dandiset previously? @CodyCBakerPhD

I'm not sure I understand how this could be a connection/streaming issue, given that the streamed files are accessed sequentially and my device can still perform the inspection on several other dandisets (albeit very slowly; I haven't tried any with as many files as 000003).

report_000059.csv

CodyCBakerPhD commented 2 years ago

I inspected another dandiset (000059), which has 54 files and the process took more than 3 hours.

Ah, worth mentioning that when I ran all these I did so on the DANDIHub (which has a direct connection to the archive), on an extra-large spawn (I think I used 20+ CPUs), combined with the nice parallelization of the NWB Inspector (nwbinspector --n-jobs XX), so it took much less time than that.

Did you run this dandiset previously?

Indeed! And without issue. Here is our original report. 000059_successful_run.txt

I'm not sure I understand why possibly this is a connection/streaming issue, assuming that the streamed files are accessed sequentially, and my device can still perform the inspection on several other dandisets (albeit very slowly and I haven't tried any that has as many files as 000003).

What a strange problem. Maybe we'll consult this with the DANDI team and they might know the underlying issue (faulty S3 resolution, perhaps?)

Any ideas @yarikoptic @jwodder?

The TL;DR is we're trying to open a large number of NWBFiles iteratively via ros3 driver to examine metadata content of the files hosted on the DANDI Archive, but occasionally (or rather, with somewhat high probability) encounter OSError: Can't read data (curl cannot perform request) from h5py.

anhknguyen96 commented 2 years ago

Indeed! And without issue. Here is our original report.

Adding to the list (my version) of unsuccessful NWB Inspector runs: dandiset 000117. It managed to run 157/197 files and then crashed with the same OSError: Can't read data (curl cannot perform request) error. Unfortunately, I used inspect_all, so reports of the successfully inspected files weren't saved.

jwodder commented 2 years ago

Just to be clear, the URLs that you're trying to read are all S3 URLs, correct? I believe S3 recommends retrying any 5xx or connection errors up to about 10 or 12 times (can't find the reference right now).

anhknguyen96 commented 2 years ago

the URLs that you're trying to read are all S3 URLs, correct

yes

S3 recommends retrying any 5xx or connection errors up to about 10 or 12 times

I'm not sure I follow, would you please elaborate?

jwodder commented 2 years ago

@anhknguyen96 If an HTTP request to an S3 URL fails due to either a connection error (e.g., read timeout) or because the server returned a response with an error code in the 500-600 range, Amazon recommends retrying the request with an exponentially-increasing delay between retries; here's the best reference for that I could find at the moment. The official Python AWS SDK, boto3, does retries automatically, but if you're building your own S3 client, you'll need to add in retry behavior yourself.
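
A minimal sketch of that retry pattern (the function name, attempt count, and delays here are illustrative, not AWS's or boto3's exact configuration):

```python
import random
import time

def with_retries(operation, max_attempts=10, base_delay=0.1, max_delay=20.0):
    """Call `operation`, retrying on OSError with exponential backoff.

    Assumed parameters: roughly 10 attempts, with an exponentially
    increasing, jittered delay between them, capped at max_delay seconds.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the original error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads retries
```

In nwbinspector terms, something like this could wrap the io.read() call (and any other dataset access) whenever driver="ros3" is used.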

CodyCBakerPhD commented 2 years ago

Thanks a bunch @jwodder I'll add the suggested exponential-delay retry whenever the ros3 driver is used.

anhknguyen96 commented 2 years ago

great, thanks all!

CodyCBakerPhD commented 2 years ago

Now finally evidence of this on a well-established file in the testing suite: https://github.com/NeurodataWithoutBorders/nwbinspector/runs/7188205359?check_suite_focus=true

I'll try to get this fix out in the next couple of days

CodyCBakerPhD commented 2 years ago

More evidence - https://github.com/NeurodataWithoutBorders/nwbinspector/runs/7282295969?check_suite_focus=true

This time, it occurred during CLI usage, and after nwbfile = io.read(), something I had wondered whether it was possible to occur. That means it read the initial file structure just fine, but while collecting messages it was unable to pull something from the file due to the curl issue... this may require a more sophisticated solution than the one I have been implementing.

CodyCBakerPhD commented 2 years ago

And again! https://github.com/NeurodataWithoutBorders/nwbinspector/runs/7282540229?check_suite_focus=true

Never seen this many all in sequence; I wonder if it has something to do with that CI being Python 3.10, or the version of h5py it uses?

bendichter commented 2 years ago

I don't think it has to do with the files or the code. I think it's a connection issue on the s3 side

yarikoptic commented 2 years ago

Was the retrying that was recommended/agreed upon implemented? Are these fresh failures happening with retrying in place, or is it still a "one in a bunch" failure?

CodyCBakerPhD commented 2 years ago

was the retrying that was recommended/agreed upon (https://github.com/NeurodataWithoutBorders/nwbinspector/issues/197#issuecomment-1171447729, https://github.com/NeurodataWithoutBorders/nwbinspector/issues/197#issuecomment-1171782303) implemented and these fresh failures are with retrying

PR #223 tries to fix it; my recent posts were merely evidence, outside of that PR, that it can occur not just within the nwbfile = io.read() step (which all previous evidence had pointed to) but also within the actual evaluation of the checks on the file.

or it is still "one in a bunch" fails?

Yes, also just me mundanely remarking that I'm seeing more of these issues (even in the CI) than I did previously for whatever reason.

oruebel commented 2 years ago

evidence outside of that PR that it can occur not just within the nwbfile = io.read() step (which all previous evidence had pointed to) but also within the actual evaluation of the checks on the file.

Yes, read failures (in particular with network files) can occur any time a direct read from the file happens. On nwbfile = io.read(), the file structure and attributes are read, but HDF5 datasets are not. For inspector checks, this means that if you access data from an h5py.Dataset object as part of a check, it needs to do a read from the file, which in the case of S3 requires one or more network requests.
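
A small local illustration of that laziness (a sketch against a temporary file; over ros3, the slicing step is where the network read, and hence any curl error, would occur):

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("acquisition/data", data=np.arange(10))

with h5py.File(path, "r") as f:
    dset = f["acquisition/data"]  # cheap: only object metadata is touched
    values = dset[:5]             # the actual dataset read happens here
```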