[Bug]: Cannot use parallel for checks

tuanpham96 commented 9 months ago

What happened?

I've been having issues with running the inspector in parallel.

Doing this would be fine

nwbinspector data/nwb --config dandi --report-file-path inspector-results.txt

But using --n-jobs would tend to end in error

nwbinspector data/nwb --config dandi --report-file-path inspector-results.txt --n-jobs 8

The traceback is below:

concurrent.futures.process._RemoteTraceback:                                                                                
"""
Traceback (most recent call last):
  File "/users/lab/anaconda/nwb/lib/python3.11/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/lab/anaconda/nwb/lib/python3.11/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'register_check.<locals>.register_check_and_auto_parse.<locals>.auto_parse_some_output'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/users/lab/anaconda/nwb/bin/nwbinspector", line 8, in <module>
    sys.exit(inspect_all_cli())
             ^^^^^^^^^^^^^^^^^
  File "/users/lab/anaconda/nwb/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/lab/anaconda/nwb/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/users/lab/anaconda/nwb/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/lab/anaconda/nwb/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/lab/anaconda/nwb/lib/python3.11/site-packages/nwbinspector/nwbinspector.py", line 280, in inspect_all_cli
    messages = list(
               ^^^^^
  File "/users/lab/anaconda/nwb/lib/python3.11/site-packages/nwbinspector/nwbinspector.py", line 454, in inspect_all
    for message in future.result():
                   ^^^^^^^^^^^^^^^
  File "/users/lab/anaconda/nwb/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/users/lab/anaconda/nwb/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/users/lab/anaconda/nwb/lib/python3.11/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/lab/anaconda/nwb/lib/python3.11/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'register_check.<locals>.register_check_and_auto_parse.<locals>.auto_parse_some_output'

Operating System

Linux (not sure the exact OS, running on university cluster).

Python Version

Python 3.11.4

Were you streaming with ROS3?

No

Package Versions

``` hdmf 3.11.0 numpy 1.26.2 nwbinspector 0.4.33 pandas 2.1.3 pynwb 2.5.0 ```

Code of Conduct

[X] I agree to follow this project's Code of Conduct
[X] Have you ensured this bug was not already reported?
[X] To the best of your ability, have you ensured this is a bug within the code that checks the NWBFile, rather than a bug in the NWBFile reader (e.g., PyNWB or MatNWB)?

CodyCBakerPhD commented 9 months ago

Quite strange - we have tests for parallel usage in the API (https://github.com/NeurodataWithoutBorders/nwbinspector/blob/dev/tests/test_inspector.py#L212); and technically those leverage the same configuration wrapper via select mode rather than dandi

Tried adding a new API test for this call in https://github.com/NeurodataWithoutBorders/nwbinspector/pull/437, and I suppose I should add a CLI invocation and report saving just for good measure

Are you able by change to reproduce this in any way minimally? Would you be willing to share the NWB files in question or if on DANDI, point me towards the dandiset? Also can you share the results of the non-parallel report? (Wondering if it might be some specific check that in particular causes this, and that check is not in one of the base files we use for testing)

CodyCBakerPhD commented 9 months ago

Interesting; the CLI invocation reproduces it! https://github.com/NeurodataWithoutBorders/nwbinspector/actions/runs/7910884792/job/21594182388?pr=437

But the API usage doesn't... hmm

Will dig into this

CodyCBakerPhD commented 9 months ago

Though even more curiously seems to only affect linux and mac 😅

tuanpham96 commented 9 months ago

I tried to reproduce with simple test nwb files. Apparently the --config dandi is the cause.

import numpy as np
from datetime import datetime
from dateutil.tz import tzlocal
from pynwb import NWBFile, TimeSeries, NWBHDF5IO

num_files = 10

for i in range(num_files):
    start_time = datetime(2024, 1, 2, 3, 4, 5, tzinfo=tzlocal())
    name = f"test_{i:03d}"

    nwbfile = NWBFile(
        session_description=name,
        identifier=name,
        session_start_time=start_time,
    )

    ts = TimeSeries(
        name=f"ts{i}",
        data=np.random.randn(10, 200),
        unit="-",
        timestamps=np.arange(10),
    )
    nwbfile.add_acquisition(ts)

    with NWBHDF5IO(name + '.nwb', "w") as io:
        io.write(nwbfile)

Running with this is fine:

 nwbinspector . --n-jobs 4

But this would throw the pretty much the same error:

nwbinspector . --config dandi --n-jobs 4

Browsing the code, my guess is it has to do something with how the functions in available_checks are copied when a config is passed, but couldn't find a workaround.

tuanpham96 commented 9 months ago

Apparently, not using copy_check would be fine, although I'm not sure whether that is valid.

If I change this line:

https://github.com/NeurodataWithoutBorders/nwbinspector/blob/3b919939ee0c9351e0faba7468b48190f1faeb80/src/nwbinspector/nwbinspector.py#L160

to just

 mapped_check = check

things would work fine.

CodyCBakerPhD commented 9 months ago

But is the full testing suite OK with that change?

CodyCBakerPhD commented 9 months ago

BTW I'm having trouble reproducing this locally on my Windows and Linux systems, hence a bit of delay on getting to this

tuanpham96 commented 9 months ago

But is the full testing suite OK with that change?

I haven't run the tests yet. I'll try to run them later today or tomorrow on my local machine.

tuanpham96 commented 9 months ago

I tried running the tests on your PR branch locally on my machine (Linux) without copy_check, and I have weird results with test_inspector.py

Testing on that file would have the *parallel* tests pass, but fail on *importance_threshold* tests

pytest -v tests/test_inspector.py                                       
================================================================ test session starts =================================================================
platform linux -- Python 3.10.10, pytest-8.0.0, pluggy-1.4.0 -- 
cachedir: .pytest_cache
plugins: cov-4.1.0
collected 20 items                                                                                                                                   

tests/test_inspector.py::TestInspector::test_command_line_on_directory_matches_file PASSED                                                     [  5%]
tests/test_inspector.py::TestInspector::test_command_line_organizes_levels PASSED                                                              [ 10%]
tests/test_inspector.py::TestInspector::test_command_line_runs_cli_only PASSED                                                                 [ 15%]
tests/test_inspector.py::TestInspector::test_command_line_runs_cli_only_parallel PASSED                                                        [ 20%]
tests/test_inspector.py::TestInspector::test_command_line_runs_saves_json PASSED                                                               [ 25%]
tests/test_inspector.py::TestInspector::test_command_line_saves_report PASSED                                                                  [ 30%]
tests/test_inspector.py::TestInspector::test_inspect_all PASSED                                                                                [ 35%]
tests/test_inspector.py::TestInspector::test_inspect_nwbfile PASSED                                                                            [ 40%]
tests/test_inspector.py::TestInspector::test_inspect_nwbfile_dandi_config PASSED                                                               [ 45%]
tests/test_inspector.py::TestInspector::test_inspect_nwbfile_importance_threshold_as_importance FAILED                                         [ 50%]
tests/test_inspector.py::TestInspector::test_inspect_nwbfile_importance_threshold_as_string FAILED                                             [ 55%]
tests/test_inspector.py::TestInspector::test_inspect_nwbfile_manual_iteration PASSED                                                           [ 60%]
tests/test_inspector.py::TestInspector::test_inspect_nwbfile_manual_iteration_stop PASSED                                                      [ 65%]
tests/test_inspector.py::TestInspector::test_iterable_check_function PASSED                                                                    [ 70%]
tests/test_inspector.py::TestDANDIConfig::test_command_line_runs_parallel_with_config_and_report PASSED                                        [ 75%]
tests/test_inspector.py::TestDANDIConfig::test_inspect_nwbfile_dandi_config_critical_only_entire_registry PASSED                               [ 80%]
tests/test_inspector.py::TestDANDIConfig::test_inspect_nwbfile_dandi_config_violation_and_above_entire_registry PASSED                         [ 85%]
tests/test_inspector.py::TestCheckUniqueIdentifiersPass::test_check_unique_identifiers_pass PASSED                                             [ 90%]
tests/test_inspector.py::TestCheckUniqueIdentifiersFail::test_check_unique_identifiers_fail PASSED                                             [ 95%]
tests/test_inspector.py::test_dandi_config_in_vitro_injection PASSED                                                                           [100%]

But then re-doing with only the failed tests seem to pass.

pytest -v tests/test_inspector.py -k test_inspect_nwbfile_importance_threshold                                                                    
================================================================ test session starts =================================================================
platform linux -- Python 3.10.10, pytest-8.0.0, pluggy-1.4.0 -- /home/penguinsfly/Documents/work/fleischmann-lab-brown/general/dev-tools/contrib/nwbin
spector/.venv/bin/python                                                                                                                              
cachedir: .pytest_cache                                                                                                                               
rootdir: /home/penguinsfly/Documents/work/fleischmann-lab-brown/general/dev-tools/contrib/nwbinspector                                                
plugins: cov-4.1.0                                                                                                                                    
collected 20 items / 18 deselected / 2 selected                                                                                                       

tests/test_inspector.py::TestInspector::test_inspect_nwbfile_importance_threshold_as_importance PASSED                                         [ 50%] 
tests/test_inspector.py::TestInspector::test_inspect_nwbfile_importance_threshold_as_string PASSED                                             [100%]

Do you know why?

CodyCBakerPhD commented 9 months ago

Yeah, that's the reason the copy function exists in the first place. It's kind of messy to go into detail

Will try to think of broader workaround. For now I recommend running either without parallel and with config (slowest but most accurate), or without config and with parallel (recommended for very large number of files)

Note that the DANDI config just elevates certain subject related items to a high enough level to prevent validation - so you could theoretically just cross-reference the output against this file

tuanpham96 commented 9 months ago

Gotcha. Thanks for the suggestions!

NeurodataWithoutBorders / nwbinspector