Open xl0 opened 1 year ago
@xl0 you need to use nbdev_test --n_workers 0
to run it on serially on a single thread.
@deven367, The only difference between 0 and 1 workers is, with 0 the notebooks under test will be executed in the context of the nbdev_test process. With n_workers=1
, one ProcessPoolExecutor
worker process will be created, and the notebooks will be executed in this process one by one.
The resulting failure is the same in both cases.
I agree that this is hard to debug and ideally we wouldn't leak state.
It's been a while but IIRC these are two possible fixes:
InteractiveShells
and how to avoid/undo that, since we create a new shell for each notebook.multiprocessing
we can use the maxtasksperchild
arg. There doesn't seem to be an option for concurrent.futures
.I will take a look at number 2 once we've sorted out the mess with excessive diffs.
I'm not super familiar with parallel programming, especially in Python, so here's a simple solution that I can think of.
A partial function f
is passed to parallel()
, and it will spawn/fork a simple multiprocessing.Process
, pass the args to it, and wait for its completion and return the result.
Since in this case f
does not actually perform any computation, we can use a thread pool instead of a process pool.
Does this make sense @seeM , @jph00 ?
IMO the suggestion to use maxtasksperchild is probably the right fix.
So this was a bit of a rabbit hole.
I introduced class ThreadPool(multiprocessing.pool.Pool)
that more or less mirrors Fastcore ThreadPoolExecutor
.
Problem is, pool.Pool
creates daemon processes, and those can't have children on their own, and DataLoaders don't work. And there is no straightforward way to override this either.
The solution is https://stackoverflow.com/questions/6974695/python-process-pool-non-daemonic : You have to subclass the appropriate context class (the thing you get from get_context(...)
) and feed it with a subclassed Process
class :facepalm:
https://github.com/xl0/fastcore/commit/6035197f9727cffc1aaf5008f12c28e1b96fad60
With parallel() supporting ProcessPool, the change to nbdev is trivial: https://github.com/xl0/nbdev/commit/93a7478e4efe737e0daf59e22d875c5cf47040e3
@jph00 , @seeM , Does this look OK to you?
I ran nbdev_test for fastcore, nbdev, and fastai, and all 3 now pass without issue.
One extra thought. If someone is trying to debug their failed tests and passes --n_workers=0
, nbdev_test will process all notebooks in the parent process and will revert to the buggy behaviour. Should we disallow using 0 workers for more than 1 target notebook?
I think this looks reasonable - it's more complicated than I'd hoped, but it sounds like there might not be a simpler option? Although I see something on that SO page saying perhaps nested pools works on on py38+?
I don't think we should disallow using 0 workers for more than 1 target notebook -- people should be able to choose to do what they want.
@jph00 , I think the SO reply refers to concurrent.futures.ProcessPoolExecutor
, which does not support max tasks per child.
Please have a look at https://github.com/fastai/fastcore/pull/515 and #1296
Hey folks, what's the status on this? Not sure what/how to fix this, despite reading above.
I'm seeing both nbdev_prepare
and nbdev_test
for fresh installs of nbdev
v 2.3.13 & 2.3.14 (from GitHub source) crashing on Apple Silicon (homebrew, Python 3.11.6, brand-new Python venv
, package installs via pip
):
$ nbdev_test
/AppleInternal/Library/BuildRoots/495c257e-668e-11ee-93ce-926038f30c31/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Utility/MPSKernelDAG.mm:77: failed assertion `Failed to stich function with error: Error Domain=MTLLibraryErrorDomain Code=3 "Compiler encountered XPC_ERROR_CONNECTION_INVALID (is the OS shutting down?)" UserInfo={NSLocalizedDescription=Compiler encountered XPC_ERROR_CONNECTION_INVALID (is the OS shutting down?)}'
Traceback (most recent call last):
File "/Users/shawley/envs/aeiou/bin/nbdev_test", line 8, in <module>
sys.exit(nbdev_test())
^^^^^^^^^^^^
File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/fastcore/script.py", line 119, in _f
return tfunc(**merge(args, args_from_prog(func, xtra)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/nbdev/test.py", line 90, in nbdev_test
results = parallel(test_nb, files, skip_flags=skip_flags, force_flags=force_flags, n_workers=n_workers,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/fastcore/parallel.py", line 117, in parallel
return L(r)
^^^^
File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/fastcore/foundation.py", line 98, in __call__
return super().__call__(x, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/fastcore/foundation.py", line 106, in __init__
items = listify(items, *rest, use_list=use_list, match=match)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/fastcore/basics.py", line 66, in listify
elif is_iter(o): res = list(o)
^^^^^^^
File "/opt/homebrew/Cellar/python@3.11/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/process.py", line 620, in _chain_from_iterable_of_lists
for element in iterable:
File "/opt/homebrew/Cellar/python@3.11/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.11/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.11/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.11/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
Note: nbdev_export
works fine. Only seeing the crash with nbdev_test
& nbdev_prepare
.
Issue is only on Mac, no problems with Linux. So for now, I'm switching to Linux.
Run
nbdev_test --n_workers 1 --file_re "0[07].*" --do_print
infastai
repo (after installing the dev dependencies) Error I'm getting:--file_re "0[07].*"
meaning, test notebooks 00 and 07. nbdev_test will usefastcore
ProcessPoolExecutor
with 1 worker, which means the worker process will be re-used between notebooks.I don't completely understand the source of the error - both notebooks 00 and 07 make use of
test_fig_exists(ax)
, but only one of them fails, but pretty certain it comes from some internal state being shared between notebooks by the re-used process: No failure occurs if notebooks are run separately, or if--n_workers 2
and each notebook gets a fresh worker process.While the issue could possibly be fixed in
fastai
, I believe that the behaviour is itself a bug - a test framework should not leak state between unrelated notebooks. This behaviour results in unexpected and frustrating failures. I earlier ran into a similar issue using JAX with nbdev, when two notebooks would initialize JAX with different parameters, but since the process is being re-used, the second initialization in the same process is ignored.