fastai / nbdev

Create delightful software with Jupyter Notebooks
https://nbdev.fast.ai/
Apache License 2.0
4.9k stars 487 forks source link

nbdev_test parallel() process re-use results in hard to debug behaviour #1287

Open xl0 opened 1 year ago

xl0 commented 1 year ago

Run nbdev_test --n_workers 1 --file_re "0[07].*" --do_print in fastai repo (after installing the dev dependencies) Error I'm getting:

xl0@vespa:~/work/code/fastai$ nbdev_test --n_workers 1  --file_re "0[70].*" --do_print
Starting /ssd/xl0/work/code/fastai/nbs/00_torch_core.ipynb
- Completed /ssd/xl0/work/code/fastai/nbs/00_torch_core.ipynb
Starting /ssd/xl0/work/code/fastai/nbs/07_vision.core.ipynb
AttributeError in /ssd/xl0/work/code/fastai/nbs/07_vision.core.ipynb:
===========================================================================

While Executing Cell #41:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[1], line 1
----> 1 test_fig_exists(ax)

File ~/mambaforge/envs/test/lib/python3.9/site-packages/fastcore/test.py:101, in test_fig_exists(ax)
     99 def test_fig_exists(ax):
    100     "Test there is a figure displayed in `ax`"
--> 101     assert ax and len(ax.figure.canvas.tostring_argb())

File ~/mambaforge/envs/test/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py:438, in FigureCanvasAgg.tostring_argb(self)
    431 def tostring_argb(self):
    432     """
    433     Get the image as ARGB `bytes`.
    434 
    435     `draw` must be called at least once before this function will work and
    436     to update the renderer for any subsequent changes to the Figure.
    437     """
--> 438     return self.renderer.tostring_argb()

AttributeError: 'FigureCanvasAgg' object has no attribute 'renderer'

- Completed /ssd/xl0/work/code/fastai/nbs/07_vision.core.ipynb

nbdev Tests Failed On The Following Notebooks:
==================================================
        07_vision.core.ipynb
(test) xl0@vespa:~/work/code/fastai$ 

--file_re "0[07].*" meaning, test notebooks 00 and 07. nbdev_test will use fastcore ProcessPoolExecutor with 1 worker, which means the worker process will be re-used between notebooks.

I don't completely understand the source of the error - both notebooks 00 and 07 make use of test_fig_exists(ax), but only one of them fails, but pretty certain it comes from some internal state being shared between notebooks by the re-used process: No failure occurs if notebooks are run separately, or if --n_workers 2 and each notebook gets a fresh worker process.

While the issue could possibly be fixed in fastai, I believe that the behaviour is itself a bug - a test framework should not leak state between unrelated notebooks. This behaviour results in unexpected and frustrating failures. I earlier ran into a similar issue using JAX with nbdev, when two notebooks would initialize JAX with different parameters, but since the process is being re-used, the second initialization in the same process is ignored.

deven367 commented 1 year ago

@xl0 you need to use nbdev_test --n_workers 0 to run it on serially on a single thread.

xl0 commented 1 year ago

@deven367, The only difference between 0 and 1 workers is, with 0 the notebooks under test will be executed in the context of the nbdev_test process. With n_workers=1, one ProcessPoolExecutor worker process will be created, and the notebooks will be executed in this process one by one.

The resulting failure is the same in both cases.

seeM commented 1 year ago

I agree that this is hard to debug and ideally we wouldn't leak state.

It's been a while but IIRC these are two possible fixes:

xl0 commented 1 year ago

I will take a look at number 2 once we've sorted out the mess with excessive diffs.

xl0 commented 1 year ago

I'm not super familiar with parallel programming, especially in Python, so here's a simple solution that I can think of.

A partial function f is passed to parallel(), and it will spawn/fork a simple multiprocessing.Process, pass the args to it, and wait for its completion and return the result.

Since in this case f does not actually perform any computation, we can use a thread pool instead of a process pool.

Does this make sense @seeM , @jph00 ?

jph00 commented 1 year ago

IMO the suggestion to use maxtasksperchild is probably the right fix.

xl0 commented 1 year ago

So this was a bit of a rabbit hole.

I introduced class ThreadPool(multiprocessing.pool.Pool) that more or less mirrors Fastcore ThreadPoolExecutor. Problem is, pool.Pool creates daemon processes, and those can't have children on their own, and DataLoaders don't work. And there is no straightforward way to override this either.

The solution is https://stackoverflow.com/questions/6974695/python-process-pool-non-daemonic : You have to subclass the appropriate context class (the thing you get from get_context(...) ) and feed it with a subclassed Process class :facepalm:

https://github.com/xl0/fastcore/commit/6035197f9727cffc1aaf5008f12c28e1b96fad60

With parallel() supporting ProcessPool, the change to nbdev is trivial: https://github.com/xl0/nbdev/commit/93a7478e4efe737e0daf59e22d875c5cf47040e3

@jph00 , @seeM , Does this look OK to you?

I ran nbdev_test for fastcore, nbdev, and fastai, and all 3 now pass without issue.

One extra thought. If someone is trying to debug their failed tests and passes --n_workers=0, nbdev_test will process all notebooks in the parent process and will revert to the buggy behaviour. Should we disallow using 0 workers for more than 1 target notebook?

jph00 commented 1 year ago

I think this looks reasonable - it's more complicated than I'd hoped, but it sounds like there might not be a simpler option? Although I see something on that SO page saying perhaps nested pools works on on py38+?

I don't think we should disallow using 0 workers for more than 1 target notebook -- people should be able to choose to do what they want.

xl0 commented 1 year ago

@jph00 , I think the SO reply refers to concurrent.futures.ProcessPoolExecutor, which does not support max tasks per child.

Please have a look at https://github.com/fastai/fastcore/pull/515 and #1296

drscotthawley commented 9 months ago

Hey folks, what's the status on this? Not sure what/how to fix this, despite reading above.

I'm seeing both nbdev_prepare and nbdev_test for fresh installs of nbdev v 2.3.13 & 2.3.14 (from GitHub source) crashing on Apple Silicon (homebrew, Python 3.11.6, brand-new Python venv, package installs via pip):

$ nbdev_test
/AppleInternal/Library/BuildRoots/495c257e-668e-11ee-93ce-926038f30c31/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Utility/MPSKernelDAG.mm:77: failed assertion `Failed to stich function with error: Error Domain=MTLLibraryErrorDomain Code=3 "Compiler encountered XPC_ERROR_CONNECTION_INVALID (is the OS shutting down?)" UserInfo={NSLocalizedDescription=Compiler encountered XPC_ERROR_CONNECTION_INVALID (is the OS shutting down?)}'
Traceback (most recent call last):
  File "/Users/shawley/envs/aeiou/bin/nbdev_test", line 8, in <module>
    sys.exit(nbdev_test())
             ^^^^^^^^^^^^
  File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/fastcore/script.py", line 119, in _f
    return tfunc(**merge(args, args_from_prog(func, xtra)))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/nbdev/test.py", line 90, in nbdev_test
    results = parallel(test_nb, files, skip_flags=skip_flags, force_flags=force_flags, n_workers=n_workers,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/fastcore/parallel.py", line 117, in parallel
    return L(r)
           ^^^^
  File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/fastcore/foundation.py", line 98, in __call__
    return super().__call__(x, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/fastcore/foundation.py", line 106, in __init__
    items = listify(items, *rest, use_list=use_list, match=match)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shawley/envs/aeiou/lib/python3.11/site-packages/fastcore/basics.py", line 66, in listify
    elif is_iter(o): res = list(o)
                           ^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/process.py", line 620, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/opt/homebrew/Cellar/python@3.11/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.6_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Note: nbdev_export works fine. Only seeing the crash with nbdev_test & nbdev_prepare.

Issue is only on Mac, no problems with Linux. So for now, I'm switching to Linux.