Closed natanlao closed 5 years ago
Thanks for reporting @natanlao
Given your repository is open source, can you please tell us a way to reproduce this problem with smallest possible set of notebooks maybe? There are a few things I'd like to try out.
Refactor treon to use multiprocessing instead of multithreading
Please pardon my lack of knowledge here but what is it about processes that would make it possible instead of threads. Is it that the processes can have different working directory but threads cannot (as threads derive working directory from the parent process)?
If I'm understanding correctly, it is not so much about the treon internals (if it uses threads or processes), but rather about adding the correct working directory parameter when starting the Jupyter kernel the notebook will run in.
I think it is a good change to set the current working directory to the folder containing the notebook. My primary use-case is importing a Python module containing common setup-code, which lives in the same directory as the notebooks.
While tinkering with treon, I already made this change once; it is rather straightforward, see test_execution.py#execute_notebook
. If desired, filing a PR would be no problem. We need to decide if this behavior just changed, or if it must be configurable. Maybe some folks already rely on the current behavior to keep the working directory of treon's invocation.
adding the correct working directory parameter when starting the Jupyter kernel the notebook will run in
ah, absolutely right.
We need to decide if this behavior just changed, or if it must be configurable. Maybe some folks already rely on the current behavior to keep the working directory of treon's invocation.
Our default should match the default of Jupyter UI. From this comment, The kernel started for a notebook has a working directory of where the notebook file is
so we should also use this as default.
Thanks! I didn't know that that solution was possible.
We're using treon for HumanCellAtlas/data-consumer-vignettes, and ran into a problem where notebooks that refer to relative paths can unexpectedly fail when tested with treon.
These notebooks expect the current working directory to be the directory that the notebook resides in. When using treon, this is not likely to be the case, since treon searches recursively for notebooks to test. (In this case, the current working directory is the directory where
treon
was invoked.)I've been able to solve this locally by changing the working directory to the directory in which the notebook resides before testing the notebook. That said, this solution only works if treon is limited to a single thread (or only testing one notebook), since the current working directory is shared across all threads.
There are a few approaches to this that I can think of:
Refactor treon to use multiprocessing instead of multithreading. Skimming the source code, it seems like this change would be more or less trivial. Using multiprocessing would have the benefit of working directory isolation, in addition to potential performance improvements for testing CPU-bound notebooks by circumventing the GIL, at the cost of some performance overhead.
Add an option to perform the same directory-switching that I used above that limits the number of parallel threads to one.
Ignore this problem - our workaround is to handle the directory-switching ourselves, running treon with one notebook at a time, while still achieving parallelism with
xargs
.I'm not sure if this is a widespread use case, and I'm happy to take a shot at either of these myself.