Closed mdickinson closed 2 years ago
Here's a complete transcript of a shell session on my machine:
The last line of output above was produced at 16:39 local time on my machine. It's now 16:49, and there's been no further output in the last 10 minutes.
Interestingly, the hang in the test suite also seems to be intermittent: just now I was able to get a complete test run. This again would be consistent with some kind of deadlock, and may have the same underlying cause as the hangs we've been seeing at installation time. (e.g., #1049)
I'm not even sure how the installation worked without hanging in the first place - because we know that there's an issue with vtk 9.0.3 that we haven't resolved yet on the default master branch. Is this not an issue on MacOS?
Is this not an issue on MacOS?
Sometimes the installation hangs, sometimes it doesn't. On my machine it seems to succeed more often than not.
Sometimes the installation hangs, sometimes it doesn't. On my machine it seems to succeed more often than not.
I didn't know that. On my Windows machine, it has always hanged with VTK 9.0.3 - but by always, I'm talking about 10s of attempts and not 100s so I am not a 100% sure this isn't an intermittent failure on Windows.
Looking at the transcript, it looks like the testsuite hangs before any of the tests are even run - so it's hanging as soon as the first test setup is happening. Is that a correct assumption?
I suspect that the hang is happening at test discovery time (so likely at the point of importing something from tvtk / vtk); I need to find time to get more information. I also get hangs if running through python -m unittest discover
rather than through nose.
I suspect that the hang is happening at test discovery time
Confirmed: I'm using unittest2 (mostly so that I can freely add debugging information without messing with the "system" Python 3.9 install) and frequently (but not always) seeing a hang when trying to load tests from mayavi.tests.test_composite_data
.
(mayavi) mdickinson@mirzakhani mayavi % unit2
Doing discovery
Checkpoint 1
Branch 1
Finding tests
_get_module_from_name ... : mayavi.tests.test_array_source
_get_module_from_name done: mayavi.tests.test_array_source
_get_module_from_name ... : mayavi.tests.test_builtin_image
_get_module_from_name done: mayavi.tests.test_builtin_image
_get_module_from_name ... : mayavi.tests.test_builtin_surface
_get_module_from_name done: mayavi.tests.test_builtin_surface
_get_module_from_name ... : mayavi.tests.test_composite_data
And simply importing that test module at a prompt hangs (at least, it did the first time I tried it; like everything else here it may well be intermittent):
(mayavi) mdickinson@mirzakhani mayavi % python
Python 3.9.7 (default, Sep 1 2021, 04:00:27)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import mayavi.tests.test_composite_data
(No output following the import statement, for several minutes.)
I wonder if it's the TestMLabNullEngine
that's causing the problems. Can you try running or importing one of the other tests that use it? See https://github.com/enthought/mayavi/search?q=TestMlabNullEngine
I've narrowed it down to from mayavi import mlab
.
Further narrowed to python -c "import tvtk.tools.tvtk_doc"
, and in particular the execution of get_tvtk_class_names()
in that module.
And at this point we're getting into Heisenbug territory: the hang is occurring during the main for name in dir(vtk)
loop in get_tvtk_class_names
, but if I start adding debug print
statements to find out exactly which name it's hanging on, the hang is no longer easy to reproduce, but does still occur occasionally.
I've seen two hangs on vtkDataObject
so far.
I was hoping nothing apart from vtkDataEncoder
, vtkWebApplication
were causing the crashes but alas.
Here's an LLDB backtrace from one case of the hang (looks like it was vtkDataEncoder
this time). It's hanging on a thread join operation.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x00007fff6d05f55e libsystem_kernel.dylib`__ulock_wait + 10
frame #1: 0x00007fff6d1225c2 libsystem_pthread.dylib`_pthread_join + 347
frame #2: 0x000000010270b8a2 libvtkCommonCore-9.0.dylib`vtkMultiThreader::TerminateThread(int) + 226
frame #3: 0x0000000102a9d5cc libvtkWebCore-9.0.dylib`vtkDataEncoder::vtkInternals::TerminateAllWorkers() + 124
frame #4: 0x0000000102a9d67d libvtkWebCore-9.0.dylib`vtkDataEncoder::~vtkDataEncoder() + 29
frame #5: 0x000000010249ce3d libvtkWrappingPythonCore-9.0.dylib`vtkPythonUtil::RemoveObjectFromMap(_object*) + 157
frame #6: 0x0000000102481215 libvtkWrappingPythonCore-9.0.dylib`PyVTKObject_Delete + 37
frame #7: 0x00000001019b0b57 Python`_PyEval_EvalFrameDefault + 2263
frame #8: 0x0000000101910e1a Python`function_code_fastcall + 97
frame #9: 0x00000001019b9086 Python`call_function + 403
frame #10: 0x00000001019b6653 Python`_PyEval_EvalFrameDefault + 25555
frame #11: 0x00000001019b9be2 Python`_PyEval_EvalCode + 2008
frame #12: 0x00000001019b01c8 Python`PyEval_EvalCode + 57
frame #13: 0x00000001019ad501 Python`builtin_exec + 581
frame #14: 0x0000000101940ae7 Python`cfunction_vectorcall_FASTCALL + 203
frame #15: 0x00000001019b69b9 Python`_PyEval_EvalFrameDefault + 26425
frame #16: 0x00000001019b9be2 Python`_PyEval_EvalCode + 2008
frame #17: 0x0000000101910daa Python`_PyFunction_Vectorcall + 249
frame #18: 0x00000001019b9086 Python`call_function + 403
frame #19: 0x00000001019b65b5 Python`_PyEval_EvalFrameDefault + 25397
frame #20: 0x0000000101910e1a Python`function_code_fastcall + 97
frame #21: 0x00000001019b9086 Python`call_function + 403
frame #22: 0x00000001019b6599 Python`_PyEval_EvalFrameDefault + 25369
frame #23: 0x0000000101910e1a Python`function_code_fastcall + 97
frame #24: 0x00000001019b9086 Python`call_function + 403
frame #25: 0x00000001019b6653 Python`_PyEval_EvalFrameDefault + 25555
frame #26: 0x0000000101910e1a Python`function_code_fastcall + 97
frame #27: 0x00000001019b9086 Python`call_function + 403
frame #28: 0x00000001019b6653 Python`_PyEval_EvalFrameDefault + 25555
frame #29: 0x0000000101910e1a Python`function_code_fastcall + 97
frame #30: 0x0000000101911e58 Python`object_vacall + 385
frame #31: 0x0000000101911fbc Python`_PyObject_CallMethodIdObjArgs + 248
frame #32: 0x00000001019d4ef8 Python`PyImport_ImportModuleLevelObject + 1152
frame #33: 0x00000001019b4f35 Python`_PyEval_EvalFrameDefault + 19637
frame #34: 0x00000001019b9be2 Python`_PyEval_EvalCode + 2008
frame #35: 0x00000001019b01c8 Python`PyEval_EvalCode + 57
frame #36: 0x00000001019eb059 Python`run_eval_code_obj + 110
frame #37: 0x00000001019ea230 Python`run_mod + 103
frame #38: 0x00000001019e9256 Python`PyRun_StringFlags + 182
frame #39: 0x00000001019e9163 Python`PyRun_SimpleStringFlags + 69
frame #40: 0x0000000101a00241 Python`Py_RunMain + 425
frame #41: 0x0000000101a00b36 Python`pymain_main + 360
frame #42: 0x0000000101a00b8b Python`Py_BytesMain + 42
frame #43: 0x00007fff6cf1ccc9 libdyld.dylib`start + 1
frame #44: 0x00007fff6cf1ccc9 libdyld.dylib`start + 1
I was hoping nothing apart from
vtkDataEncoder
...
That would make sense: it looks as though it's the destructor for vtkDataEncoder
that's triggering the hang, and that destructor would be triggered in the Python code when rebinding the c
name to the next object in the for list. So I think the apparent hang on vtkDataObject
could be explained by problems with vtkDataEncoder
.
I guess a temporary fix is to ignore those classes in get_tvtk_class_names
- and probably other places in mayavi or tvtk which are dynamically trying to infer and instantiate vtk classes.
Yes, I guess so. Seems we need to get one of the existing PRs for hang fixing in.
As a better long-term solution, we might stop to ask why we're doing introspection of (and instantiation of) all the VTK classes at import time. Can we find a better way to do this? For this particular piece of code, it appears that the only reason we're instantiating each class is to try to find out the number of inputs and outputs. Can we get this information in some other way? Do we need to get this information at all? E.g., specifically for Mayavi, would it make sense to explicitly specify the VTK classes we want to use for filters and the like? That would mean having to maintain that list when VTK got updated, but on the other hand it would also mean that what's available in Mayavi would be consistent independent of the VTK version used (with some minimum VTK version required to make sure that everything we need is there).
Upstream issue: https://gitlab.kitware.com/vtk/vtk/-/issues/18344
@mdickinson -- thanks for all the details, this is much easier for me to follow in terms of reproducing the issue. I will go over this issue carefully and report back.
Thank you for this, I can reproduce the lockup. I will work on a fix over the next week. I think in the meanwhile I think can merge #1050. It is long overdue and I can fix the other issues in due course? What do you folks think?
I think can merge #1050. It is long overdue and I can fix the other issues in due course? What do you folks think?
Sounds good to me. I think #1050 just allows mayavi
to install but things still don't quite work (MNE-Python's plotting routines freeze with 9.1.0-pre wheels IIRC), so it's just a first step anyway. I might be able to look more specifically at our failures next week, too. (I have access to Linux and macOS machines so I can do some local testing.)
@rahulporuri @mdickinson -- I do not seem to have admin privileges on this repo anymore (at least I cannot see the settings tab anymore) and am unable to merge #1050, could you please help?
The upstream issue is reported as fixed, and it looks as though the fix will be in 9.1 (but I'm not sure whether I'm reading the GitLab issue log correctly).
I'm unable to run the test suite (under nose) on macOS, with the latest master of Mayavi in a venv: the test suite hangs with no output.
Steps to reproduce: here are the commands I'm issuing, in a zsh shell on macOS 10.15.7 (Intel MacBook Pro). Here
python3.9
is Python 3.9.7, installed via MacPorts.Note: the
python -m pip install -e .
instruction also hangs intermittently (perhaps 1 run out of every 3), but succeeds often enough to allow me to get to the next step.When the installation succeeds, the test run using
nose
hangs with no output other than aRuntimeWarning
aboutnose.core
being found insys.modules
prior to execution ofnose.core
. The hang can't be interrupted with a Ctrl-C, andtop
shows no CPU usage or memory growth in the associated process - IOW the hang resembles a deadlock of some kind. The Python process shows 7 OS-level threads; I'm not sure what's causing the extra threads in something that I'd expect to be single-threaded.There's at least one obvious problem with my setup, which is that I know that the test suite expects some other packages to be present, and will fail without them. However, this still shouldn't be responsible for hanging the test run - we should have a better failure mode here.
I consider this sort of issue to be high priority to diagnose and fix: it makes development work much harder than it needs to be.