rplzzz closed this 5 years ago
Confirmed: this bug will cause the system to hang if one of the components throws an exception. During the development of the MP branch, at one point I was iterating over the dictionary of outstanding tasks and removing the ones that had completed. This should have raised an exception and crashed the system. For example:
d = {1:2, 3:4, 5:6}
for k in d:
    if k==3:
        print(d.pop(k))
When you run this you get:
4
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration
When it happened in the MP code, however, the result was for the whole system to hang, with no output to stdout.
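For reference, one common way to avoid the RuntimeError itself is to iterate over a snapshot of the keys rather than over the live dictionary. This is only a minimal sketch: the names reap_completed, outstanding, and is_done are made up for illustration and are not taken from the MP branch.

```python
def reap_completed(outstanding, is_done):
    """Remove finished tasks from `outstanding` and return them.

    `outstanding` and `is_done` are hypothetical stand-ins for the
    task dictionary and completion check in the real code.
    """
    finished = {}
    # list(...) takes a snapshot of the keys, so popping entries from
    # the dictionary does not disturb the loop.
    for key in list(outstanding):
        if is_done(outstanding[key]):
            finished[key] = outstanding.pop(key)
    return finished


tasks = {1: "done", 3: "running", 5: "done"}
finished = reap_completed(tasks, lambda status: status == "done")
print(finished)
print(tasks)
```

This fixes the iteration bug only. It does nothing about the larger problem that an uncaught exception in one component hangs the whole run.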
Note also that the fix proposed in the mpi4py docs doesn't require any change to our code, just a change to how we run it (using python -m mpi4py). I think we should be able to just add this to the shebang line of cassandra_main.py. We'd also want to test it by adding an option to the dummy component to throw an exception.
Well, crud. The fix described above only works in mpi4py 2.1.0 and above, and PIC has version 2.0.0 installed. Instead of trying to get them to upgrade, I'm going to see if I can just fix the problem by wrapping our main in a try block.
See details here:
https://mpi4py.readthedocs.io/en/stable/mpi4py.run.html
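The try-block workaround mentioned above could look roughly like the sketch below. It assumes that calling MPI.COMM_WORLD.Abort is an acceptable way to bring down the other ranks, and the helper names (run_guarded, _mpi_abort) are invented here. The thread does not show what the eventual fix looked like.

```python
import sys
import traceback


def _mpi_abort(errorcode):
    # Imported lazily so this module still loads where mpi4py is absent.
    from mpi4py import MPI
    # Abort asks the MPI runtime to terminate every rank, not just this one.
    MPI.COMM_WORLD.Abort(errorcode)


def run_guarded(main, abort=_mpi_abort):
    """Run `main`; on an uncaught exception, report it and abort all ranks.

    `main` is a hypothetical zero-argument entry point. `abort` is
    injectable so the wrapper can be exercised without an MPI runtime.
    """
    try:
        return main()
    except Exception:
        # Print the traceback first, since Abort may kill the process
        # before buffered output is written.
        traceback.print_exc()
        sys.stderr.flush()
        abort(1)
```

In cassandra_main.py this would presumably be invoked as run_guarded(main) under the usual if __name__ == "__main__" guard. It could be tested with a dummy component option that raises on purpose, as suggested earlier.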