JGCRI / cassandra

Human-earth system multi-scale model coupling framework

Ensure MPI_ABORT is called on errors #37

Closed: rplzzz closed this issue 5 years ago

rplzzz commented 5 years ago

See details here:

https://mpi4py.readthedocs.io/en/stable/mpi4py.run.html

rplzzz commented 5 years ago

Confirmed: this bug causes the system to hang if one of the components throws an exception. During development of the MP branch, I was at one point iterating over the dictionary of outstanding tasks and removing the ones that had completed. Changing a dictionary's size while iterating over it should have raised an exception and crashed the system. For example:

# Popping a key while iterating over the same dict changes its size mid-loop.
d = {1:2, 3:4, 5:6}
for k in d:
    if k==3:
        print(d.pop(k))

When you run this you get:

4
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration

When it happened in the MP code, however, the whole system hung instead, with no output to stdout.
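For reference, the underlying iteration error can be avoided by looping over a snapshot of the keys. This is only a minimal sketch: `pop_completed` and `is_done` are hypothetical names for illustration, not functions from the cassandra code.

```python
def pop_completed(tasks, is_done):
    """Remove finished entries from `tasks` and return them as a new dict.

    `tasks` and `is_done` are hypothetical stand-ins for the dictionary of
    outstanding tasks and whatever completion check the real code uses.
    """
    finished = {}
    # list(tasks) copies the keys up front, so popping from `tasks`
    # inside the loop no longer changes the object being iterated.
    for key in list(tasks):
        if is_done(tasks[key]):
            finished[key] = tasks.pop(key)
    return finished


d = {1: 2, 3: 4, 5: 6}
done = pop_completed(d, lambda value: value == 4)
print(done, d)
```

This removes one source of exceptions, but it does not address the hang itself: any other uncaught exception in a component would still leave the remaining MPI processes waiting.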

rplzzz commented 5 years ago

Note also that the fix proposed in the mpi4py docs doesn't require any change to our code, just a change to how we run it (using python -m mpi4py). I think we should be able to just add this to the shebang line of cassandra_main.py. We'd also want to test it by adding an option to the dummy component to throw an exception.
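A rough sketch of what the test hook in the dummy component could look like. The class name, the `params` dictionary, and the `fail` option are all assumptions made for illustration; the real dummy component may be structured differently. The launch line in the comment follows the general form given in the mpi4py docs.

```python
# General launch form from the mpi4py docs (mpi4py versions that ship mpi4py.run):
#   mpiexec -n numprocs python -m mpi4py pyfile [arg] ...

class DummyComponent:
    """Hypothetical dummy component with an option to fail on purpose."""

    def __init__(self, params):
        # `params` stands in for whatever configuration the component receives.
        self.params = params

    def run(self):
        # When the (assumed) `fail` option is set, raise deliberately so we
        # can check that the whole MPI job aborts instead of hanging.
        if self.params.get("fail", False):
            raise RuntimeError("DummyComponent: deliberate failure for testing")
        return "ok"


print(DummyComponent({}).run())
```

Running the framework once with the option set and once without should show whether an exception in a single component now brings down the whole job.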

rplzzz commented 5 years ago

Well, crud. The fix described above only works in mpi4py 2.1.0 and above, and PIC has version 2.0.0 installed. Instead of trying to get them to upgrade, I'm going to see if I can just fix the problem by wrapping our main in a try block.
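The try-block approach might look roughly like the sketch below. `run_with_abort` and `mpi_abort` are hypothetical helper names, and `main` stands in for whatever entry point cassandra_main.py actually uses; the only mpi4py call used is `MPI.COMM_WORLD.Abort`.

```python
import sys
import traceback


def run_with_abort(main, abort):
    """Run main(); on an uncaught exception, report it and call abort(1)."""
    try:
        return main()
    except Exception:
        # Print and flush the traceback first, because the abort may
        # terminate the process before buffered output is written.
        traceback.print_exc()
        sys.stderr.flush()
        abort(1)


def mpi_abort(errorcode):
    """Abort every process in COMM_WORLD with the given error code."""
    # Imported lazily so run_with_abort can be tested without MPI installed.
    from mpi4py import MPI
    MPI.COMM_WORLD.Abort(errorcode)


# Assumed usage at the bottom of the main script:
#   if __name__ == "__main__":
#       run_with_abort(main, mpi_abort)
```

Passing the abort function in as an argument is just a convenience for testing the wrapper with a stand-in. Note that `except Exception` deliberately lets `SystemExit` and `KeyboardInterrupt` through.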