alan-turing-institute / stat-fem

Python tools for solving data-constrained finite element problems
GNU Lesser General Public License v3.0

Parallel tests still hang #16

Closed: edaub closed this issue 4 years ago

edaub commented 4 years ago

I still have issues with the parallel tests hanging. The hangs occur reliably with 4 processes once a sufficient fraction of the test suite is run, with no clear reason why (2 processes seems reliably fine). Running a single file at a time appears to be fine, but adding enough other tests triggers the problem. Changing the test order does not produce any predictable pattern, though the LinearSolver and ForcingCovariance tests are the ones most prone to hanging.

I don't think it is a memory usage issue -- the base test suite consistently uses ~100 MB of memory, and the parallel test suite uses roughly 4 times that as expected. This amount is nowhere near what would be needed to cause any issues.

For whatever reason, it never seems to have trouble on Travis CI, but it is more problematic on my Mac and in my own build of the Docker container.

However, I have no idea at this point what else could be causing this other than some MPI issue that I have failed to uncover.

edaub commented 4 years ago

An update after some further investigation:

I can now rebuild Firedrake's Docker container manually (they seem to have resolved the install issue I was having), and the rebuilt container runs the tests without any problems. The previous Docker image I was using was older, and running firedrake-update in that container appears to have resolved the issue. My recollection is that I was also having these issues on my Mac with an older version, and that they eventually went away (possibly when I did an update).

Returning to the old container, I looked at the output of pip freeze and updated the packages that differed, which did not fix the problem, so the cause is probably somewhere in the Firedrake codebase (or one of its dependencies). The older (hanging) version has the following commit hashes for the Firedrake components:

COFFEE:    70c1e66a4e4e39d3bf75274505a16901af110751
fiat:      d085f35723a6c2992009f8f21b7d14485c338b8d
ufl:       6b09484b74ed3369a6c3c5756712ec44dae18ba3
FInAT:     1e8838888f4672ca0191bb25d58041a5f1635f89
firedrake: e98fe7cfa05182a136f52fc0325110e32726295f
h5py:      c69fc627c96aafcc1393bb70115e5bcd3a6f8a95
loopy:     83c0ce0a5749c53b05fb2bd0f94c244854169bc5
PyOP2:     da66688c436c7f2001d97ec7174255c7f0b05aed
tsfc:      f14bb4f394a438e7f68d8f6598f11c3f7d4f87ee

The updated version with correctly working components is:

COFFEE:    70c1e66a4e4e39d3bf75274505a16901af110751
fiat:      1d562d641079583b650eed7f5eb863fcaca4df98
ufl:       e3b8752d8f72f851fd24d29c0be9419765997985
FInAT:     273d5703c8d2b0d4e4600fc983147dc2b26a72c4
firedrake: 93dfb5c5328c04d34bb8313a52185cb7fd144fde
h5py:      c69fc627c96aafcc1393bb70115e5bcd3a6f8a95
loopy:     4f2b39411ebbc687a98cc8804bed04400cd1b475
PyOP2:     dac72025f35294d2f73d1e2316bd6c3732f90778
tsfc:      dcf921e26d8f05b49bdb1a53fa3a8f09b88cf29a
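For reference, comparing the two lists shows that every component except COFFEE and h5py moved between the hanging and the working install. A minimal sketch of that comparison (the dictionaries simply transcribe the hashes above):

# compare the two component snapshots above to see which ones changed
old = {
    "COFFEE":    "70c1e66a4e4e39d3bf75274505a16901af110751",
    "fiat":      "d085f35723a6c2992009f8f21b7d14485c338b8d",
    "ufl":       "6b09484b74ed3369a6c3c5756712ec44dae18ba3",
    "FInAT":     "1e8838888f4672ca0191bb25d58041a5f1635f89",
    "firedrake": "e98fe7cfa05182a136f52fc0325110e32726295f",
    "h5py":      "c69fc627c96aafcc1393bb70115e5bcd3a6f8a95",
    "loopy":     "83c0ce0a5749c53b05fb2bd0f94c244854169bc5",
    "PyOP2":     "da66688c436c7f2001d97ec7174255c7f0b05aed",
    "tsfc":      "f14bb4f394a438e7f68d8f6598f11c3f7d4f87ee",
}
new = {
    "COFFEE":    "70c1e66a4e4e39d3bf75274505a16901af110751",
    "fiat":      "1d562d641079583b650eed7f5eb863fcaca4df98",
    "ufl":       "e3b8752d8f72f851fd24d29c0be9419765997985",
    "FInAT":     "273d5703c8d2b0d4e4600fc983147dc2b26a72c4",
    "firedrake": "93dfb5c5328c04d34bb8313a52185cb7fd144fde",
    "h5py":      "c69fc627c96aafcc1393bb70115e5bcd3a6f8a95",
    "loopy":     "4f2b39411ebbc687a98cc8804bed04400cd1b475",
    "PyOP2":     "dac72025f35294d2f73d1e2316bd6c3732f90778",
    "tsfc":      "dcf921e26d8f05b49bdb1a53fa3a8f09b88cf29a",
}
changed = [name for name in old if old[name] != new[name]]
print("components that changed:", ", ".join(changed))
# prints: fiat, ufl, FInAT, firedrake, loopy, PyOP2, tsfc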

I do not really want to dig into which of these commits work and which do not, but I will note them here in case other people run into trouble. Because things seem to work fine for me now, I will close the issue, and will re-open it if I run into further problems.

edaub commented 4 years ago

Still having some issues, as this popped up again when I reworked the Travis build (which, curiously, was a situation where the tests tended not to hang in the past...).

edaub commented 4 years ago

I think I have figured out the problem here.

After extensive testing, I found that the hangs tended to occur at the boundary between tests: a print statement at the end of one test executed, but one placed at the start of the next test did not. This means the problem is not in the test code or my code, but in something occurring between the tests, such as test teardown or garbage collection.
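(For reference, a minimal sketch of one way to instrument test boundaries like this under pytest; the autouse fixture and the use of mpi4py to label each rank are illustrative additions, not part of the stat-fem test suite.)

# conftest.py -- hypothetical instrumentation for locating where the hang occurs
import pytest
from mpi4py import MPI

@pytest.fixture(autouse=True)
def trace_test_boundaries(request):
    # print a per-rank message at the start and end of every test, so a hang
    # between tests shows up as "finished" lines with no following "starting"
    rank = MPI.COMM_WORLD.rank
    print(f"[rank {rank}] starting {request.node.name}", flush=True)
    yield
    print(f"[rank {rank}] finished {request.node.name}", flush=True)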

Poking around on the Firedrake issues, I found the following board on parallel issues in garbage collection:

https://github.com/firedrakeproject/firedrake/projects/6

This suggests there may be problems when the underlying Firedrake/PETSc/PyOP2 objects are garbage collected, since the collector can run at different times (and in different orders) on different processes while destroying objects whose cleanup involves collective operations, and collection is especially likely to happen between successive tests. That would be consistent with what I have observed above: the hangs occur between tests, not inside them.

To confirm this, I disabled garbage collection during the tests and found that they then pass reliably without locking up. This appears to be the workaround suggested by the Firedrake developers (see the link above). I will merge this into the code, and revisit it as needed to see if further upstream fixes make it unnecessary.
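(A minimal sketch of how garbage collection can be switched off for a pytest session; the session-scoped autouse fixture here is illustrative and may not match exactly how stat-fem wires this up.)

# conftest.py -- disable automatic garbage collection for the whole test session
import gc
import pytest

@pytest.fixture(scope="session", autouse=True)
def no_automatic_gc():
    gc.disable()   # collections can no longer fire between tests
    yield
    gc.enable()    # restore normal behaviour once the session is over
    gc.collect()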

For users: if you run into code deadlocking or hanging, a workaround is to put

import gc
gc.disable()

at the start of any scripts, and then call gc.collect() manually at points where you know that all parallel operations have completed.
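A minimal sketch of how this might look in a user script (the middle section is just a placeholder for whatever parallel stat-fem/Firedrake work the script does):

import gc

# switch off automatic collection so it cannot fire part-way through a
# collective parallel operation
gc.disable()

# ... build the mesh, assemble, and solve the data-constrained FEM problem
# in parallel (placeholder for the actual stat-fem/Firedrake calls) ...

# once all processes have finished their parallel operations, collect manually
gc.collect()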