CI crashing out with `exit code 137`

firedrakeproject / firedrake

Firedrake is an automated system for the portable solution of partial differential equations using the finite element method (FEM)

https://firedrakeproject.org

CI crashing out with `exit code 137` #2824

Open JDBetteridge opened 1 year ago

JDBetteridge commented 1 year ago

We are running out of memory on the CI. I have now profiled the test suite and found the source of the issue to be the test `test_firedrake_helmholtz_scalar_convergence_on_hex` in `tests/regression/test_helmholtz.py`, introduced in commit e68b9f63644fce64fd23c85031261e29aae2dd88.
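For context, exit code 137 follows the common shell convention of 128 plus a signal number, so it corresponds to signal 9 (SIGKILL), which is the signal the Linux out-of-memory killer sends. The sketch below decodes such a code with Python's standard `signal` module; `describe_exit_code` is a hypothetical helper name, not part of Firedrake or the CI scripts.

```python
import signal


def describe_exit_code(code):
    """Return the signal name if `code` follows the 128 + N convention, else None.

    Hypothetical helper for illustration only.
    """
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            # No signal with that number on this platform.
            return None
    return None


# 137 - 128 = 9, which is SIGKILL on Linux.
print(describe_exit_code(137))
```

A SIGKILL can also come from other sources (for example a manual `kill -9`), so the exit code alone does not prove an out-of-memory kill; the memory profile is what confirms it here.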

This issue should be treated as fairly urgent: currently it is not possible to run the tests locally unless you have a pretty beefy machine (at least 64GB RAM!), and even then you are in danger of running out of memory if using xdist to parallelise pytest.

I attach two plots showing the memory profile for the test suite (run without xdist).

[Plots: full_test_suite, test_helmholtz]

ksagiyam commented 1 year ago

This test is large as it is a convergence test on hex mesh, and the hex mesh must be such that it contains all possible facet orientations. I might have to create a reasonable mesh by hand.

JDBetteridge commented 1 year ago

We should also have some policy on acceptable test sizes. Off the top of my head a good starting point would be:

Test duration <1 minute
MPI ranks <=4
Total memory <4GB

Runner hardware is currently 48 physical cores and 64GB RAM. Four GitHub runners share this hardware, and tests are run using pytest-xdist with -n 12 (currently -n 8 to try to mitigate this issue).
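As a rough illustration of what those numbers imply, the sketch below computes the average memory available per xdist worker. It assumes that all four runners are busy at the same time and that memory is shared evenly, neither of which is stated above; `memory_per_worker_gb` is a hypothetical helper.

```python
def memory_per_worker_gb(total_ram_gb, runners, workers_per_runner):
    """Average RAM per pytest-xdist worker, assuming every runner is busy at once."""
    return total_ram_gb / (runners * workers_per_runner)


# Figures from the comment above: 64GB RAM, four runners on one machine.
budget_n12 = memory_per_worker_gb(64, runners=4, workers_per_runner=12)
budget_n8 = memory_per_worker_gb(64, runners=4, workers_per_runner=8)

print(f"-n 12: {budget_n12:.2f} GB per worker")
print(f"-n 8:  {budget_n8:.2f} GB per worker")
```

Under these assumptions a single test at the proposed 4GB limit would use about twice the even share at -n 8, so the limit only works if most tests stay well below it.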

Possible exceptions: the number of ranks could be greater for testing communicator functionality or for testing Ensemble. In these cases a test should break only one of these limits.
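One way to make the proposed policy checkable is sketched below. This is only an illustration of the limits listed above, not an existing Firedrake or pytest feature; the `Footprint` class, the function names, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass

# Proposed limits from the list above.
MAX_DURATION_S = 60.0  # test duration < 1 minute
MAX_MPI_RANKS = 4      # MPI ranks <= 4
MAX_MEMORY_GB = 4.0    # total memory < 4GB


@dataclass
class Footprint:
    """Measured resource use of one test (hypothetical record)."""
    name: str
    duration_s: float
    mpi_ranks: int
    memory_gb: float


def broken_limits(fp):
    """Return the names of the limits that this test breaks."""
    broken = []
    if fp.duration_s >= MAX_DURATION_S:
        broken.append("duration")
    if fp.mpi_ranks > MAX_MPI_RANKS:
        broken.append("mpi_ranks")
    if fp.memory_gb >= MAX_MEMORY_GB:
        broken.append("memory")
    return broken


def is_acceptable(fp, exempt=False):
    """Exempt tests (e.g. communicator or Ensemble tests) may break exactly one limit."""
    broken = broken_limits(fp)
    if not broken:
        return True
    return exempt and len(broken) == 1


# Made-up numbers for illustration only.
small = Footprint("small_test", duration_s=12.0, mpi_ranks=1, memory_gb=0.5)
ensemble = Footprint("ensemble_test", duration_s=30.0, mpi_ranks=6, memory_gb=1.0)
huge = Footprint("huge_test", duration_s=300.0, mpi_ranks=1, memory_gb=40.0)

for fp, exempt in [(small, False), (ensemble, True), (huge, True)]:
    print(fp.name, broken_limits(fp), is_acceptable(fp, exempt=exempt))
```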

We should also concretise these limits on the wiki.

JDBetteridge commented 1 year ago

@ksagiyam I think constructing the mesh by hand to reduce the size would be a good idea. Could you split the problem up and have a set of meshes which together cover all possible orientations, rather than one big mesh containing all orientations?

ksagiyam commented 1 year ago

That could be a good option, actually.

dham commented 1 year ago

Does this need to be a convergence test at all? I presume this is basically an orientations test. If you did the test based on data in a polynomial space of degree no higher than the elements, then the operations should be exact up to machine precision and you could instead check for near-zero error.
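The exactness argument can be illustrated without Firedrake. The sketch below is a plain-Python, one-dimensional analogue: it interpolates a quadratic with a degree-2 Lagrange polynomial and checks that the error is at roundoff level, whereas a quartic leaves a visible error. It says nothing about facet orientations and is not the test that was eventually written; all function names are made up for this example.

```python
def lagrange_interpolate(nodes, values, x):
    """Evaluate the Lagrange interpolant through (nodes, values) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(nodes, values)):
        basis = 1.0
        for j, xj in enumerate(nodes):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total


def max_interpolation_error(f, degree, samples=101):
    """Interpolate f on [0, 1] at degree + 1 equispaced nodes; return the max sampled error."""
    nodes = [k / degree for k in range(degree + 1)]
    values = [f(x) for x in nodes]
    points = [k / (samples - 1) for k in range(samples)]
    return max(abs(lagrange_interpolate(nodes, values, x) - f(x)) for x in points)


def quadratic(x):
    # Lies in the degree-2 polynomial space, so interpolation is exact.
    return 1.0 + 2.0 * x - 3.0 * x * x


def quartic(x):
    # Outside the degree-2 space, so interpolation leaves a real error.
    return x ** 4


exact_error = max_interpolation_error(quadratic, degree=2)
inexact_error = max_interpolation_error(quartic, degree=2)

print(f"quadratic data: {exact_error:.2e}")
print(f"quartic data:   {inexact_error:.2e}")
```

Because the check is "error near zero" rather than "error decreases at the expected rate", a single small mesh is enough and no refinement sequence is needed.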

wence- commented 1 year ago

Or compute some cohomology, which is topological but, I presume, would be sensitive to the orientations being incorrect.

ksagiyam commented 1 year ago

Ok, sounds good. For now let me just quickly do polynomial interpolation tests to fix CI.

connorjward commented 1 year ago

@JDBetteridge can this be closed?

JDBetteridge commented 1 year ago

Can we leave it open until we add something to the wiki?