OP-DSL / OP2-Common

OP2: open-source framework for the execution of unstructured grid applications on clusters of GPUs or multi-core CPUs
https://op-dsl.github.io

Infinite loop in op_plan_core() triggered with a particular multigrid mesh #161

Closed: aowenson closed this issue 5 years ago

aowenson commented 5 years ago

I have a Rotor37 mesh of approximately 8 million nodes, with 4 multigrid levels. If I feed this into the hybrid MPI+OpenMP variant of MG-CFD-OP2 with at least 2 MPI processes, OP2 becomes stuck in what appears to be an infinite loop during the call to op_plan_core() for the highest multigrid mesh.

Interestingly, the issue does not occur with just 1 MPI process, or with the pure OpenMP variant of MG-CFD-OP2. The choice of partitioner has no influence.

To help reproduce the issue I can provide a .json file for feeding into the MG-CFD job generator, which generates a script that compiles and executes MG-CFD. I will have to transfer the mesh offline.

The OP2 branch 'fix/op-plan-core-infinite-loop' contains a check for this infinite loop.
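For illustration only (this is not the code on that branch): a bounded-iteration guard of the following shape turns a non-terminating plan-construction loop into an immediate, diagnosable failure. The function name, callback, and pass limit are all hypothetical.

```c
/* Hypothetical sketch of an infinite-loop guard; not the actual OP2 fix.
 * plan_loop_with_guard() and MAX_PLAN_PASSES are made-up names. */
#include <stdio.h>
#include <stdlib.h>

#define MAX_PLAN_PASSES 100000  /* assumed safety cap on plan-construction passes */

void plan_loop_with_guard(int (*make_progress)(void))
{
  int pass = 0;
  /* make_progress() is expected to return non-zero once the plan is complete */
  while (!make_progress()) {
    if (++pass > MAX_PLAN_PASSES) {
      fprintf(stderr,
              "plan construction exceeded %d passes - likely an infinite "
              "loop, aborting\n", MAX_PLAN_PASSES);
      exit(EXIT_FAILURE);
    }
  }
}
```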

aowenson commented 5 years ago

As the partitioners are also having trouble with this mesh, I will expand the scope of this issue and give more details.

I have analysed the mesh and confirmed it is connected: all nodes are reachable by edge-hopping from node 0. Node degrees are sensible; most nodes have 6 neighbours.
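A minimal sketch of that kind of check, assuming the mesh's edge map is available on one process as a flat array of node-index pairs (check_mesh() and the data layout are assumptions, not an MG-CFD or OP2 API):

```c
/* Hypothetical sketch of the connectivity check described above: build an
 * adjacency structure from a flat edge list (node-index pairs), then BFS
 * from node 0 and report how many nodes were reached. */
#include <stdio.h>
#include <stdlib.h>

void check_mesh(int num_nodes, int num_edges, const int *edges /* 2*num_edges entries */)
{
  int *degree  = calloc(num_nodes, sizeof(int));
  int *offset  = malloc((num_nodes + 1) * sizeof(int));
  int *adj     = malloc(2 * (size_t)num_edges * sizeof(int));
  int *fill    = calloc(num_nodes, sizeof(int));
  int *visited = calloc(num_nodes, sizeof(int));
  int *queue   = malloc(num_nodes * sizeof(int));

  /* node degrees, then CSR offsets */
  for (int e = 0; e < num_edges; e++) {
    degree[edges[2*e]]++;
    degree[edges[2*e+1]]++;
  }
  offset[0] = 0;
  for (int n = 0; n < num_nodes; n++) offset[n+1] = offset[n] + degree[n];

  /* fill adjacency lists (each edge contributes both directions) */
  for (int e = 0; e < num_edges; e++) {
    int a = edges[2*e], b = edges[2*e+1];
    adj[offset[a] + fill[a]++] = b;
    adj[offset[b] + fill[b]++] = a;
  }

  /* breadth-first search from node 0 */
  int head = 0, tail = 0, reached = 0;
  visited[0] = 1;
  queue[tail++] = 0;
  while (head < tail) {
    int n = queue[head++];
    reached++;
    for (int i = offset[n]; i < offset[n+1]; i++)
      if (!visited[adj[i]]) { visited[adj[i]] = 1; queue[tail++] = adj[i]; }
  }

  printf("reached %d of %d nodes by edge-hopping from node 0\n", reached, num_nodes);

  free(degree); free(offset); free(adj); free(fill); free(visited); free(queue);
}
```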

PT-Scotch: The PT-Scotch partitioner succeeds with this mesh, whether using the Geom or KWay strategy.

ParMETIS: With the KWay method, ParMETIS fails at rank counts of 2 or greater; with the Geom method the failure begins at 25 ranks. In both cases the error message is:

Poor initial vertex distribution. Processor X has no vertices assigned to it!

This appears for each even-numbered rank X.

If no map is specified, so that ParMETIS falls back to its 'trivial block partitioning', then partitioning succeeds.
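For context on the error above (a sketch, not OP2 code): ParMETIS describes the caller-supplied initial distribution with a vtxdist array of length nranks+1, where rank r owns vertices [vtxdist[r], vtxdist[r+1]); the message indicates that some of those ranges are empty. A hypothetical pre-flight check could flag the offending ranks before calling ParMETIS_V3_PartKway():

```c
/* Sketch based on the error text above; every_rank_has_vertices() is a
 * hypothetical helper, not part of OP2 or ParMETIS. Rank r owns vertices
 * [vtxdist[r], vtxdist[r+1]) in ParMETIS's initial distribution. */
#include <stdio.h>
#include <mpi.h>
#include <parmetis.h>   /* for idx_t */

int every_rank_has_vertices(const idx_t *vtxdist, int nranks)
{
  int ok = 1;
  for (int r = 0; r < nranks; r++) {
    if (vtxdist[r + 1] <= vtxdist[r]) {
      fprintf(stderr,
              "rank %d owns no vertices (vtxdist[%d]=%lld, vtxdist[%d]=%lld)\n",
              r, r, (long long)vtxdist[r], r + 1, (long long)vtxdist[r + 1]);
      ok = 0;   /* this is the kind of rank ParMETIS complains about */
    }
  }
  return ok;
}
```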

Inertial: The inertial partitioner copes with low rank counts, but at approximately 18 ranks and greater it segfaults. For some run and cluster configurations the segfault occurs at the call to MPI_Comm_dup() just after the main loop; for others it occurs during the loop itself.
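For comparison, a minimal well-formed use of MPI_Comm_dup() looks like the following; a crash inside such a collective on a valid communicator is more likely a symptom of earlier memory corruption than of the call itself (an assumption about the likely cause, not a diagnosis of the inertial partitioner):

```c
/* Minimal, correct use of MPI_Comm_dup(); shown only as a reference point. */
#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  MPI_Comm dup;
  MPI_Comm_dup(MPI_COMM_WORLD, &dup);  /* collective: all ranks must call it */

  /* ... use the duplicated communicator ... */

  MPI_Comm_free(&dup);
  MPI_Finalize();
  return 0;
}
```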