OpenCMISS / iron

Source code repository for OpenCMISS-Iron

Eliminate non-scaling memory consumption to enable iron for High Performance Computing #112

Open maierbn opened 6 years ago

maierbn commented 6 years ago

Currently iron cannot be used for highly parallel simulations on more than ~10 compute nodes (24 ranks each). This is because the per-process memory consumption rises linearly instead of staying constant in a weak-scaling scenario (i.e. when the number of processes and the problem size are increased at the same rate).

The reason for this is the way various items (elements, nodes, dofs, matrix columns, etc.) are numbered, each using a GLOBAL_TO_LOCAL map. This array holds information about the whole problem on every process. It is mainly used to set up the domain mappings, but is also accessed afterwards.
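
For illustration, here is a minimal, hypothetical sketch (not actual iron types or sizes) of why such a map cannot weak-scale: the global-to-local array has one entry per global item on every process, whereas a purely local structure only stores the items that belong to that process.

PROGRAM GlobalToLocalMemory
  IMPLICIT NONE
  INTEGER, PARAMETER :: numberOfDomains=240, elementsPerDomain=100000
  ! Non-scaling: one entry per GLOBAL element on every rank, so per-rank memory
  ! grows linearly with the number of ranks in a weak-scaling run.
  INTEGER, ALLOCATABLE :: globalToLocalMap(:)
  ! Scaling alternative: one entry per LOCAL (internal+boundary+ghost) element only.
  INTEGER, ALLOCATABLE :: localToGlobalMap(:)

  ALLOCATE(globalToLocalMap(numberOfDomains*elementsPerDomain)) ! O(global problem size) per rank
  ALLOCATE(localToGlobalMap(elementsPerDomain))                 ! O(local problem size) per rank
  PRINT *, 'entries per rank: global map =', SIZE(globalToLocalMap), &
    & ', local map =', SIZE(localToGlobalMap)
END PROGRAM GlobalToLocalMemory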

Another non-scaling data structure is the mesh topology, which is used to define the mesh from the user code. It contains a global list of all elements, nodes, etc. Once the decomposition has been created the mesh can be deallocated, so the hope is that it is not the limiting memory bottleneck. However, its memory footprint still grows proportionally to the global problem size regardless of the decomposition, and it will sooner or later become an issue.

maierbn commented 6 years ago

To fix the memory issue the following steps should be taken:

A first attempt at the first three bullet points was made in issue #61, but the resulting code was deleted.

maierbn commented 6 years ago

This repo implements the setup of the domain mappings for elements, nodes and dofs without a GLOBAL_TO_LOCAL_MAP. Determining the boundary and ghost numbers involves a lot of local (neighbour-to-neighbour) MPI communication, while global communication is avoided as far as possible.
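
As a rough, hypothetical sketch of what "local MPI communication" means here (assuming a simple 1D decomposition where each rank only borders rank-1 and rank+1; the actual neighbourhood in iron is given by the decomposition), each rank exchanges boundary information only with its neighbours instead of with all ranks:

PROGRAM NeighbourExchange
  USE mpi
  IMPLICIT NONE
  INTEGER :: ierr, rank, numberOfRanks, left, right
  INTEGER :: myBoundaryCount, leftBoundaryCount, rightBoundaryCount
  INTEGER :: status(MPI_STATUS_SIZE)

  CALL MPI_Init(ierr)
  CALL MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  CALL MPI_Comm_size(MPI_COMM_WORLD, numberOfRanks, ierr)

  left = rank-1; IF(left < 0) left = MPI_PROC_NULL
  right = rank+1; IF(right >= numberOfRanks) right = MPI_PROC_NULL

  myBoundaryCount = 10+rank   ! stand-in for the number of boundary candidates on this rank
  leftBoundaryCount = 0; rightBoundaryCount = 0

  ! exchange the count with the right neighbour (receive from the left) ...
  CALL MPI_Sendrecv(myBoundaryCount, 1, MPI_INTEGER, right, 0, &
    & leftBoundaryCount, 1, MPI_INTEGER, left, 0, MPI_COMM_WORLD, status, ierr)
  ! ... and with the left neighbour (receive from the right)
  CALL MPI_Sendrecv(myBoundaryCount, 1, MPI_INTEGER, left, 1, &
    & rightBoundaryCount, 1, MPI_INTEGER, right, 1, MPI_COMM_WORLD, status, ierr)

  CALL MPI_Finalize(ierr)
END PROGRAM NeighbourExchange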

Elements mapping

If the decomposition is of type CMFE_DECOMPOSITION_CALCULATED_TYPE, ParMETIS is used to compute the element decomposition. If it is CMFE_DECOMPOSITION_USER_DEFINED_TYPE, the user code can set the domain of each element with cmfe_Decomposition_ElementDomainSet. This routine can be called asynchronously across processes; a process does not have to specify the domains of all elements, nor even of all of its own elements. In total, the domain of every element has to be specified by at least one process, which can be any process.
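
A hedged usage sketch of the user-defined case (a fragment, not a complete program; the argument order of cmfe_Decomposition_ElementDomainSet is assumed to be decomposition, element user number, domain, error and should be checked against the cmfe bindings; Decomposition, firstElementOnThisRank, lastElementOnThisRank and numberOfComputationalNodes are placeholder names):

! Assumes a decomposition object "Decomposition" of type CMFE_DECOMPOSITION_USER_DEFINED_TYPE
! has already been created. Each rank only sets the domains of the elements it knows about;
! together the ranks must cover every element at least once.
DO elementIdx = firstElementOnThisRank, lastElementOnThisRank
  targetDomain = MOD(elementIdx-1, numberOfComputationalNodes)   ! example assignment rule
  CALL cmfe_Decomposition_ElementDomainSet(Decomposition, elementIdx, targetDomain, Err)
ENDDO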

Nodes mapping

The internal nodes are assigned to the same domain as their adjacent elements. The potential boundary nodes are distributed by a heuristic such that the total numbers of nodes per process are as equal as possible.
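
One plausible form of such a heuristic (a sketch with made-up data, not the actual iron code) is to assign each potential boundary node to the adjacent domain that currently has the fewest nodes:

PROGRAM BoundaryNodeHeuristic
  IMPLICIT NONE
  INTEGER, PARAMETER :: numberOfDomains=3, numberOfBoundaryNodes=4
  INTEGER :: nodesPerDomain(numberOfDomains)
  ! adjacentDomains(:,n) lists the two domains that share boundary node n
  INTEGER :: adjacentDomains(2,numberOfBoundaryNodes)
  INTEGER :: nodeIdx, chosenDomain

  nodesPerDomain = [10, 12, 11]          ! internal nodes already assigned per domain
  adjacentDomains(:,1) = [1,2]
  adjacentDomains(:,2) = [1,2]
  adjacentDomains(:,3) = [2,3]
  adjacentDomains(:,4) = [1,3]

  DO nodeIdx = 1, numberOfBoundaryNodes
    ! pick the adjacent domain with the smaller current node count
    IF(nodesPerDomain(adjacentDomains(1,nodeIdx)) <= nodesPerDomain(adjacentDomains(2,nodeIdx))) THEN
      chosenDomain = adjacentDomains(1,nodeIdx)
    ELSE
      chosenDomain = adjacentDomains(2,nodeIdx)
    ENDIF
    nodesPerDomain(chosenDomain) = nodesPerDomain(chosenDomain)+1
    PRINT *, 'boundary node', nodeIdx, '-> domain', chosenDomain
  ENDDO
  PRINT *, 'final nodes per domain:', nodesPerDomain
END PROGRAM BoundaryNodeHeuristic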

Dofs mapping

The dof numbering follows the global node numbering and requires a large Allgatherv, because the global dof numbers on a process depend on the local dof counts of all preceding processes.
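
A simplified sketch of this dependency (assuming that a rank's global dof numbers start directly after the dofs of all lower ranks; the gather of a single count per rank shown here stands in for the larger Allgatherv of the actual code):

PROGRAM DofNumberingOffsets
  USE mpi
  IMPLICIT NONE
  INTEGER :: ierr, rank, numberOfRanks, rankIdx
  INTEGER :: numberOfLocalDofs, firstGlobalDof
  INTEGER, ALLOCATABLE :: localDofCounts(:)

  CALL MPI_Init(ierr)
  CALL MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  CALL MPI_Comm_size(MPI_COMM_WORLD, numberOfRanks, ierr)

  numberOfLocalDofs = 100+rank   ! stand-in for the local (internal+boundary) dof count

  ! every rank needs the dof counts of all other ranks ...
  ALLOCATE(localDofCounts(numberOfRanks))
  CALL MPI_Allgather(numberOfLocalDofs, 1, MPI_INTEGER, &
    & localDofCounts, 1, MPI_INTEGER, MPI_COMM_WORLD, ierr)

  ! ... because its first global dof number is a prefix sum over all preceding ranks
  firstGlobalDof = 1
  DO rankIdx = 1, rank
    firstGlobalDof = firstGlobalDof+localDofCounts(rankIdx)
  ENDDO
  PRINT *, 'rank', rank, 'owns global dofs', firstGlobalDof, 'to', firstGlobalDof+numberOfLocalDofs-1

  CALL MPI_Finalize(ierr)
END PROGRAM DofNumberingOffsets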

The pull request does not delete any of the old implementation. The new code in mesh_routines.f90 and field_routines.f90 is wrapped in the following constructs:

IF (USE_NEW_LOCAL_IMPLEMENTATION) THEN
  ! set up the field mapping without GLOBAL_TO_LOCAL_MAP
ENDIF

IF (USE_OLD_GLOBAL_IMPLEMENTATION) THEN
  ! old implementation
ENDIF

IF (USE_OLD_GLOBAL_IMPLEMENTATION.AND.USE_NEW_LOCAL_IMPLEMENTATION) THEN
  ! compare the field mappings and output a warning if they differ
ENDIF

This works if either one or both of the global variables USE_NEW_LOCAL_IMPLEMENTATION and USE_OLD_GLOBAL_IMPLEMENTATION are set to true. The generated elements mapping is always identical (except for a minor difference where the old implementation allocates empty arrays for adjacent domains). The generated nodes mapping is in general not identical, because the new implementation uses a heuristic. The resulting number of nodes per domain may still be optimal depending on the test case, but the boundary nodes are distributed differently. Given the same nodes mapping, the generated dofs mapping is again identical.

An example with test cases for the new code can be found here (based on the laplace example).

Note that the implementation was started from the Stuttgart version of iron and thus contains a lot of other changes that are unrelated to the mappings.