SCOREC / core

parallel finite element unstructured meshes

Expecting one or two elements per part from split and zsplit #424


eisungy commented 3 months ago

Hi. One of the application codes I'm involved with uses split and zsplit to partition a serial mesh of 982 faces. The code is based on the discontinuous Galerkin method, and its users want to distribute the mesh so that each MPI rank gets roughly one or two elements.
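For context, below is a minimal sketch of the splitter calls that the split (Parma RIB) and zsplit (Zoltan) drivers are built around, assuming the standard apf/parma/apfZoltan interfaces; the weight tag, the tolerance value, and the surrounding mesh loading and migration are illustrative rather than the exact driver code.

```cpp
#include <apf.h>
#include <apfMesh2.h>
#include <apfPartition.h>
#include <parma.h>
#include <apfZoltan.h>

// Build a migration plan that splits each existing part into `factor` parts.
static apf::Migration* makeSplitPlan(apf::Mesh2* m, int factor, bool useZoltan)
{
  apf::Splitter* splitter = useZoltan
      ? apf::makeZoltanSplitter(m, apf::GRAPH, apf::PARTITION, /*debug=*/false)
      : Parma_MakeRibSplitter(m); // RIB appears to require a power-of-two factor
  apf::MeshTag* weights = Parma_WeighByMemory(m); // per-element weights
  apf::Migration* plan = splitter->split(weights, /*tolerance=*/1.10, factor);
  apf::removeTagFromDimension(m, weights, m->getDimension());
  m->destroyTag(weights);
  delete splitter;
  return plan; // the caller migrates the mesh with this plan
}
```

The returned plan is then used to migrate elements onto the new parts.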

However, for some part counts split and zsplit did not work. Below are the results of a parameter scan.

(1) When the number of parts is a power of 2:

    # of parts   split   zsplit
    32           O       O
    256          O       O
    512          O       X

(2) When the number of parts is NOT a power of 2:

    # of parts   split   zsplit
    48           X       O
    96           X       O
    288          X       O
    336          X       X
    384          X       X

Since the users' cluster has 48 cores per compute node, they want to partition the mesh into a multiple of 48 parts. Because the mesh has so few elements, one idea I'm considering is to have every rank load the same entire mesh without partitioning it. In that case, however, I'm worried that PUMI won't work, because the mesh would be duplicated on every rank and there would be no partition map or any partitioning at all.

In sum, I have two questions.

  1. Is there a limitation in split and zsplit when the ratio of the number of elements to the total number of MPI ranks is close to 1?
  2. If such a limitation exists, is there a recommended way to handle this kind of application with PUMI?

Thanks.

cwsmith commented 3 months ago

Hi @eisungy,

We typically don't run PUMI with so few elements per part (MPI rank).

> I'm worried that PUMI won't work, because the mesh would be duplicated on every rank and there would be no partition map or any partitioning at all.

Your concern is correct; without a partition of the mesh, none of PUMI's distributed functions will work as expected.

  1. IIRC, there is no guarantee that Zoltan/ParMETIS (used by zsplit) won't create empty parts. We have not tested split down to the element counts described here. I'd have to see the error logs to say more. For one of the failed cases, would you please provide the input mesh, build info, execution command (split or zsplit and its arguments), and error logs? I can't give an estimate of how soon someone will be able to do a deep dive on the bug, but maybe we'll see something in the error log.

  2. I can't think of something offhand.

eisungy commented 2 months ago

split_err_test.tar.gz

Hi @cwsmith, thank you for your answer. I have uploaded the mesh files together with the error messages returned by split for 48/96/144 parts (the split.err.XX files).

All of those runs printed only the single message below.

(1 << depth) == multiple failed at /home/esyoon/src/core/core-master-20240315/parma/rib/parma_mesh_rib.cc + 69
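That assertion text suggests the RIB-based split only accepts a split factor ("multiple") satisfying (1 << depth) == multiple for some integer depth, i.e. a power of two, which matches the pattern in the table above. A trivial sketch of the same check done up front (the helper name is hypothetical):

```cpp
// Hypothetical pre-check mirroring the failed assertion: the RIB split factor
// must apparently be a power of two ((1 << depth) == multiple).
static bool isPowerOfTwoFactor(int multiple)
{
  return multiple > 0 && (multiple & (multiple - 1)) == 0;
}
```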

I couldn't include the error message from zsplit for the 336-part case in the attached file, but it is shown below.

APF warning: 9 empty parts
numDc+numIso >= 1 failed at /home/esyoon/src/core/core-master-20240315/parma/diffMC/parma_dcpart.cc + 124
numDc+numIso >= 1 failed at /home/esyoon/src/core/core-master-20240315/parma/diffMC/parma_dcpart.cc + 124
numDc+numIso >= 1 failed at /home/esyoon/src/core/core-master-20240315/parma/diffMC/parma_dcpart.cc + 124
numDc+numIso >= 1 failed at /home/esyoon/src/core/core-master-20240315/parma/diffMC/parma_dcpart.cc + 124
numDc+numIso >= 1 failed at /home/esyoon/src/core/core-master-20240315/parma/diffMC/parma_dcpart.cc + 124
numDc+numIso >= 1 failed at /home/esyoon/src/core/core-master-20240315/parma/diffMC/parma_dcpart.cc + 124
numDc+numIso >= 1 failed at /home/esyoon/src/core/core-master-20240315/parma/diffMC/parma_dcpart.cc + 124
numDc+numIso >= 1 failed at /home/esyoon/src/core/core-master-20240315/parma/diffMC/parma_dcpart.cc + 124
numDc+numIso >= 1 failed at /home/esyoon/src/core/core-master-20240315/parma/diffMC/parma_dcpart.cc + 124

It is printed as a warning, but I couldn't get any resulting partitioned files.
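Since the zsplit failure begins with the empty-part warning, here is a minimal sketch of how a driver could detect empty parts right after applying a split plan, assuming the global PCU collective helpers available in this core snapshot; the reporting text is illustrative:

```cpp
#include <PCU.h>
#include <apfMesh2.h>
#include <cstdio>

// Count parts that ended up with zero elements after the split.
static void reportEmptyParts(apf::Mesh2* m)
{
  int localElems = static_cast<int>(m->count(m->getDimension()));
  int emptyParts = PCU_Add_Int(localElems == 0 ? 1 : 0);
  if (!PCU_Comm_Self() && emptyParts > 0)
    std::fprintf(stderr, "%d empty parts after split\n", emptyParts);
}
```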

Thank you for your investigation.