Issue description
When built in Debug mode, ParMmg fails with the following error message:
## Error: PMMG_check_intNodeComm: rank <n>:
A point shared by at least 2 groups has 2 positions (<k1> and <k2>) in the internal communicator (dist = <dist>):
...
Steps to reproduce the issue
environment:
ubuntu:22.04 or ubuntu:20.04
openMPI:4.0.3 (bug has also been reproduced with mpich)
gcc-9.4.0
x86_64 or aarch64 architecture
check out commit c7549f120 of ParMmg (it should use commit 2263f92c of Mmg);
build ParMmg in Debug mode (-DCMAKE_BUILD_TYPE=Debug) with USE_POINTMAP enabled (default behaviour) and USE_SCOTCH disabled; ⚠️ Enabling Scotch solves the issue!
enable the continuous integration tests (-DBUILD_TESTING=ON);
run the multidom_wave-8 test (which succeeds), then the multidom_wave-8-rerun test, which should fail.
Investigation
The error is linked to inconsistencies when sorting the coor_list array, which contains the coordinates of the points stored in the internal communicators, and when traversing this sorted array to remove the duplicated points (https://github.com/MmgTools/ParMmg/blob/beb4147b7b29a2891e1a55745ff95863e93bcc0d/src/communicators_pmmg.c#L1788-L1811):
the comparison function provided to qsort compares the values to within an epsilon. Thus x may be evaluated as equal to x+ε, and x+ε as equal to x+2ε, leading to an erroneous sort: x < x+2ε < x+ε. When traversing the array to delete the duplicated points, since the comparison function evaluates x as different from x+2ε, the point x+ε is not deleted, even though it should be evaluated as a duplicate of x. In practice, we found exactly identical values separated by other values in the sorted array...
using exact comparisons inside qsort (and approximate comparisons for the duplicate detection) doesn't work either:
as we compare double-precision values, 2 values with the same x-coordinate may be evaluated as different due to the machine epsilon (denoted ε_m), leading, again, to an erroneous sort. Let's take 3 points with the following coordinates: A(x, y), B(x, y_2 >> y) and C(x+ε_m, y). C is a duplicate of A, while B is a different point (its y is greater). We end up with the following order: A, B, C;
as B is different from A, again, the duplicate removal fails.
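The first failure mode (a comparison "to within an epsilon" is not transitive) can be checked in isolation. The following is a minimal sketch, not ParMmg's actual comparator: the names cmp_with_tolerance, equality_is_transitive and DEMO_EPS are hypothetical, and the tolerance is deliberately large so that all the arithmetic is exact.

```c
/* Illustrative tolerance (assumption, not ParMmg's value): chosen large so
 * that every value used with it is exactly representable as a double. */
#define DEMO_EPS 1.0

/* Hypothetical qsort-style comparator: two values closer than DEMO_EPS
 * are reported as equal, otherwise they are ordered normally. */
int cmp_with_tolerance(const void *pa, const void *pb) {
  double a = *(const double *)pa;
  double b = *(const double *)pb;
  double d = a - b;
  if (d < 0.0) d = -d;            /* absolute difference */
  if (d < DEMO_EPS) return 0;     /* "equal" to within the tolerance */
  return (a < b) ? -1 : 1;
}

/* Returns 0 when the comparator reports a == b and b == c but a != c,
 * i.e. when "equality" is not transitive on this triple; 1 otherwise. */
int equality_is_transitive(double a, double b, double c) {
  int ab = cmp_with_tolerance(&a, &b);
  int bc = cmp_with_tolerance(&b, &c);
  int ac = cmp_with_tolerance(&a, &c);
  if (ab == 0 && bc == 0 && ac != 0) return 0;
  return 1;
}
```

With x = 0.0 and ε = 0.75, equality_is_transitive(0.0, 0.75, 1.5) returns 0: x equals x+ε and x+ε equals x+2ε, yet x is strictly less than x+2ε. Such a comparator does not define a consistent ordering, so the order produced by qsort is not guaranteed to place near-equal values next to each other.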
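The A, B, C configuration can also be reproduced with a small standalone program. This is only a sketch under stated assumptions: the DemoPoint type, the helper names and the tolerance passed to the duplicate test are hypothetical and are not taken from ParMmg; the duplicate pass simply compares each entry with its predecessor.

```c
#include <float.h>
#include <stdlib.h>

/* Hypothetical 2D point with a tag so the sorted order can be inspected. */
typedef struct { double x, y; char tag; } DemoPoint;

/* Exact lexicographic comparison (x first, then y): a valid total order. */
int cmp_exact(const void *pa, const void *pb) {
  const DemoPoint *a = pa, *b = pb;
  if (a->x != b->x) return (a->x < b->x) ? -1 : 1;
  if (a->y != b->y) return (a->y < b->y) ? -1 : 1;
  return 0;
}

/* Approximate duplicate test: both coordinates agree to within tol. */
int same_within(const DemoPoint *a, const DemoPoint *b, double tol) {
  double dx = a->x - b->x, dy = a->y - b->y;
  if (dx < 0.0) dx = -dx;
  if (dy < 0.0) dy = -dy;
  return dx <= tol && dy <= tol;
}

/* Sorts exactly, then counts the entries that differ from their
 * predecessor in the sorted array. */
int count_unique_adjacent(DemoPoint *pts, int n, double tol) {
  int unique = (n > 0) ? 1 : 0;
  qsort(pts, (size_t)n, sizeof *pts, cmp_exact);
  for (int i = 1; i < n; ++i)
    if (!same_within(&pts[i - 1], &pts[i], tol)) ++unique;
  return unique;
}

/* Fills pts with A(x, y), B(x, y_2 >> y) and C(x + eps_m, y), in shuffled
 * order, using x = 1, y = 0, y_2 = 10 and eps_m = DBL_EPSILON. */
void make_abc(DemoPoint pts[3]) {
  pts[0] = (DemoPoint){1.0 + DBL_EPSILON, 0.0, 'C'};
  pts[1] = (DemoPoint){1.0, 10.0, 'B'};
  pts[2] = (DemoPoint){1.0, 0.0, 'A'};
}
```

After make_abc and count_unique_adjacent with a tolerance of 1e-12 (an arbitrary value for this sketch), the array is ordered A, B, C and three unique points are reported, although same_within considers A and C to be the same point: B sits between them, so the adjacent comparison never sees the pair.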