ICLDisco / parsec

PaRSEC is a generic framework for architecture-aware scheduling and management of micro-tasks on distributed, GPU-accelerated, many-core heterogeneous architectures. PaRSEC assigns computation threads to cores and GPU accelerators, overlaps communication and computation, and uses a dynamic, fully distributed scheduler based on architectural features such as NUMA nodes and algorithmic features such as data reuse.

Termination detection fault with dtd #634

Open abouteiller opened 9 months ago

abouteiller commented 9 months ago

Describe the bug

Seen only once on #321; need to check whether it also happens on master.

To Reproduce

29302 Command: "/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec" "-n" "4" "dsl/did/dtd_test_task_insertion"
...
 29329 dtd_test_task_insertion: /home/bouteill/parsec/dplasma/parsec/parsec/mca/termdet/local/termdet_local_module.c:114: parsec_termdet_local_termination_detected: Assertion `tp->tdm.monitor == PARSEC_TERMDET_LOCAL_TERMINATED' failed.
 29330 [leconte:4113702] *** Process received signal ***
 29340 [leconte:4113702] [ 7] /home/bouteill/parsec/dplasma/build.cuda/parsec/parsec/libparsec.so.4(+0xb042f)[0x7fa93d0fc42f]
 29341 [leconte:4113702] [ 8] /home/bouteill/parsec/dplasma/build.cuda/parsec/parsec/libparsec.so.4(parsec_release_dtd_task_to_mempool+0x32)[0x7fa93d0ce596]
 29342 [leconte:4113702] [ 9] /home/bouteill/parsec/dplasma/build.cuda/parsec/parsec/libparsec.so.4(__parsec_complete_execution+0xc6)[0x7fa93d0b7f70]
 29343 [leconte:4113702] [10] /home/bouteill/parsec/dplasma/build.cuda/parsec/parsec/libparsec.so.4(__parsec_task_progress+0x12e)[0x7fa93d0b80ca]
 29344 [leconte:4113702] [11] /home/bouteill/parsec/dplasma/build.cuda/parsec/parsec/libparsec.so.4(__parsec_context_wait+0x2ee)[0x7fa93d0b8c0a]
 29345 [leconte:4113702] [12] /home/bouteill/parsec/dplasma/build.cuda/parsec/parsec/libparsec.so.4(+0x49343)[0x7fa93d095343]
abouteiller commented 9 months ago

This is somewhat rare, but it can be reproduced reliably with ctest --repeat until-fail:100
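
For anyone who wants to chase this outside of ctest, here is a minimal sketch of the same repeat-until-fail idea as a plain shell helper. The function name repeat_until_fail is hypothetical (it is not part of PaRSEC or CMake), and the -R test-name filter in the comments is an assumption about how the test is registered; the mpiexec invocation is copied from the log above.

```shell
# Run a command repeatedly until it fails or max_runs is reached.
# Prints the number of the first failing run and returns 1;
# returns 0 if every run passed.
# NOTE: repeat_until_fail is a hypothetical helper, not a PaRSEC or CMake tool.
repeat_until_fail() {
    max_runs=$1
    shift
    run=1
    while [ "$run" -le "$max_runs" ]; do
        if ! "$@"; then
            echo "failed on run $run"
            return 1
        fi
        run=$((run + 1))
    done
    echo "no failure in $max_runs runs"
    return 0
}

# Assumed usage from the build directory (the -R filter is a guess at the test name):
#   ctest --repeat until-fail:100 -R dtd_test_task_insertion --output-on-failure
# Or, bypassing ctest, with the command line from the log:
#   repeat_until_fail 100 mpiexec -n 4 dsl/did/dtd_test_task_insertion
```

Running the binary directly this way keeps the full stderr of the failing run, which may be easier to read than the ctest log.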

Another variant

dtd_test_task_insertion: /home/bouteill/parsec/dplasma/parsec/parsec/mca/termdet/local/termdet_local_module.c:106: parsec_termdet_local_taskpool_state: Assertion `0' failed.
abouteiller commented 9 months ago

I can reproduce it on master as well.