chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

[Bug]: Chapel runtime hangs on cleanup with ofi+efa #24972

Open jabraham17 opened 2 months ago

jabraham17 commented 2 months ago

The Chapel runtime hangs during cleanup for some programs when using CHPL_COMM=ofi with the EFA provider.

To reproduce, run start_test test/users/bachman/Beta_Diversity/main.chpl. The runtime hangs on fi_close in fini_ofi.

This was previously resolved by https://github.com/chapel-lang/chapel/pull/24232, but that change ended up breaking more than it fixed and is being reverted in https://github.com/chapel-lang/chapel/pull/24969

chplenv

CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: gnu
CHPL_TARGET_ARCH: aarch64
CHPL_TARGET_CPU: unknown
CHPL_LOCALE_MODEL: flat
CHPL_COMM: ofi *
  CHPL_LIBFABRIC: system *
CHPL_TASKS: qthreads
CHPL_LAUNCHER: slurm-srun *
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
  CHPL_NETWORK_ATOMICS: ofi
CHPL_GMP: none *
CHPL_HWLOC: bundled
CHPL_RE2: none *
CHPL_LLVM: unset
CHPL_AUX_FILESYS: none

jhh67 commented 2 months ago

Out of curiosity, can you try it with

CHPL_RT_COMM_OFI_INJECT_AM=false
CHPL_RT_COMM_OFI_INJECT_AMO=false
CHPL_RT_COMM_OFI_INJECT_RMA=false

jabraham17 commented 2 months ago

Setting those three environment variables makes the test exit cleanly, with no more hangs.
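For anyone wanting to try the same workaround on a single run, here is a minimal sketch. The three variable names come from the comment above; `run_without_inject` is a hypothetical helper name, not something Chapel provides, and it sets the variables only for the command it wraps.

```shell
#!/bin/sh
# Hypothetical helper (not part of Chapel): run one command with the three
# injection settings from this thread set to false, leaving the calling
# shell's environment untouched.
run_without_inject() {
    env \
        CHPL_RT_COMM_OFI_INJECT_AM=false \
        CHPL_RT_COMM_OFI_INJECT_AMO=false \
        CHPL_RT_COMM_OFI_INJECT_RMA=false \
        "$@"
}

# Example use, assuming start_test is on PATH and the current directory
# is the Chapel source tree:
#   run_without_inject start_test test/users/bachman/Beta_Diversity/main.chpl
```

Wrapping the command with `env` keeps the override scoped to that one invocation, which makes it easy to compare a run with and without the settings.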

jabraham17 commented 2 months ago

In an offline discussion with @jhh67, we concluded that this is a known issue with our runtime. To work around it, https://github.com/chapel-lang/chapel/pull/24980 will change these three environment variables to false until a proper fix is put in. I will leave this open until then.

bradcray commented 2 months ago

Is there any downside in using the env. vars? (e.g., performance hit or the like?) And is the idea to use these just for efa?

jabraham17 commented 2 months ago

> Is there any downside in using the env. vars? (e.g., performance hit or the like?)

Quoting @jhh67, there may be a performance impact on small, non-blocking operations. But we both agreed that was better than a hang or a correctness issue.

The PR I opened changes these env vars unconditionally. We have primarily seen this issue with EFA, but the problem could happen with other providers. Some OFI providers may require "manual progression" for non-blocking operations, which our runtime does not do, and that causes the hang.
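For comparison, the EFA-only alternative raised in the question could be sketched as below. This is not what the PR does. It assumes the provider is selected through libfabric's `FI_PROVIDER` environment variable, and `disable_inject_for_efa` is a hypothetical helper name. As noted above, other providers could hit the same hang, and this sketch would not cover them.

```shell
#!/bin/sh
# Hypothetical helper (not part of Chapel): disable injection only when the
# selected provider is EFA.
# Assumption: the provider is chosen via libfabric's FI_PROVIDER variable.
disable_inject_for_efa() {
    case "${FI_PROVIDER:-}" in
        efa*)
            export CHPL_RT_COMM_OFI_INJECT_AM=false
            export CHPL_RT_COMM_OFI_INJECT_AMO=false
            export CHPL_RT_COMM_OFI_INJECT_RMA=false
            ;;
        *)
            # Other or unset provider: leave the runtime defaults alone.
            :
            ;;
    esac
}
```

Call `disable_inject_for_efa` in the job script before launching the program; with any other provider it is a no-op.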