Open jabraham17 opened 2 months ago
Out of curiosity, can you try it with
CHPL_RT_COMM_OFI_INJECT_AM=false
CHPL_RT_COMM_OFI_INJECT_AMO=false
CHPL_RT_COMM_OFI_INJECT_RMA=false
Setting those 3 environment variables causes the test to exit cleanly, with no more hangs.
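For anyone wanting to try the same workaround, a minimal shell sketch follows. It assumes a bash-like shell, that the Chapel environment has already been set up so `start_test` is on `PATH`, and that it is run from the root of the Chapel source tree. The test path is the one given in the issue description.

```shell
# Turn off the three settings suggested above for the ofi comm layer.
export CHPL_RT_COMM_OFI_INJECT_AM=false
export CHPL_RT_COMM_OFI_INJECT_AMO=false
export CHPL_RT_COMM_OFI_INJECT_RMA=false

# Re-run the test from the issue description, if the test driver is available.
# (Assumption: start_test is on PATH and the current directory is the Chapel source root.)
if command -v start_test >/dev/null 2>&1; then
  start_test test/users/bachman/Beta_Diversity/main.chpl
else
  echo "start_test not found on PATH; set up the Chapel environment first" >&2
fi
```

The exports only affect the current shell session and the processes it launches, so opening a new shell restores the default behavior.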
In an offline discussion with @jhh67, we concluded that this is a known issue with our runtime. To work around this, https://github.com/chapel-lang/chapel/pull/24980 will change these 3 environment variables to false until a proper fix is put in. I will leave this open until then.
Is there any downside in using the env. vars? (e.g., performance hit or the like?) And is the idea to use these just for efa?
> Is there any downside in using the env. vars? (e.g., performance hit or the like?)
Quoting from @jhh67, there may be a performance impact on small, non-blocking operations. But we both agreed that was better than a hang or correctness issues.
The PR I opened unconditionally changes these env vars. I think we have primarily seen this issue with EFA, but the problem could happen with other providers. Some OFI providers may require "manual progression" for non-blocking operations, which our runtime doesn't currently do, causing the hang.
The Chapel runtime hangs on some codes when using COMM=ofi with the EFA provider.
To reproduce, run
start_test test/users/bachman/Beta_Diversity/main.chpl
The runtime hangs on fi_close in fini_ofi.
This was previously resolved by https://github.com/chapel-lang/chapel/pull/24232, but that change ends up breaking more than it fixes and is being reverted in https://github.com/chapel-lang/chapel/pull/24969.