In function aws-cdi-sdk/src/cdi/adapter_efa.c : EfaNetworkAdapterInitialize(), shared memory usage is DISABLED at line 1123, with a warning that using shared memory results in rxr_check_cma_capability() being called, causing a fork() and flushing of open files.
However, an indirect call to rxr_check_cma_capability() already occurs earlier, at line 1048 (as a side effect of fi_getinfo()), so the fork() happens anyway.
This causes our application to hang on the wait() after the fork(), probably because we have open file descriptors.
Setting environment variable FI_EFA_ENABLE_SHM_TRANSFER=0 before running our application fixes the issue, which I think confirms the above.
Comparing with the code for CDI 2, there is no forking call before shared memory usage is disabled, so the issue doesn’t occur in that version.
Also, the cdi_test app from CDI 3.0 doesn’t hang, probably because it is relatively simple, hence the fork() completes normally (I traced the code).
In short, my understanding is that shared memory usage should be disabled earlier in EfaNetworkAdapterInitialize().
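To illustrate the ordering, here is a minimal sketch of the in-process equivalent of the environment-variable workaround. It assumes libfabric reads FI_EFA_ENABLE_SHM_TRANSFER when the EFA provider is first initialized, so the variable must be set before the first fi_getinfo() call. The helper names DisableEfaShmTransfer() and IsEfaShmTransferDisabled() are hypothetical and not part of the CDI SDK, and this is not necessarily how the actual fix is implemented.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* exposes setenv() under strict -std=c99/c11 */
#endif
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of the CDI SDK): set the variable that the
 * workaround above sets from the shell, overwriting any existing value.
 * Assumption: this runs before the first libfabric call in the process.
 * Returns 0 on success, -1 on failure. */
static int DisableEfaShmTransfer(void)
{
    if (setenv("FI_EFA_ENABLE_SHM_TRANSFER", "0", 1) != 0) {
        perror("setenv");
        return -1;
    }
    return 0;
}

/* Hypothetical helper: returns 1 if the variable is currently set to "0",
 * 0 otherwise. Useful as a sanity check just before calling fi_getinfo(). */
static int IsEfaShmTransferDisabled(void)
{
    const char* value = getenv("FI_EFA_ENABLE_SHM_TRANSFER");
    return (value != NULL && strcmp(value, "0") == 0) ? 1 : 0;
}
```

The idea would be to call DisableEfaShmTransfer() at the top of EfaNetworkAdapterInitialize(), ahead of the fi_getinfo() call at line 1048, rather than relying on the later disable at line 1123.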
Thank you for providing details and the specific problem and locations in the code. Changes to resolve this issue have been pushed to the developer_preview branch here.