aws / aws-cdi-sdk

AWS Cloud Digital Interface (CDI) SDK. Documentation at: https://aws.github.io/aws-cdi-sdk/mainline/index.html
BSD 2-Clause "Simplified" License

Hang during sdk efa initialization (sdk 3.0.0) #104

stefd closed 1 year ago

stefd commented 1 year ago

In function aws-cdi-sdk/src/cdi/adapter_efa.c : EfaNetworkAdapterInitialize(), shared memory usage is DISABLED at line 1123, with a warning that using shared memory results in rxr_check_cma_capability() being called, causing a fork() and flushing of open files.

However, an indirect call to rxr_check_cma_capability() occurs earlier, at line 1048 (a side effect of fi_getinfo()), so the fork() occurs anyway.

This causes our application to hang on the wait() after the fork(), probably because we have open file descriptors.

Setting environment variable FI_EFA_ENABLE_SHM_TRANSFER=0 before running our application fixes the issue, which I think confirms the above.
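For applications that cannot control their launch environment, the same workaround can be sketched from inside the process, before any SDK initialization runs. This is a minimal sketch: the helper name disable_efa_shm_transfer() is hypothetical, and it assumes the provider reads the variable from the process environment the first time it is queried.

```c
#include <stdlib.h>

/* Hypothetical helper, not part of the CDI SDK: sets the variable named in
 * this report so that anything started later in this process sees it.
 * Returns 0 on success, -1 on failure (same convention as setenv()). */
static int disable_efa_shm_transfer(void)
{
    /* Third argument 1 = overwrite any value already in the environment. */
    return setenv("FI_EFA_ENABLE_SHM_TRANSFER", "0", 1);
}
```

Calling the helper at the very top of main(), before the SDK is initialized and before other threads are started, should have the same effect as exporting the variable in the shell.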

In the CDI 2 code, no forking call occurs before shared memory usage is disabled, so the issue doesn’t occur in that version. The cdi_test app from CDI 3.0 also doesn’t hang, probably because it is relatively simple, so the fork() completes normally (I traced the code).

In short, my understanding is that shared memory usage should be disabled earlier in EfaNetworkAdapterInitialize().
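The ordering problem can be illustrated with a toy model. Nothing here calls libfabric or the SDK: provider_query_would_fork() is a hypothetical stand-in for the fi_getinfo() call, and it assumes shared memory transfer counts as enabled unless the variable is set to "0".

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the provider query. It models only the side
 * effect described in this report: the capability check (and its fork())
 * runs unless the variable is already "0" (assumed default: enabled). */
static bool provider_query_would_fork(void)
{
    const char *value = getenv("FI_EFA_ENABLE_SHM_TRANSFER");
    return !(value != NULL && strcmp(value, "0") == 0);
}

/* Order described in the report: query first, disable afterwards. */
static bool init_query_then_disable(void)
{
    bool forked = provider_query_would_fork();
    setenv("FI_EFA_ENABLE_SHM_TRANSFER", "0", 1);
    return forked;
}

/* Proposed order: disable first, then query. */
static bool init_disable_then_query(void)
{
    setenv("FI_EFA_ENABLE_SHM_TRANSFER", "0", 1);
    return provider_query_would_fork();
}
```

In this model, with the variable initially unset, only the second ordering avoids the fork, which is the change suggested for EfaNetworkAdapterInitialize().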

mhhen commented 1 year ago

Thank you for providing details and the specific problem and locations in the code. Changes to resolve this issue have been pushed to the developer_preview branch here.