erpc-io / eRPC

Efficient RPCs for datacenter networks

Nexus segfault when there are offline CPUs #31

Open yilongli opened 5 years ago

yilongli commented 5 years ago

I got a segfault when running the create_session_test at the following line: https://github.com/erpc-io/eRPC/blob/dff45896f7bb6bf43f38a8d7a7034c6ff79791f7/src/nexus_impl/nexus.cc#L63

The problem is that sm_thread_lcore_index is set to the last lcore at line 61 without checking whether that lcore is online, while get_lcores_for_numa_node returns only online lcores.
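One possible direction for a fix, sketched under assumptions: build the list of online lcores first, then pick the session management lcore from that list. The helper names `parse_cpu_list` and `pick_sm_thread_lcore` below are hypothetical and not part of eRPC. The input format is the Linux CPU list syntax used by `/sys/devices/system/cpu/online` (for example `0-3,8-11`). The sketch assumes C++17 and does not validate malformed input.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not eRPC code): expand a Linux CPU list string such as
// "0-3,8-11" into individual CPU ids, in the order they are listed.
std::vector<size_t> parse_cpu_list(const std::string &list) {
  std::vector<size_t> cpus;
  std::stringstream ss(list);
  std::string range;
  while (std::getline(ss, range, ',')) {
    // Drop trailing whitespace, e.g. the newline at the end of a sysfs file
    while (!range.empty() &&
           std::isspace(static_cast<unsigned char>(range.back()))) {
      range.pop_back();
    }
    if (range.empty()) continue;

    // A range is either a single id ("5") or an inclusive span ("0-3")
    const size_t dash = range.find('-');
    const size_t first = std::stoul(range.substr(0, dash));
    const size_t last = (dash == std::string::npos)
                            ? first
                            : std::stoul(range.substr(dash + 1));
    for (size_t cpu = first; cpu <= last; cpu++) cpus.push_back(cpu);
  }
  return cpus;
}

// Hypothetical helper (not eRPC code): choose the last *online* lcore for the
// session management thread, or nothing if no lcore is online.
std::optional<size_t> pick_sm_thread_lcore(
    const std::vector<size_t> &online_lcores) {
  if (online_lcores.empty()) return std::nullopt;
  return online_lcores.back();
}
```

With a 16-lcore machine where lcores 8-15 are offline, `parse_cpu_list("0-7")` yields lcores 0 through 7, so `pick_sm_thread_lcore` selects lcore 7 instead of the offline lcore 15.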

yilongli commented 5 years ago

BTW, the code in numautils.h seems to assume that there is an equal number of (online) lcores in each numa node, which is quite fragile.

anujkaliaiitd commented 5 years ago

Hi, Yilong. The approach suggested in this issue would be nice to have in eRPC. Machines with offline CPUs are uncommon IMO, so this is a low-priority task for us. We would welcome a patch.

As a temporary workaround, you might hard-code the core for the session management thread. Or, you might delete the core pinning for this thread altogether. The session management thread has near-zero CPU use when sessions aren't being actively created or destroyed, so my hope is that disabling core pinning won't affect performance.
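A minimal sketch of that workaround, assuming Linux and glibc: attempt to pin to a hard-coded core, and fall back to running unpinned if the request fails. `try_pin_current_thread` is a hypothetical helper, not eRPC's API. It uses `sched_setaffinity` with pid 0, which applies to the calling thread, so it would be called from inside the session management thread.

```cpp
#include <sched.h>  // sched_setaffinity, cpu_set_t, CPU_* macros (Linux-specific)

#include <cstddef>

// Hypothetical helper (not eRPC code): try to pin the calling thread to `core`.
// Returns false instead of crashing when the request cannot be honored (for
// example, the core is offline), so the caller can simply continue unpinned.
bool try_pin_current_thread(size_t core) {
  // A core id that does not fit in a cpu_set_t cannot be pinned to
  if (core >= static_cast<size_t>(CPU_SETSIZE)) return false;

  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(core, &cpuset);

  // pid 0 means "the calling thread"; a nonzero return indicates failure
  return sched_setaffinity(0, sizeof(cpuset), &cpuset) == 0;
}
```

The session management thread could call `try_pin_current_thread(0)` (or any core known to be online) at startup and ignore a `false` result, since the thread is mostly idle and pinning is an optimization rather than a requirement.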

yilongli commented 5 years ago

I had hyperthreading turned off, so half of the CPUs were offline. I agree that machines with offline CPUs are rare in production, but disabling hyperthreading is quite convenient for running experiments. Anyway, I might submit a patch if this becomes more problematic for me. Thanks.

anujkaliaiitd commented 5 years ago

Ah - I didn't think of the HT-disabled case. That's a scenario that we would like to support.