ROCm / rocm_smi_lib

ROCm SMI LIB
https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/
MIT License

Initialization sometimes fails on multi-GPU nodes due to race condition #88

Closed: jglaser closed this issue 11 months ago

jglaser commented 2 years ago

When using PyTorch with the NCCL/RCCL backend on a system with eight GPUs per node, I get initialization failures of the following kind:

347: pthread_mutex_timedlock() returned 110
 347: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 348: pthread_mutex_timedlock() returned 110
 348: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 757: pthread_mutex_timedlock() returned 110
 757: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 350: pthread_mutex_timedlock() returned 110
 350: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 753: pthread_mutex_timedlock() returned 110
 753: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 351: pthread_mutex_timedlock() returned 110
 351: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 756: pthread_mutex_timedlock() returned 110
 756: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 758: pthread_mutex_timedlock() returned 110
 758: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
1050: pthread_mutex_timedlock() returned 110
1050: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
 347: rsmi_init() failed
1052: pthread_mutex_timedlock() returned 110

The reason is that rocm_smi_lib creates a mutex in /dev/shm whose name is independent of the process id, so concurrent processes race on creating and initializing it.
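
For anyone unfamiliar with the failure mode: below is a minimal, self-contained sketch of the pattern described above, with a made-up shared-memory name. It is not rocm_smi_lib's actual implementation, only an illustration of how a fixed-name, process-shared mutex in /dev/shm can leave concurrent or crashed users stuck until the 5-second timedlock expires.

```cpp
// Minimal sketch (not rocm_smi_lib's actual code) of a process-shared mutex
// in /dev/shm under a fixed name. Every process opens the same file, so two
// processes initializing it concurrently, or one dying while holding the
// lock, can leave later callers waiting until pthread_mutex_timedlock()
// fails with ETIMEDOUT (110).
#include <cerrno>
#include <cstdio>
#include <ctime>
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    // Fixed name, independent of the pid: all processes share one mutex.
    const char* name = "/rsmi_example_mutex";  // made-up name for illustration
    int fd = shm_open(name, O_CREAT | O_RDWR, 0666);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, sizeof(pthread_mutex_t)) != 0) { perror("ftruncate"); return 1; }

    void* mem = mmap(nullptr, sizeof(pthread_mutex_t),
                     PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }
    pthread_mutex_t* mtx = static_cast<pthread_mutex_t*>(mem);

    // Whoever gets here first initializes the mutex; a second process doing
    // the same thing concurrently re-initializes it under the first one's feet.
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(mtx, &attr);
    pthread_mutexattr_destroy(&attr);

    // Wait at most 5 seconds, mirroring the timeout in the error message.
    timespec deadline{};
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 5;
    int rc = pthread_mutex_timedlock(mtx, &deadline);
    if (rc == ETIMEDOUT)            // rc == 110 on Linux
        fprintf(stderr, "pthread_mutex_timedlock() returned %d\n", rc);
    else if (rc == 0)
        pthread_mutex_unlock(mtx);

    munmap(mem, sizeof(pthread_mutex_t));
    close(fd);
    return rc;
}
```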

hkasivis commented 2 years ago

Jens Glaser, thanks for the fix. We will pull this in shortly.

bill-shuzhou-liu commented 2 years ago

Some functions in rocm_smi_lib only allow one process to access them at a time; rocm_smi_lib uses an inter-process mutex to protect them. If another process is already using such a function, the calling process invokes pthread_mutex_timedlock() and waits up to 5 seconds.

Error code 110 means timeout (ETIMEDOUT). If the other process has not released the mutex after 5 seconds, rsmi_init() fails with the errors above.
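
One caller-side mitigation, purely as a sketch (the helper name and retry policy below are my own, not part of the library), is to retry rsmi_init() a few times with a short backoff so transient contention on the shared mutex does not abort the whole job:

```cpp
// Hedged workaround sketch: retry rsmi_init() with a short backoff.
// This is caller-side mitigation, not a fix inside the library.
#include <rocm_smi/rocm_smi.h>
#include <chrono>
#include <cstdio>
#include <thread>

rsmi_status_t init_with_retry(int attempts = 3) {
    rsmi_status_t st = RSMI_STATUS_INIT_ERROR;
    for (int i = 0; i < attempts; ++i) {
        st = rsmi_init(0);                       // 0 = default init flags
        if (st == RSMI_STATUS_SUCCESS) return st;
        std::fprintf(stderr, "rsmi_init attempt %d failed (status %d), retrying\n",
                     i + 1, static_cast<int>(st));
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return st;   // still failing after all attempts
}
```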

Is the number in the error message a process id (e.g. 347 in the example below)? How many processes are trying to call rsmi_init(0) at the same time, and how many of them succeed? If you attach gdb to the successful processes, do they block in some rocm_smi_lib function?

347: pthread_mutex_timedlock() returned 110

Thanks.

jglaser commented 2 years ago

Bill, could you specify which function requires the mutex?

Yes, the number in front of the ":" is the global process rank. Eight processes per node are calling rsmi_init() at the same time. Typically about 80% of the nodes make it through, but beyond ~16 nodes there is always a high probability that at least one node fails.

I'll have to attach the debugger and will let you know.
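
A possible workaround at the application level, sketched here under the assumption of a Slurm launcher that exposes SLURM_LOCALID (other launchers use different variables, e.g. OMPI_COMM_WORLD_LOCAL_RANK), is to stagger the per-node rsmi_init() calls so the eight local ranks do not hit the shared mutex all at once. The helper below is illustrative, not a tested fix:

```cpp
// Hedged mitigation sketch: stagger rsmi_init() across the processes on a
// node so they do not all contend for the shared mutex at the same moment.
#include <rocm_smi/rocm_smi.h>
#include <chrono>
#include <cstdlib>
#include <thread>

rsmi_status_t staggered_rsmi_init() {
    int local_rank = 0;
    // SLURM_LOCALID is the node-local rank under Slurm; adjust for your launcher.
    if (const char* s = std::getenv("SLURM_LOCALID"))
        local_rank = std::atoi(s);
    // ~100 ms per local rank keeps eight ranks well under the 5-second
    // mutex timeout while spreading out the critical window.
    std::this_thread::sleep_for(std::chrono::milliseconds(100 * local_rank));
    return rsmi_init(0);
}
```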

bill-shuzhou-liu commented 2 years ago

A lot of rocm_smi functions require the mutex; you can find most of them in the unit tests.

I tried to reproduce this issue on my machine (I only have 1 GPU) with 1000 processes, but had no luck. When you say "beyond ~16 nodes there is always a high probability", do you mean you have 16 computers, each with 8 GPUs? Thanks.

aferoz21 commented 1 year ago

I am also encountering the same issue. The line number points to static rsmi_status_t status = rsmi_init(0); and I am using ROCm 5.2.3. Is there any fix or workaround in place? It happens only in CI; I could not reproduce it on my machine yet.

I use the below 4 ROCm APIs for monitoring; when I add the 4th one, it starts giving me this error.

1) auto status = rsmi_dev_temp_metric_get(m_smiDeviceIndex, sensorType, metric, &newValue); => OK
2) auto status = rsmi_dev_gpu_clk_freq_get(m_smiDeviceIndex, m_clockMetrics[i], &freq); => OK
3) auto status = rsmi_dev_fan_rpms_get(m_smiDeviceIndex, m_fanMetrics[i], &newValue); => OK
4) auto status = rsmi_dev_gpu_metrics_info_get(m_smiDeviceIndex, &gpuMetrics); => NOT OK

Below is the error message:

pthread_mutex_timedlock() returned 131
Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
terminate called after throwing an instance of 'std::runtime_error'
  what():  Error 8(An error occurred during initialization, during monitor discovery or when when initializing internal data structures) /tmp/.tensile-tox/py3/lib/python3.8/site-packages/Tensile/Source/client/source/HardwareMonitor.cpp:144:
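
Error 8 here appears to be RSMI_STATUS_INIT_ERROR (its description matches the quoted message). Since the message suggests a previous run did not shut down cleanly, one hedged mitigation for code you control (a sketch only; the class name is made up, and Tensile's HardwareMonitor is outside your control) is an RAII guard that pairs rsmi_init() with rsmi_shut_down(), so the shared state is released even on exceptional exits:

```cpp
// Hedged sketch: an RAII guard pairing rsmi_init() with rsmi_shut_down(),
// reducing the chance of the "previous execution may not have shutdown
// cleanly" condition when the monitoring code exits via an exception.
#include <rocm_smi/rocm_smi.h>
#include <stdexcept>

class RsmiSession {
 public:
    RsmiSession() {
        if (rsmi_init(0) != RSMI_STATUS_SUCCESS)
            throw std::runtime_error("rsmi_init(0) failed");
    }
    ~RsmiSession() { rsmi_shut_down(); }   // always release shared state
    RsmiSession(const RsmiSession&) = delete;
    RsmiSession& operator=(const RsmiSession&) = delete;
};

// Usage: construct one RsmiSession per process before any rsmi_dev_* calls,
// e.g. as a function-local static or a member of the monitor object.
```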

jglaser commented 1 year ago

A lot of rocm_smi functions require the mutex; you can find most of them in the unit tests.

I tried to reproduce this issue on my machine (I only have 1 GPU) with 1000 processes, but had no luck. When you say "beyond ~16 nodes there is always a high probability", do you mean you have 16 computers, each with 8 GPUs? Thanks.

Yes... see here: https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html

dmitrii-galantsev commented 11 months ago

76b5528 and 160c99d should address it.