Closed jglaser closed 11 months ago
Jens Glaser thanks for the fix. We will pull this in shortly.
Some functions in rocm_smi_lib allow only one process access at a time; rocm_smi_lib protects them with an inter-process mutex. If another process is using such a function, the calling process uses pthread_mutex_timedlock() to wait up to 5 seconds.
Error code 110 means timeout (ETIMEDOUT). If the other process has not released the mutex after 5 seconds, rsmi_init() fails with the errors above.
Is the number in the error message a process id (i.e. 347 in the example below)? How many processes are trying to call rsmi_init(0) at the same time, and how many of them succeed? If you attach gdb to a successful process, is it blocked in some rocm_smi_lib function? 347: pthread_mutex_timedlock() returned 110
Thanks.
Bill, could you specify which function requires the mutex?
Yes, the number in front of the ":" is the global process rank. Eight processes per node are calling rsmi_init()
at the same time. Typically 80% of nodes make it through, but beyond ~16 nodes there is always a high probability that at least one node fails.
I'll have to attach the debugger and will let you know.
A lot of rocm_smi functions require the mutex. You can find most of them in the unit tests.
I tried to reproduce this issue on my machine (I only have 1 GPU) with 1000 processes and had no luck. You said "beyond ~16 nodes there is always a high probability". Do you mean you have 16 machines, each with 8 GPUs? Thanks.
I am also encountering the same issue. The line number points to static rsmi_status_t status = rsmi_init(0); and I am using ROCm 5.2.3. Is there any fix or workaround in place? It happens only in CI; I could not reproduce it on my machine yet.
I use the below 4 ROCm APIs for monitoring; when I add the 4th one, it starts giving me this error.
1) auto status = rsmi_dev_temp_metric_get(m_smiDeviceIndex, sensorType, metric, &newValue); => OK
2) auto status = rsmi_dev_gpu_clk_freq_get(m_smiDeviceIndex, m_clockMetrics[i], &freq); => OK
3) auto status = rsmi_dev_fan_rpms_get(m_smiDeviceIndex, m_fanMetrics[i], &newValue); => OK
4) auto status = rsmi_dev_gpu_metrics_info_get(m_smiDeviceIndex, &gpuMetrics); => NOT OK
Below is the error message:
pthread_mutex_timedlock() returned 131
Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
terminate called after throwing an instance of 'std::runtime_error'
what(): Error 8(An error occurred during initialization, during monitor discovery or when when initializing internal data structures)
/tmp/.tensile-tox/py3/lib/python3.8/site-packages/Tensile/Source/client/source/HardwareMonitor.cpp:144:
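For the workaround the message itself suggests, something like the following can be used (the exact file names under /dev/shm may vary by ROCm version, and the delete must only happen after confirming nothing that uses rocm_smi is still running):

```shell
# List any leftover rocm_smi shared-memory files from an unclean shutdown
ls -l /dev/shm/rocm_smi* 2>/dev/null || echo "no stale rocm_smi files"

# Only after verifying no rocm_smi clients are still running:
rm -f /dev/shm/rocm_smi*
```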
Yes, see here: https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html
76b5528 and 160c99d should address it.
When using pytorch with the NCCL/RCCL backend on a system with eight GPUs/node, I get initialization failures of the following kind:
The reason is that rocm_smi_lib creates a mutex in /dev/shm whose name is independent of the process id, which creates a race condition.