Error message:
W NPUStream.cpp:409] Warning: NPU warning, error code is 507046[Error]:
[Error]: In the specified timeout waiting event, all tasks in the specified stream are not completed.
Rectify the fault based on the error information in the ascend log.
EE1002: 2024-09-03-16:50:00.041.231 Stream synchronize timeout. rtDeviceSynchronize execute failed, reason=[stream sync timeout]
Possible Cause: 1. The timeout interval may be improperly set.
Solution: 1. Check whether the timeout interval is properly set. 2. Check whether the network is normal.
TraceBack (most recent call last):
wait for compute device to finish failed, runtime result = 507046.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
(function npuSynchronizeUsedDevices)
The training data size is 1.2M, and training runs on 16 NPU cards. I have already set export HCCL_EXEC_TIMEOUT=17340.
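One thing that may be worth ruling out is whether the exported value actually reaches each of the 16 training processes, since an export in one shell is not necessarily inherited by every worker. Below is a minimal sketch for that check, not a fix for the timeout itself. It assumes the launcher sets a RANK environment variable (as torchrun does); describe_timeout is a hypothetical helper name introduced here for illustration.

```python
import os


def describe_timeout(env=None, name="HCCL_EXEC_TIMEOUT"):
    """Return a one-line report of the timeout setting visible to this process."""
    env = os.environ if env is None else env
    # Assumption: the launcher exports RANK for each worker; "?" if it does not.
    rank = env.get("RANK", "?")
    raw = env.get(name)
    if raw is None:
        return f"rank {rank}: {name} is not set"
    try:
        value = int(raw)
    except ValueError:
        return f"rank {rank}: {name}={raw!r} is not an integer"
    return f"rank {rank}: {name}={value}"


if __name__ == "__main__":
    # Call this near the start of the training script, once per process.
    print(describe_timeout())
```

If any rank reports the variable as unset or with a different value, the timeout in the log above may still be the default on that process.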