liu21yd opened 5 months ago
Could you try adding --use_custom_all_reduce disable when you build the engine?
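For reference, a minimal sketch of where that flag goes. The paths and the tensor-parallel size are placeholders, not from the thread; in the 0.9.x trtllm-build CLI the option took the values enable/disable:

```shell
# Sketch (assumed placeholder paths): build the engine with the
# custom all-reduce plugin explicitly disabled, so multi-GPU runs
# fall back to NCCL all-reduce instead of peer-access kernels.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint_tp2 \
    --output_dir ./engine_tp2 \
    --tp_size 2 \
    --use_custom_all_reduce disable
```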
> Could you try adding --use_custom_all_reduce disable when you build the engine?
I faced the same error, and this solution works!
The issue is caused by the network topology. If your topology does not support peer access between GPUs, then custom_all_reduce is not supported and needs to be disabled.
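One way to check whether your GPUs can use peer access (not shown in the thread, but a standard diagnostic) is to print the interconnect topology matrix:

```shell
# Show how each GPU pair is connected. NV# entries are NVLink and
# PIX/PXB are PCIe links that typically support peer access; links
# marked PHB, NODE, or SYS cross a host bridge or CPU socket, where
# peer access is often unavailable.
nvidia-smi topo -m
```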
> Could you try adding --use_custom_all_reduce disable when you build the engine?
It works, thank you very much.
I built TensorRT-LLM 0.9.0 from source based on nvcr.io/nvidia/tritonserver:24.02-py3, running the scripts and commands from https://github.com/NVIDIA/TensorRT-LLM/blob/main/docker/Dockerfile.multi.
I converted the checkpoint and built the engine successfully with the following command:
When I started the Triton server, I encountered an error:
The NCCL INFO log is:
Who can help me?