v-dicicco opened this issue 2 months ago
Could you try disabling use_custom_all_reduce when building the engine?
Disabling use_custom_all_reduce solved the issue, many thanks!
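For reference, disabling it at build time looks roughly like this (a sketch only: the checkpoint/engine paths are placeholders, and the flag spelling should be double-checked against your TensorRT-LLM version):

```
trtllm-build --checkpoint_dir ./mixtral_8x7b_tp4_ckpt \
             --output_dir ./mixtral_8x7b_tp4_engine \
             --use_custom_all_reduce disable
```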
Does this mean there is a bug in the custom all_reduce plugin, and if so, do you think it will be fixed? Also, the flag's description says that enabling it should help reduce latency on NVLink setups (my scenario), but there isn't any benchmark... I'm trying to benchmark it myself, but it would be really helpful to have a rough idea of the expected impact when it is disabled. Are there additional details available somewhere?
@v-dicicco we are still not quite sure whether the bug is inside the kernel or whether we need to set something up to get it right.
> but there isn't any benchmark...

We have the all_reduce tests here.
System Info
Who can help?
@kaiyux @byshiue
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
If a process is using TensorRT-LLM to continuously run inference with Mixtral 8x7B with TP (tested with TP=2, 4, and 8), then as soon as another process uses the same GPU to run inference with ONNX (with the CUDA or TensorRT execution provider), TensorRT-LLM inference will hang. The same behavior happens if you do inference with Triton using tensorrtllm_backend.

Here are the steps to reproduce using code in the repo:
First, build a Mixtral 8x7B engine with TP (2, 4, or 8), using the v0.9.0 tag of TensorRT-LLM.

Then modify the examples/run.py script to continuously do inference, and run it with --run_profiling to simulate inference. Here is a gist with the script already modified, ready to download (the patch would be a bit long); it just wraps the warmup/generate of the --run_profiling code in an infinite loop, to simulate a process that continuously does inference.
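In case it helps, the modification is roughly the following (a minimal sketch, not the exact gist; runner, batch_input_ids, and sampling_kwargs are placeholders for whatever the --run_profiling path in run.py already builds):

```python
import time

def profile_forever(runner, batch_input_ids, sampling_kwargs):
    # Same warmup/generate call that --run_profiling already performs,
    # just repeated forever so the TP ranks are continuously busy in generate/all_reduce.
    iteration = 0
    while True:
        start = time.time()
        runner.generate(batch_input_ids, **sampling_kwargs)
        iteration += 1
        print(f"iteration {iteration}: {time.time() - start:.3f}s", flush=True)
```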
Finally, in another process, do inference with ONNX on one of the same GPUs. Download a model & convert it to ONNX:

```
huggingface-cli download distilbert/distilbert-base-uncased --local-dir model
python -m transformers.onnx -m ./model onnx
```

then run the inference script:

```
python3 run.py
```
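The ONNX-side script is essentially just a loop of ONNX Runtime inferences pinned to a GPU; a rough sketch of what it could look like (my reconstruction, not the exact script; the model path, input names, and shapes are assumptions based on the distilbert export above):

```python
import numpy as np
import onnxruntime as ort

# Pin ONNX Runtime to a GPU that TensorRT-LLM is also using (device_id below).
session = ort.InferenceSession(
    "onnx/model.onnx",
    providers=[("CUDAExecutionProvider", {"device_id": 0})],
)

# Dummy distilbert-style inputs.
input_ids = np.random.randint(0, 30000, size=(1, 128), dtype=np.int64)
attention_mask = np.ones((1, 128), dtype=np.int64)

# Keep issuing inferences; TensorRT-LLM hangs shortly after this loop starts.
while True:
    session.run(None, {"input_ids": input_ids, "attention_mask": attention_mask})
```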
Be sure the script is actually using one of the GPUs that TensorRT-LLM is using to run Mixtral.
Expected behavior

Inference with TensorRT-LLM should not hang if other processes are using the same GPU.
Actual behavior
Inference with TensorRT-LLM hangs after the ONNX process starts its inference.
Additional notes
As mentioned above, the same behavior also happens when serving with Triton (using tensorrtllm_backend); here I'm using the run.py script just to (hopefully) help you reproduce the issue.