NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Mixtral with TP hangs indefinitely if another process uses the same GPU with ONNX #1601

Open v-dicicco opened 2 months ago

v-dicicco commented 2 months ago

System Info

Who can help?

@kaiyux @byshiue

Information

Tasks

Reproduction

If a process is continuously running inference with TensorRT-LLM using Mixtral 8x7B with TP (tested with TP=2, 4 and 8), then as soon as another process uses the same GPU to run inference with ONNX Runtime (with the CUDA or TensorRT provider), the TensorRT-LLM inference will hang. The same behavior happens when serving with Triton via tensorrtllm_backend.

Here are the steps to reproduce using code in the repo:

  1. Check out the v0.9.0 tag of TensorRT-LLM
  2. Modify the examples/run.py script to do inference continuously. Here is a gist with the already-modified script ready to download (the patch would be a bit long); it simply wraps the warmup/generate part of the --run_profiling code in an infinite loop to simulate a process that continuously does inference (see the sketch after these steps).
  3. Convert and build Mixtral 8x7B using TP=2 (I've used int8), then run it with --run_profiling to simulate continuous inference:
    mpirun -n 2 python3 run.py --max_output_len 100 --tokenizer_dir <path_to_tokenizer> --engine_dir <path_to_trt_mixtral>  --run_profiling
  4. The script will keep iterating the profiling loop. Please ignore the actual numbers; this is just to simulate continuous inference:
    batch_size: 1, avg latency of 1 iterations: : 1.442868947982788 sec
    batch_size: 1, avg latency of 1 iterations: : 1.4428672790527344 sec
    batch_size: 1, avg latency of 1 iterations: : 2.886024236679077 sec
    batch_size: 1, avg latency of 1 iterations: : 2.8860158920288086 sec
    batch_size: 1, avg latency of 1 iterations: : 4.32866644859314 sec
    batch_size: 1, avg latency of 1 iterations: : 4.328623056411743 sec
    batch_size: 1, avg latency of 1 iterations: : 5.7726891040802 sec
    batch_size: 1, avg latency of 1 iterations: : 5.772603273391724 sec
    batch_size: 1, avg latency of 1 iterations: : 7.215593338012695 sec
    batch_size: 1, avg latency of 1 iterations: : 7.215665578842163 sec
    batch_size: 1, avg latency of 1 iterations: : 8.658446550369263 sec
    batch_size: 1, avg latency of 1 iterations: : 8.6583890914917 sec
  5. Run another, external script that uses ONNX Runtime on the same GPU. Here is a quick setup:
    
    ### setup env:
    python3.10 -m venv venv
    source venv/bin/activate
    pip install transformers[onnx] torch
    pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/

    ### download a model & convert to ONNX:
    huggingface-cli download distilbert/distilbert-base-uncased --local-dir model
    python -m transformers.onnx -m ./model onnx

  6. Create the following `inference.py`:
```python
from transformers import AutoTokenizer
from onnxruntime import InferenceSession, SessionOptions

import torch

# Run on the same GPU that TensorRT-LLM is using
providers = [("CUDAExecutionProvider", {"device_id": torch.cuda.current_device()})]

sess_options = SessionOptions()
session = InferenceSession("onnx/model.onnx", sess_options=sess_options, providers=providers)

tokenizer = AutoTokenizer.from_pretrained("model")
inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np")

while True:
    outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
    print(outputs[0].shape)
```
  7. Run the script (`python3 inference.py`) and make sure it is actually using one of the GPUs that TensorRT-LLM is using to run Mixtral.
  8. As soon as the ONNX script actually starts doing inference, the TensorRT-LLM process will hang (you will see it stop printing). Even if you stop the ONNX process, TensorRT-LLM will not recover.
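For reference, the modification mentioned in step 2 boils down to something like the following sketch (`generate_once` is a hypothetical stand-in for the existing warmup/generate logic that `--run_profiling` executes; the actual gist differs in the details):

```python
import time

def generate_once():
    # Hypothetical stand-in for the warmup/generate logic that
    # examples/run.py runs under --run_profiling.
    ...

# Wrap the profiling body in an infinite loop so the process keeps
# issuing inference requests, simulating a long-running service.
iteration = 0
while True:
    start = time.time()
    generate_once()
    iteration += 1
    print(f"iteration {iteration}: {time.time() - start:.3f} sec")
```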

Expected behavior

Inference with TensorRT-LLM should not hang when other processes use the same GPU.

actual behavior

Inference with TensorRT-LLM hangs after the ONNX process starts its inference.

additional notes

byshiue commented 2 months ago

Could you try disabling use_custom_all_reduce when building the engine?
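For example (a rough sketch, assuming the v0.9.0 `trtllm-build` flag syntax; the paths and any other build options are placeholders for whatever you already use):

    trtllm-build --checkpoint_dir <path_to_converted_checkpoint> \
        --output_dir <path_to_trt_mixtral> \
        --use_custom_all_reduce disable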

v-dicicco commented 2 months ago

Disabling use_custom_all_reduce solved the issue, many thanks!

Does this mean there is a bug in the custom all_reduce plugin, and if so, do you think it will be fixed? Furthermore, the description of the flag says that enabling it should help reduce latency on NVLink setups (my scenario), but there isn't any benchmark. I'm trying to benchmark it myself, but it would be really helpful to have a rough idea of the expected impact when it is disabled. Are there additional details available somewhere?
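For context, the comparison I'm attempting is simply the profiling command from the reproduction above, run against two engines built from the same checkpoint, one with use_custom_all_reduce enabled and one with it disabled (the engine paths here are placeholders):

    mpirun -n 2 python3 run.py --max_output_len 100 --tokenizer_dir <path_to_tokenizer> --engine_dir <path_to_engine_custom_ar_enabled> --run_profiling
    mpirun -n 2 python3 run.py --max_output_len 100 --tokenizer_dir <path_to_tokenizer> --engine_dir <path_to_engine_custom_ar_disabled> --run_profiling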

PerkzZheng commented 2 months ago

@v-dicicco we are still not quite sure whether the bug is inside the kernel or whether we need to set something up to get it right.

> but there isn't any benchmark

we have the all_reduce tests here.