NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

issue with Device 0 peer access Device x is not available. #1487

Open geraldstanje opened 3 months ago

geraldstanje commented 3 months ago

System Info

gpu:

Mon Apr 22 17:00:40 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1B.0 Off |                    0 |
|  0%   17C    P8              15W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G                    On  | 00000000:00:1C.0 Off |                    0 |
|  0%   16C    P8              15W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G                    On  | 00000000:00:1D.0 Off |                    0 |
|  0%   16C    P8              15W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   16C    P8              15W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
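The topology matrix this legend belongs to does not appear above. If it helps, it can be regenerated on the same instance; a minimal sketch (standard `nvidia-smi topo` queries, assuming the installed nvidia-smi supports the `-p2p` option; output will differ per machine):

```bash
# Show the GPU-to-GPU connection topology matrix
# (values use the legend above: SYS / NODE / PHB / PXB / PIX / NV#)
nvidia-smi topo -m

# Show the peer-to-peer read capability matrix between GPUs
nvidia-smi topo -p2p r
```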

- tensorrtllm_backend: 0.8.0
- model: Llama-2-7b-chat-hf (https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
- docker image: ```nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3```
- run docker image: ```sudo docker run -it --ipc=host --gpus all --ulimit memlock=-1 --shm-size="2g" nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 /bin/bash```

### Who can help?

@kaiyux @byshiue

### Information

- [ ] The official example scripts
- [X] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

1. Run the docker image: nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
2. Install TensorRT-LLM v0.8.0
3. Compile the model
4. Run the model

### Expected behavior

All GPUs should be visible and usable for compiling and running the model.

### Actual behavior

Why are the other 3 GPUs not available?

[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.


logs:

```
./llama2_llm_tensorrt_engine_build_and_test.sh
[TensorRT-LLM] TensorRT-LLM version: 0.8.0
0.8.0
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.36s/it]
Weights loaded. Total time: 00:00:10
Total time of converting checkpoints: 00:02:05
[TensorRT-LLM] TensorRT-LLM version: 0.8.0
[04/22/2024-16:40:34] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set gemm_plugin to float16.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set lookup_plugin to None.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set lora_plugin to None.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set context_fmha to True.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set remove_input_padding to True.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set multi_block_mode to False.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set enable_xqa to True.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/22/2024-16:40:34] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/22/2024-16:40:34] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 183, GPU 256 (MiB)
[04/22/2024-16:41:24] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1798, GPU +312, now: CPU 2117, GPU 568 (MiB)
[04/22/2024-16:41:24] [TRT-LLM] [I] Set nccl_plugin to None.
[04/22/2024-16:41:24] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/22/2024-16:41:25] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[04/22/2024-16:41:25] [TRT] [W] Unused Input: position_ids
[04/22/2024-16:41:25] [TRT] [W] Detected layernorm nodes in FP16.
[04/22/2024-16:41:25] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[04/22/2024-16:41:25] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[04/22/2024-16:41:25] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2153, GPU 594 (MiB)
[04/22/2024-16:41:25] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 2155, GPU 604 (MiB)
[04/22/2024-16:41:25] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[04/22/2024-16:41:25] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[04/22/2024-16:41:35] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[04/22/2024-16:41:35] [TRT] [I] Detected 106 inputs and 1 output network tensors.
[04/22/2024-16:41:40] [TRT] [I] Total Host Persistent Memory: 82640
[04/22/2024-16:41:40] [TRT] [I] Total Device Persistent Memory: 0
[04/22/2024-16:41:40] [TRT] [I] Total Scratch Memory: 537001984
[04/22/2024-16:41:40] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 619 steps to complete.
[04/22/2024-16:41:40] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 24.3962ms to assign 12 blocks to 619 nodes requiring 3238006272 bytes.
[04/22/2024-16:41:40] [TRT] [I] Total Activation Memory: 3238006272
[04/22/2024-16:41:40] [TRT] [I] Total Weights Memory: 13476831232
[04/22/2024-16:41:40] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2192, GPU 13474 (MiB)
[04/22/2024-16:41:40] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2193, GPU 13484 (MiB)
[04/22/2024-16:41:40] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[04/22/2024-16:41:40] [TRT] [I] Engine generation completed in 15.4387 seconds.
[04/22/2024-16:41:40] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 12853 MiB
[04/22/2024-16:41:40] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +12853, now: CPU 0, GPU 12853 (MiB)
[04/22/2024-16:41:47] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 28514 MiB
[04/22/2024-16:41:47] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:22
[04/22/2024-16:41:48] [TRT-LLM] [I] Serializing engine to /tensorrt/tensorrt-models/Llama-2-7b-chat-hf/v0.8.0/trt-engines/fp16/1-gpu/rank0.engine...
[04/22/2024-16:42:09] [TRT-LLM] [I] Engine serialized. Total time: 00:00:21
[04/22/2024-16:42:10] [TRT-LLM] [I] Total time of building all engines: 00:01:36
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be array, but is null
[TensorRT-LLM][WARNING] Optional value for parameter lora_target_modules will not be set.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 12855 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13001, GPU 13130 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 13002, GPU 13140 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +12852, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][WARNING] The value of maxAttentionWindow cannot exceed maxSequenceLength. Therefore, it has been adjusted to match the value of maxSequenceLength.
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13035, GPU 16242 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 13035, GPU 16250 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] Allocate 5972688896 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 11392 tokens in paged KV cache.
[TensorRT-LLM] TensorRT-LLM version: 0.8.0
Input [Text 0]: " [INST] What is deep learning? [/INST]"
Output [Text 0 Beam 0]: " Deep learning is a subfield of machine learning that involves the use of artificial neural networks to model and solve complex problems. Here are some key things to know about deep learning:

  1. Artificial Neural Networks (ANNs): Deep learning algorithms are based on artificial neural networks, which are modeled after the structure and function of the human brain. ANNs consist of interconnected nodes or neurons that process inputs and produce outputs.
  2. Multi-Layer Perceptron (MLP): The most common type of deep learning algorithm is the multi-layer perceptron (MLP), which consists of multiple layers of neurons with nonlinear activation functions. Each layer processes the output from the previous layer, allowing the network to learn increasingly complex patterns in the data.
  3. Convolutional Neural Networks (CNNs): CNNs are a type of deep learning algorithm specifically designed for image recognition tasks. They use convolutional and pooling layers to extract features from images, followed by fully connected layers to make predictions.
  4. Recurrent Neural Networks (RNNs): RNNs are a type of deep learning algorithm used for sequential data, such as"
```

llama2_llm_tensorrt_engine_build_and_test.sh looks like this:

```bash
#!/bin/bash

HF_MODEL_NAME="Llama-2-7b-chat-hf"
HF_MODEL_PATH="meta-llama/Llama-2-7b-chat-hf"
# Clone the Hugging Face model repository
# ...
# Convert the model checkpoint to TensorRT format
python /tensorrt/v0.8.0/tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir /tensorrt/models/$HF_MODEL_NAME \
    --output_dir /tensorrt/tensorrt-models/$HF_MODEL_NAME/v0.8.0/trt-checkpoints/fp16/1-gpu/ \
    --dtype float16
# Build TensorRT engine
trtllm-build --checkpoint_dir /tensorrt/tensorrt-models/$HF_MODEL_NAME/v0.8.0/trt-checkpoints/fp16/1-gpu/ \
    --output_dir /tensorrt/tensorrt-models/$HF_MODEL_NAME/v0.8.0/trt-engines/fp16/1-gpu/ \
    --remove_input_padding enable \
    --context_fmha enable \
    --gemm_plugin float16 \
    --max_input_len 32768 \
    --strongly_typed
# Run inference with the TensorRT engine
python3 /tensorrt/v0.8.0/tensorrtllm_backend/tensorrt_llm/examples/run.py \
    --max_output_len=250 \
    --tokenizer_dir /tensorrt/models/$HF_MODEL_NAME \
    --engine_dir=/tensorrt/tensorrt-models/$HF_MODEL_NAME/v0.8.0/trt-engines/fp16/1-gpu/ \
    --max_attention_window_size=4096 \
    --temperature=0.3 \
    --top_k=50 \
    --top_p=0.9 \
    --repetition_penalty=1.2 \
    --input_text="[INST] What is deep learning? [/INST]"

Also, I noticed that when I measure the latency of run.py, it takes about 21 seconds to run. Why is that so slow?

```
time python3 /tensorrt/v0.8.0/tensorrtllm_backend/tensorrt_llm/examples/run.py \
    --max_output_len=250 \
    --tokenizer_dir /tensorrt/models/$HF_MODEL_NAME \
    --engine_dir=/tensorrt/tensorrt-models/$HF_MODEL_NAME/v0.8.0/trt-engines/fp16/1-gpu/ \
    --max_attention_window_size=4096 \
    --temperature=0.3 \
    --top_k=50 \
    --top_p=0.9 \
    --repetition_penalty=1.2 \
    --input_text="[INST] What is deep learning? [/INST]"

...

real   0m21.735s
user  0m11.898s
sys    0m14.218s
```

Additional notes

Here is how I installed TensorRT-LLM: https://medium.com/trendyol-tech/deploying-a-large-language-model-llm-with-tensorrt-llm-on-triton-inference-server-a-step-by-step-d53fccc856fa

byshiue commented 2 months ago

For the warning

[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.

it is expected due to your communication topology. If your topo were PXB, you would not see this warning.
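One way to confirm what the runtime sees is to query the CUDA peer-access capability directly. A minimal sketch, assuming PyTorch is available inside the Triton TRT-LLM container (it normally is):

```bash
# Ask CUDA whether GPU 0 can directly access each other GPU (P2P).
# On multi-GPU A10G instances without NVLink this typically prints False for
# every peer, which is the condition behind the "peer access ... is not
# available" warning.
python3 -c '
import torch
for peer in range(1, torch.cuda.device_count()):
    print(f"GPU 0 -> GPU {peer}:", torch.cuda.can_device_access_peer(0, peer))
'
```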

For the timing: if you use `time` to measure, it includes loading the model, allocating buffers at the beginning, and a lot of environment initialization and finalization. You could add `--run_profiling` to see the real inference time.
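For reference, a sketch of the same command with profiling enabled (paths taken from the script above; this assumes the 0.8.0 `examples/run.py`, where `--run_profiling` is a boolean flag):

```bash
# Same inference command as before, with profiling enabled so that run.py
# reports the pure generation latency separately from engine loading and
# tokenizer initialization.
python3 /tensorrt/v0.8.0/tensorrtllm_backend/tensorrt_llm/examples/run.py \
    --max_output_len=250 \
    --tokenizer_dir /tensorrt/models/Llama-2-7b-chat-hf \
    --engine_dir=/tensorrt/tensorrt-models/Llama-2-7b-chat-hf/v0.8.0/trt-engines/fp16/1-gpu/ \
    --max_attention_window_size=4096 \
    --temperature=0.3 \
    --top_k=50 \
    --top_p=0.9 \
    --repetition_penalty=1.2 \
    --run_profiling \
    --input_text="[INST] What is deep learning? [/INST]"
```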

geraldstanje commented 2 months ago

@byshiue are you sure that is only a warning and that it will work with both 1 and 4 GPUs?

byshiue commented 2 months ago

With such a topology, you need to disable `use_custom_all_reduce`.
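A sketch of what that rebuild could look like, reusing the trtllm-build invocation from the script above. This assumes `--use_custom_all_reduce` is exposed as an enable/disable plugin option in your trtllm-build version (worth confirming against `trtllm-build --help` for 0.8.0):

```bash
# Rebuild the engine with the custom all-reduce plugin disabled so that
# multi-GPU (tensor-parallel) engines fall back to NCCL all-reduce instead of
# the custom kernel, which requires direct P2P peer access between GPUs.
trtllm-build --checkpoint_dir /tensorrt/tensorrt-models/Llama-2-7b-chat-hf/v0.8.0/trt-checkpoints/fp16/1-gpu/ \
    --output_dir /tensorrt/tensorrt-models/Llama-2-7b-chat-hf/v0.8.0/trt-engines/fp16/1-gpu/ \
    --remove_input_padding enable \
    --context_fmha enable \
    --gemm_plugin float16 \
    --max_input_len 32768 \
    --strongly_typed \
    --use_custom_all_reduce disable
```

Note that this only matters for engines actually built for more than one GPU; a checkpoint converted for a multi-GPU run would also need the corresponding tensor-parallel settings at conversion time.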