geraldstanje opened this issue 3 months ago
For the warning

```
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
```

this is expected given your communication topology; if your topology were PXB, you would not see the warning.

For the timing: if you measure with `time`, the result includes loading the model, allocating buffers at startup, and other initialization and finalization of the environment. You can add `--run_profiling` to see the real inference time.
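The difference between end-to-end `time` and inference-only profiling can be sketched in plain Python. `load_model` and `generate` below are hypothetical stand-ins for the real TensorRT-LLM calls; the point is only where the timer starts and stops:

```python
import time

def load_model():
    # Stand-in for engine deserialization and buffer allocation
    # (hypothetical; in practice this is where run.py spends most of its time).
    time.sleep(0.01)
    return object()

def generate(model, prompt):
    # Stand-in for the actual inference call.
    return f"echo: {prompt}"

t0 = time.perf_counter()
model = load_model()                              # one-time setup cost
t1 = time.perf_counter()
out = generate(model, "What is deep learning?")   # per-request cost
t2 = time.perf_counter()

print(f"setup:     {t1 - t0:.4f} s")  # dominates what `time ./run.py` reports
print(f"inference: {t2 - t1:.4f} s")  # roughly what --run_profiling isolates
```

Timing only the `generate` call is what `--run_profiling` does for you inside `run.py`, which is why its numbers are far smaller than wall-clock process time.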
@byshiue Are you sure this is only a warning and that it will still work with 1 GPU and with 4 GPUs?
With such a topology, you need to disable `use_custom_all_reduce`.
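For reference, with the new builder flow shown in the log this would be done at engine-build time. Treat the command below as a sketch only: the `--use_custom_all_reduce disable` spelling is assumed from the 0.8.x-era `trtllm-build` options, and the checkpoint/output paths are placeholders, so verify against `trtllm-build --help` for your installed version:

```shell
# Sketch: rebuild the engine without the custom all-reduce kernel.
# Flag name and values assumed from TensorRT-LLM 0.8.x; paths are placeholders.
trtllm-build \
    --checkpoint_dir ./Llama-2-7b-chat-hf/trt-ckpt \
    --output_dir ./Llama-2-7b-chat-hf/trt-engines/fp16/4-gpu \
    --gemm_plugin float16 \
    --use_custom_all_reduce disable
```

After rebuilding, the engine falls back to the NCCL all-reduce path, which does not require direct peer access between GPUs.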
System Info

GPU:

GPU topology:
```
Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
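The legend can be turned into a small helper that reads one row of `nvidia-smi topo -m` output and flags links that typically lack peer (P2P) access. This is a sketch under the assumption that P2P is usually available over PIX/PXB/NV# links and unavailable over SYS/NODE/PHB; the real answer depends on the platform and driver:

```python
# Descriptions of nvidia-smi topo -m link types, per the legend above.
LEGEND = {
    "X":    "self",
    "SYS":  "PCIe + SMP interconnect between NUMA nodes (e.g. QPI/UPI)",
    "NODE": "PCIe + interconnect between PCIe Host Bridges within a NUMA node",
    "PHB":  "PCIe + a PCIe Host Bridge (typically the CPU)",
    "PXB":  "multiple PCIe bridges (no PCIe Host Bridge)",
    "PIX":  "at most a single PCIe bridge",
}

def describe(link: str) -> str:
    if link.startswith("NV"):
        return f"bonded set of {link[2:]} NVLinks"
    return LEGEND[link]

def peer_access_expected(link: str) -> bool:
    # Assumption: P2P works over NVLink and same-host-bridge PCIe links only.
    return link.startswith("NV") or link in ("PIX", "PXB")

# Example row for GPU0 on a system where every cross-GPU link is SYS:
row = ["X", "SYS", "SYS", "SYS"]
flags = [peer_access_expected(link) for link in row[1:]]
print(flags)  # → [False, False, False]: one peer-access warning per device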
```
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
```
```
./llama2_llm_tensorrt_engine_build_and_test.sh
[TensorRT-LLM] TensorRT-LLM version: 0.8.0
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.36s/it]
Weights loaded. Total time: 00:00:10
Total time of converting checkpoints: 00:02:05
[TensorRT-LLM] TensorRT-LLM version: 0.8.0
[04/22/2024-16:40:34] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set gemm_plugin to float16.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set lookup_plugin to None.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set lora_plugin to None.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set context_fmha to True.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set remove_input_padding to True.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set multi_block_mode to False.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set enable_xqa to True.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/22/2024-16:40:34] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/22/2024-16:40:34] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/22/2024-16:40:34] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 183, GPU 256 (MiB)
[04/22/2024-16:41:24] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1798, GPU +312, now: CPU 2117, GPU 568 (MiB)
[04/22/2024-16:41:24] [TRT-LLM] [I] Set nccl_plugin to None.
[04/22/2024-16:41:24] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/22/2024-16:41:25] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[04/22/2024-16:41:25] [TRT] [W] Unused Input: position_ids
[04/22/2024-16:41:25] [TRT] [W] Detected layernorm nodes in FP16.
[04/22/2024-16:41:25] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[04/22/2024-16:41:25] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[04/22/2024-16:41:25] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2153, GPU 594 (MiB)
[04/22/2024-16:41:25] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 2155, GPU 604 (MiB)
[04/22/2024-16:41:25] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[04/22/2024-16:41:25] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[04/22/2024-16:41:35] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[04/22/2024-16:41:35] [TRT] [I] Detected 106 inputs and 1 output network tensors.
[04/22/2024-16:41:40] [TRT] [I] Total Host Persistent Memory: 82640
[04/22/2024-16:41:40] [TRT] [I] Total Device Persistent Memory: 0
[04/22/2024-16:41:40] [TRT] [I] Total Scratch Memory: 537001984
[04/22/2024-16:41:40] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 619 steps to complete.
[04/22/2024-16:41:40] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 24.3962ms to assign 12 blocks to 619 nodes requiring 3238006272 bytes.
[04/22/2024-16:41:40] [TRT] [I] Total Activation Memory: 3238006272
[04/22/2024-16:41:40] [TRT] [I] Total Weights Memory: 13476831232
[04/22/2024-16:41:40] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2192, GPU 13474 (MiB)
[04/22/2024-16:41:40] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2193, GPU 13484 (MiB)
[04/22/2024-16:41:40] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[04/22/2024-16:41:40] [TRT] [I] Engine generation completed in 15.4387 seconds.
[04/22/2024-16:41:40] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 12853 MiB
[04/22/2024-16:41:40] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +12853, now: CPU 0, GPU 12853 (MiB)
[04/22/2024-16:41:47] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 28514 MiB
[04/22/2024-16:41:47] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:22
[04/22/2024-16:41:48] [TRT-LLM] [I] Serializing engine to /tensorrt/tensorrt-models/Llama-2-7b-chat-hf/v0.8.0/trt-engines/fp16/1-gpu/rank0.engine...
[04/22/2024-16:42:09] [TRT-LLM] [I] Engine serialized. Total time: 00:00:21
[04/22/2024-16:42:10] [TRT-LLM] [I] Total time of building all engines: 00:01:36
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be array, but is null
[TensorRT-LLM][WARNING] Optional value for parameter lora_target_modules will not be set.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 12855 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13001, GPU 13130 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 13002, GPU 13140 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +12852, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][WARNING] The value of maxAttentionWindow cannot exceed maxSequenceLength. Therefore, it has been adjusted to match the value of maxSequenceLength.
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13035, GPU 16242 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 13035, GPU 16250 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] Allocate 5972688896 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 11392 tokens in paged KV cache.
```
```
[TensorRT-LLM] TensorRT-LLM version: 0.8.0
Input [Text 0]: "[INST] What is deep learning? [/INST]"
Output [Text 0 Beam 0]: " Deep learning is a subfield of machine learning that involves the use of artificial neural networks to model and solve complex problems. Here are some key things to know about deep learning:
```

llama2_llm_tensorrt_engine_build_and_test.sh looks like this:
Also, what I noticed is that when I measure the latency of run.py, it takes 21 seconds to run. Why is it so slow?
Additional notes

Here is how I installed tensorrt-llm: https://medium.com/trendyol-tech/deploying-a-large-language-model-llm-with-tensorrt-llm-on-triton-inference-server-a-step-by-step-d53fccc856fa