Hongbosherlock opened this issue 3 months ago
Could you try pip install tensorrt_llm==0.11.0.dev2024061100 first?
Or you can try pip install tensorrt_llm==0.11.0.dev2024061800 tomorrow.
Thanks
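If a plain pip install cannot find those dev wheels, an extra package index is usually required. A hedged sketch (the NVIDIA index URL is an assumption, not stated in this thread):

```shell
# Dev wheels of tensorrt_llm are typically hosted on NVIDIA's own PyPI index.
pip install tensorrt_llm==0.11.0.dev2024061100 --extra-index-url https://pypi.nvidia.com

# Confirm which version actually got installed.
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```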
It works well using the latest version
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40S On | 00000000:01:00.0 Off | 0 |
| N/A 31C P8 32W / 350W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
python examples/quantization/quantize.py \
    --model_dir /home/scratch.trt_llm_data/llm-models/llama-models-v3/llama-v3-8b-instruct-hf/ \
    --dtype float16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --output_dir ./tmp/llama3-8b-awq \
    --calib_size 32

trtllm-build \
    --checkpoint_dir ./tmp/llama3-8b-awq \
    --output_dir ./tmp/llama3-8b-awq-engine \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --remove_input_padding enable \
    --paged_kv_cache enable \
    --max_input_len 3000 \
    --max_output_len 3000

python examples/run.py \
    --max_output_len=500 \
    --tokenizer_dir=/home/scratch.trt_llm_data/llm-models/llama-models-v3/llama-v3-8b-instruct-hf/ \
    --engine_dir=./tmp/llama3-8b-awq-engine
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061800
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[06/18/2024-10:05:23] [TRT-LLM] [I] Load engine takes: 34.90785765647888 sec
Input [Text 0]: "<|begin_of_text|>Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: " painter and sculptor before turning to photography. He began his career in the 1920s, working for various magazines and newspapers, and quickly gained a reputation for his innovative and expressive style. Soyer's photographs often featured everyday life, landscapes, and still-life compositions, and were characterized by their use of light, texture, and composition. He was also known for his portraits of famous people, including artists, writers, and musicians. Soyer's work was widely exhibited and published, and he is considered one of the most important French photographers of the 20th century. (Source: Getty Museum) [more]
...
Could you try pip install tensorrt_llm==0.11.0.dev2024061100 first? Or you can try pip install tensorrt_llm==0.11.0.dev2024061800 tomorrow. Thanks

Successfully installed the new version, but got errors when running:
Traceback (most recent call last):
File "/TensorRT-LLM/examples/run.py", line 23, in <module>
from utils import (DEFAULT_HF_MODEL_DIRS, DEFAULT_PROMPT_TEMPLATES,
File "/TensorRT-LLM/examples/utils.py", line 23, in <module>
from tensorrt_llm.builder import get_engine_version
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/__init__.py", line 32, in <module>
import tensorrt_llm.functional as functional
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 25, in <module>
import tensorrt as trt
File "/usr/local/lib/python3.10/dist-packages/tensorrt/__init__.py", line 18, in <module>
from tensorrt_bindings import *
ModuleNotFoundError: No module named 'tensorrt_bindings'
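The traceback above means the tensorrt wrapper package is importable but its companion tensorrt_bindings wheel is not installed alongside it. A minimal diagnostic sketch for checking which pieces of the wheel split are actually present (the tensorrt_libs name is an assumption based on the tensorrt-cu12-* package names in this thread):

```python
import importlib.util

def missing_modules(names):
    """Return the subset of `names` whose import spec cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# In the broken environment above, "tensorrt_bindings" would show up here
# even though the "tensorrt" wrapper package itself is installed.
print(missing_modules(["tensorrt", "tensorrt_bindings", "tensorrt_libs"]))
```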
Please try the container: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
Or build the container using make -C docker release_build
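To confirm the TensorRT-LLM version bundled in that container, something like the following should work (the docker invocation is a sketch; the --gpus flag assumes the NVIDIA Container Toolkit is installed):

```shell
# Pull the suggested Triton + TensorRT-LLM container and print the
# bundled tensorrt_llm version.
docker run --rm --gpus all \
  nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 \
  python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```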
thanks, it works for me now.
hi @hijkzzz, can you run the benchmark successfully on L40s? When I run:
./benchmarks/gptSessionBenchmark \
--engine_dir "/target/model/trt_engines/w4a8_AWQ/1-gpu/" \
--batch_size "1" \
--input_output_len "60,20"
I got errors:
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: key_size <= remaining_buffer_size (/target/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/cubinObjRegistry.h:49)
1 0x5584e5f4b373 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7f0e64a684b7 tensorrt_llm::kernels::jit::CubinObjRegistryTemplate<tensorrt_llm::kernels::XQAKernelFullHashKey, tensorrt_llm::kernels::XQAKernelFullHasher>::CubinObjRegistryTemplate(void const*, unsigned long) + 1639
3 0x7f0e64a674f2 tensorrt_llm::kernels::DecoderXQARunner::Resource::Resource(void const*, unsigned long) + 50
4 0x7f0e9ad778f9 tensorrt_llm::plugins::GPTAttentionPluginCommon::GPTAttentionPluginCommon(void const*, unsigned long) + 873
5 0x7f0e9ad960d3 tensorrt_llm::plugins::GPTAttentionPlugin::GPTAttentionPlugin(void const*, unsigned long) + 19
6 0x7f0e9ad96152 tensorrt_llm::plugins::GPTAttentionPluginCreator::deserializePlugin(char const*, void const*, unsigned long) + 50
7 0x7f0e1493f102 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x1066102) [0x7f0e1493f102]
8 0x7f0e1493a1de /usr/local/tensorrt/lib/libnvinfer.so.10(+0x10611de) [0x7f0e1493a1de]
9 0x7f0e148b5177 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xfdc177) [0x7f0e148b5177]
10 0x7f0e148b33fe /usr/local/tensorrt/lib/libnvinfer.so.10(+0xfda3fe) [0x7f0e148b33fe]
11 0x7f0e148cbf27 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xff2f27) [0x7f0e148cbf27]
12 0x7f0e148cee7d /usr/local/tensorrt/lib/libnvinfer.so.10(+0xff5e7d) [0x7f0e148cee7d]
13 0x7f0e148cf3b4 /usr/local/tensorrt/lib/libnvinfer.so.10(+0xff63b4) [0x7f0e148cf3b4]
14 0x7f0e148fe64f /usr/local/tensorrt/lib/libnvinfer.so.10(+0x102564f) [0x7f0e148fe64f]
15 0x7f0e148ff3f5 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x10263f5) [0x7f0e148ff3f5]
16 0x7f0e148ff489 /usr/local/tensorrt/lib/libnvinfer.so.10(+0x1026489) [0x7f0e148ff489]
17 0x7f0e666a8a68 tensorrt_llm::runtime::TllmRuntime::TllmRuntime(void const*, unsigned long, float, nvinfer1::ILogger&) + 504
18 0x7f0e66653db6 tensorrt_llm::runtime::GptSession::GptSession(tensorrt_llm::runtime::GptSession::Config const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, void const*, unsigned long, std::shared_ptr<nvinfer1::ILogger>) + 1126
19 0x5584e5f4fce0 ./benchmarks/gptSessionBenchmark(+0x1dce0) [0x5584e5f4fce0]
20 0x7f0e5f5edd90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f0e5f5edd90]
21 0x7f0e5f5ede40 __libc_start_main + 128
22 0x5584e5f533c5 ./benchmarks/gptSessionBenchmark(+0x213c5) [0x5584e5f533c5]
@Hongbosherlock Based on your crash log, I guess you tried to run the benchmark in w4a8_awq mode, right? If so, I managed to run the benchmark on L40:
./cpp/build/benchmarks/gptSessionBenchmark --engine_dir examples/quantization/engine_outputs --batch_size "1" --input_output_len "60,20"
Benchmarking done. Iteration: 10, duration: 1.56 sec.
Latencies: [155.62, 155.81, 155.41, 155.40, 156.54, 155.34, 155.42, 155.83, 156.30, 155.64]
[BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 155.73 tokensPerSec 128.43 generation_time(ms) 145.62 generationTokensPerSec 137.35 gpu_peak_mem(gb) 43.57
Even for the original int4_awq, I could also run the benchmark, with the output below:
./cpp/build/benchmarks/gptSessionBenchmark --engine_dir examples/quantization/engine_outputs --batch_size "1" --input_output_len "60,20"
Benchmarking done. Iteration: 10, duration: 1.57 sec.
Latencies: [156.85, 156.74, 156.75, 156.88, 156.70, 156.73, 156.82, 156.93, 156.74, 156.89]
[BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 156.80 tokensPerSec 127.55 generation_time(ms) 144.26 generationTokensPerSec 138.64 gpu_peak_mem(gb) 43.61
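For reference, the tokensPerSec figures in the [BENCHMARK] lines above are consistent with output length divided by end-to-end latency. A sketch of that arithmetic (not the benchmark's actual source code):

```python
def tokens_per_sec(output_length, latency_ms):
    """Throughput implied by a per-request latency, as reported in the
    gptSessionBenchmark summary lines (sketch of the relationship only)."""
    return output_length / (latency_ms / 1000.0)

# int4_awq run above: 20 output tokens at 156.80 ms mean latency.
print(round(tokens_per_sec(20, 156.80), 2))  # 127.55, matching the log
# Generation-only time was 144.26 ms.
print(round(tokens_per_sec(20, 144.26), 2))  # 138.64, matching the log
```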
Hi, which NVIDIA Docker image uses the latest TensorRT-LLM version, 0.11.0.dev?
System Info
ubuntu 20.04
tensorrt 10.0.1
tensorrt-cu12 10.0.1
tensorrt-cu12-bindings 10.0.1
tensorrt-cu12-libs 10.0.1
tensorrt-llm 0.11.0.dev2024052100
NVIDIA L40S
Who can help?
@Barry-Delaney @Tracin @byshiue
Information

Tasks

examples folder (such as GLUE/SQuAD, ...)

Reproduction
using w4a8_awq

build

run:

Expected behavior

get inference result

actual behavior

got error message

When building the engine before, I got warnings like:

additional notes

When I try the same way with int4_awq, I got errors when trying trtllm-build:

and also: