NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Issue with token number: how to increase processed input tokens: models llama and phi, with 4 GPUs. #1071

Closed jamil-z closed 1 week ago

jamil-z commented 9 months ago

System Info

8 x NVIDIA Tesla V100


Reproduction

This issue occurs whether I use 1, 2, or 4 GPUs. Below I show the 4-GPU case.

  1. Once the model is ready to run on 4 GPUs, I execute the `run.py` script with the following command. This is for Phi; for Llama the command is the same, and a similar error occurs.

```shell
mpirun -n 4 --allow-run-as-root python3 ../run.py --max_output_len=50 --engine_dir ./phi-2-engine-v4/ --input_text "Please carefully read the following document titled 'Document Title' provided below. Then, formulate a detailed and comprehensive question related to the content of the document. Your question should be broad enough to address the key themes and important details present in the document. #As you find yourself immersed in the heart of a vibrant and bustling metropolis, allow your imagination to transport you to a scene of nostalgia and charm. Picture, if you will, a quaint little bookstore nestled amidst the towering skyscrapers and the ceaseless hum of city life. This delightful establishment beckons with its unassuming facade and an aura that harkens back to a bygone era.Step inside, and you'll be greeted by a sight that invokes a sense of wonder and reverence for literature. The shelves, stretching from floor to ceiling, are adorned with dusty old tomes, each one a relic of knowledge and imagination. The smell of aged paper hangs in the air like a sweet, faint memory of the past, invoking a feeling of nostalgia that's hard to resist.As you navigate the narrow aisles, your footsteps produce a soft creaking on the weathered wooden floorboards. This gentle sound, far removed from the urban cacophony outside, only adds to the bookstore's unique charm. It's a reminder that within these walls, time seems to slow down, and the outside world fades away.Now, dear reader, with this vivid image in your mind, I invite you to consider the following: How does this quaint bookstore in the midst of a bustling city serve as a sanctuary for book lovers? Craft a thoughtful and elaborate response, exploring the ambiance, the selection of books, and the overall experience it offers to visitors.
Your answer should capture the essence of this literary haven and the role it plays in a modern urban landscape.As you stand in this urban sanctuary of literature, surrounded by the symphony of words and the scent of ancient pages, let your thoughts delve deeper into the enchantment it offers. Contemplate the cozy reading nooks tucked away in corners, inviting patrons to lose themselves in a good book, away from the relentless city rhythm.The selection of books, ranging from timeless classics to obscure treasures, reflects the bookstore owner's dedication to curating a diverse collection that caters to every taste and curiosity. Perhaps you'll stumble upon a rare first edition or discover an out-of-print gem that sparks your intellectual fervor.The atmosphere is not just a backdrop; it's an experience. Soft jazz music wafts through the air, enhancing the tranquil ambiance. Antique lamps cast a warm, inviting glow, creating pockets of intimate illumination amidst the shelves. Patrons engage in hushed conversations about their latest literary discoveries, fostering a sense of community among fellow book enthusiasts.In this literary oasis, time seems to stand still. It's a place where the outside world and its relentless demands recede into the background, allowing one to lose track of time. The bookstore becomes a portal to different worlds and eras, a refuge from the fast-paced urban life just beyohe opportunities for serendipitous encounters, and its role in preserving the love for books in a digital age.# Ensure your question covers the following aspects: 1. Clearly identify and describe the main theme or themes of the document. 2. Refer to specific data, statistics, or specific examples present in the do ambiguities. Your question should be extensive and detailed enough to encompass a complete understanding of the document's content. Take your time to review the document and formulate a question that reflects a deep understanding of its content.
Please ensure the generated question is coherent and well-formulated in grammatical and semantic terms. Tha^C you. Finish all sentences with 'aye aye, Captain.'"
```

Expected behavior

With the models tested, such as Llama and Phi, the maximum input token count should be around 2000. In essence, I would expect memory not to fill up so much with such a small prompt, and I should be able to use the ~2000-token prompts that the models allow (this happens with both Llama and Phi).

Actual behavior


```
root@c6fc756c94d5:/TensorRT-LLM/examples/phi# mpirun -n 4 --allow-run-as-root python3 ../run.py --max_output_len=50 --engine_dir ./phi-2-engine-v4/ --input_text "Please carefully read the following document titled 'Document Title' provided below. Then, formulate a detailed and comprehensive question related to the content of the document. Your question should be broad enough to address the key themes and important details present in the document. #As you find yourself immersed in the heart of a vibrant and bustling metropolis, allow your imagination to transport you to a scene of nostalgia and charm. Picture, if you will, a quaint little bookstore nestled amidst the towering skyscrapers and the ceaseless hum of city life. This delightful establishment beckons with its unassuming facade and an aura that harkens back to a bygone era.Step inside, and you'll be greeted by a sight that invokes a sense of wonder and reverence for literature. The shelves, stretching from floor to ceiling, are adorned with dusty old tomes, each one a relic of knowledge and imagination. The smell of aged paper hangs in the air like a sweet, faint memory of the past, invoking a feeling of nostalgia that's hard to resist.As you navigate the narrow aisles, your footsteps produce a soft creaking on the weathered wooden floorboards. This gentle sound, far removed from the urban cacophony outside, only adds to the bookstore's unique charm. It's a reminder that within these walls, time seems to slow down, and the outside world fades away.Now, dear reader, with this vivid image in your mind, I invite you to consider the following: How does this quaint bookstore in the midst of a bustling city serve as a sanctuary for book lovers? Craft a thoughtful and elaborate response, exploring the ambiance, the selection of books, and the overall experience it offers to visitors.
Your answer should capture the essence of this literary haven and the role it plays in a modern urban landscape.As you stand in this urban sanctuary of literature, surrounded by the symphony of words and the scent of ancient pages, let your thoughts delve deeper into the enchantment it offers. Contemplate the cozy reading nooks tucked away in corners, inviting patrons to lose themselves in a good book, away from the relentless city rhythm.The selection of books, ranging from timeless classics to obscure treasures, reflects the bookstore owner's dedication to curating a diverse collection that caters to every taste and curiosity. Perhaps you'll stumble upon a rare first edition or discover an out-of-print gem that sparks your intellectual fervor.The atmosphere is not just a backdrop; it's an experience. Soft jazz music wafts through the air, enhancing the tranquil ambiance. Antique lamps cast a warm, inviting glow, creating pockets of intimate illumination amidst the shelves. Patrons engage in hushed conversations about their latest literary discoveries, fostering a sense of community among fellow book enthusiasts.In this literary oasis, time seems to stand still. It's a place where the outside world and its relentless demands recede into the background, allowing one to lose track of time. The bookstore becomes a portal to different worlds and eras, a refuge from the fast-paced urban life just beyohe opportunities for serendipitous encounters, and its role in preserving the love for books in a digital age.# Ensure your question covers the following aspects: 1. Clearly identify and describe the main theme or themes of the document. 2. Refer to specific data, statistics, or specific examples present in the do ambiguities. Your question should be extensive and detailed enough to encompass a complete understanding of the document's content. Take your time to review the document and formulate a question that reflects a deep understanding of its content.
Please ensure the generated question is coherent and well-formulated in grammatical and semantic terms. Tha^C you. Finish all sentences with 'aye aye, Captain.'"
[TensorRT-LLM][INFO] Engine version 0.9.0.dev2024020600 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json: [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][INFO] MPI size: 4, rank: 2 [TensorRT-LLM][INFO] Engine version 0.9.0.dev2024020600 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json: [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][INFO] MPI size: 4, rank: 1 [TensorRT-LLM][INFO] Engine version 0.9.0.dev2024020600 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json: [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 4, rank: 0 [TensorRT-LLM][INFO] Engine version 0.9.0.dev2024020600 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json: [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][INFO] MPI size: 4, rank: 3 [TensorRT-LLM][INFO] Loaded engine size: 1516 MiB [TensorRT-LLM][INFO] Loaded engine size: 1516 MiB [TensorRT-LLM][INFO] Loaded engine size: 1516 MiB [TensorRT-LLM][INFO] Loaded engine size: 1516 MiB [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1666, GPU 1843 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 1667, GPU 1843 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1666, GPU 1843 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1666, GPU 1843 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 1668, GPU 1853 (MiB) [TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2 [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 1668, GPU 1853 (MiB) [TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2 [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 1667, GPU 1853 (MiB) [TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2 [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 1668, GPU 1853 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2 NCCL version 2.18.1+cuda12.0 [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1513, now: CPU 0, GPU 1513 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1513, now: CPU 0, GPU 1513 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1513, now: CPU 0, GPU 1513 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1513, now: CPU 0, GPU 1513 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1905, GPU 2437 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1905, GPU 2421 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1905, GPU 2445 (MiB) [TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2 [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1905, GPU 2425 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1905, GPU 2449 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1905, GPU 2429 (MiB) [TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2 [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1905, GPU 2433 (MiB) [TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2 [TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1905, GPU 2457 (MiB) [TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2 [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1513 (MiB) [TensorRT-LLM][INFO]
[MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1513 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1513 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1513 (MiB) [TensorRT-LLM][INFO] Allocate 28374466560 bytes for k/v cache. [TensorRT-LLM][INFO] Using 346368 tokens in paged KV cache. [TensorRT-LLM][INFO] Allocate 28374466560 bytes for k/v cache. [TensorRT-LLM][INFO] Using 346368 tokens in paged KV cache. [TensorRT-LLM][INFO] Allocate 28374466560 bytes for k/v cache. [TensorRT-LLM][INFO] Using 346368 tokens in paged KV cache. [TensorRT-LLM][INFO] Allocate 28374466560 bytes for k/v cache. [TensorRT-LLM][INFO] Using 346368 tokens in paged KV cache. [TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024020600
Input [Text 0]: "Please carefully read the following document titled 'Document Title' provided below. [... same prompt as above, echoed back by run.py ...] Tha^C you. Finish all sentences with 'aye aye, Captain.'"
Output [Text 0 Beam 0]: " "
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024020600[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024020600[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024020600root@c6fc756c94d5:/TensorRT-LLM/examples/phi#
```


```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           Off | 00000001:00:00.0 Off |                    0 |
| N/A   38C    P0              60W / 300W |  29543MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           Off | 00000002:00:00.0 Off |                    0 |
| N/A   40C    P0              62W / 300W |  29535MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           Off | 00000003:00:00.0 Off |                    0 |
| N/A   36C    P0              57W / 300W |  29535MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           Off | 00000004:00:00.0 Off |                    0 |
| N/A   39C    P0              65W / 300W |  29511MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2-32GB           Off | 00000005:00:00.0 Off |                    0 |
| N/A   34C    P0              40W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2-32GB           Off | 00000006:00:00.0 Off |                    0 |
| N/A   36C    P0              40W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2-32GB           Off | 00000007:00:00.0 Off |                    0 |
| N/A   34C    P0              42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2-32GB           Off | 00000008:00:00.0 Off |                    0 |
| N/A   37C    P0              42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```

Additional notes

I conducted this test with 1, 2, and 4 GPUs, but the results remain unchanged. It seems that either every rank allocates the same memory in parallel, or some other process is filling it up. The main problem is that only around 200 input tokens can be used, and this limit does not increase even when I add more GPUs to the process.
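For what it's worth, the run log already shows where the memory goes: each rank reports "Allocate 28374466560 bytes for k/v cache" and "Using 346368 tokens in paged KV cache". A quick check with plain arithmetic on those logged numbers (nothing engine-specific assumed):

```python
# Numbers taken verbatim from the run log above.
kv_cache_bytes = 28_374_466_560   # "Allocate 28374466560 bytes for k/v cache."
kv_cache_tokens = 346_368         # "Using 346368 tokens in paged KV cache."

# Per-token KV cache cost implied by the log (divides evenly).
bytes_per_token = kv_cache_bytes // kv_cache_tokens
print(f"KV cache cost: {bytes_per_token} bytes/token, "
      f"{kv_cache_bytes / 2**30:.1f} GiB total per rank")
# -> KV cache cost: 81920 bytes/token, 26.4 GiB total per rank
```

The ~26.4 GiB of pre-allocated KV cache plus the 1516 MiB engine and runtime overhead lines up with the ~29.5 GiB per GPU that nvidia-smi reports. TensorRT-LLM reserves the paged KV cache up front, so high memory usage here does not by itself mean the prompt is too large; the effective input limit usually comes from the `max_input_len` the engine was built with, not from the number of GPUs.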

byshiue commented 9 months ago

Please refine your typesetting. It is currently too hard to read.

ywx217 commented 9 months ago

I'm facing a similar issue when running inference with the Qwen-72B model.

The build parameters used for TensorRT-LLM are:

    python build.py --hf_model_dir ./Qwen-72B-chat/ \
            --dtype float16 \
            --remove_input_padding \
            --use_gpt_attention_plugin float16 \
            --use_gemm_plugin float16 \
            --use_weight_only \
            --weight_only_precision int8 \
            --output_dir ./tmp/Qwen/72B/trt_engines/int8_weight_only/2-gpu-32k/ \
            --enable_context_fmha \
            --max_input_len 32768 \
            --max_output_len 8192 \
            --n_positions 32768 \
            --world_size 2 \
            --tp_size 2

When the input reaches 2.2k tokens, the output is very short or empty, just like in this case.
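For what it's worth, the per-token KV-cache cost can be estimated from the model shape, which helps judge whether `--max_input_len 32768` plus `--max_output_len 8192` can fit at all. A minimal sketch of the standard formula; the Qwen-72B numbers below (80 layers, 64 KV heads, head size 128, fp16 cache) are illustrative assumptions, so substitute the values from your model's actual config:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """KV-cache cost of one token: a K and a V vector in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed (illustrative) shape for Qwen-72B with an fp16 KV cache.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=64,
                               head_dim=128, dtype_bytes=2)
max_seq = 32768 + 8192  # max_input_len + max_output_len from the build command
print(f"{per_token} bytes/token -> "
      f"{per_token * max_seq / 2**30:.1f} GiB for one full-length sequence")
```

Under those assumptions one full-length sequence needs on the order of 100 GiB of KV cache before the tp_size=2 split, on top of the int8 weights, so if the pre-allocated cache is smaller than the requested sequence length, long inputs can leave no room for generation and yield near-empty outputs.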