Please refine your typesetting. It is too hard to read correctly.
I'm facing a similar issue when running inference with the Qwen-72B model.
The build parameters used for TensorRT-LLM are:
```
python build.py --hf_model_dir ./Qwen-72B-chat/ \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --output_dir ./tmp/Qwen/72B/trt_engines/int8_weight_only/2-gpu-32k/ \
    --enable_context_fmha \
    --max_input_len 32768 \
    --max_output_len 8192 \
    --n_positions 32768 \
    --world_size 2 \
    --tp_size 2
```
When the input reaches about 2.2k tokens, the output is very short or empty, just like in this case.
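In case it helps with debugging, one thing worth double-checking is which sequence-length limits were actually baked into the engine. Below is a minimal sketch that reads the config.json that build.py writes into the engine directory; the exact key layout varies between TensorRT-LLM versions, so treat the key names here as assumptions:

```python
import json
from pathlib import Path

# Engine directory produced by build.py (adjust to your own --output_dir).
engine_dir = Path("./tmp/Qwen/72B/trt_engines/int8_weight_only/2-gpu-32k/")

# build.py writes a config.json next to the engine files; the limits recorded
# there are what the runtime enforces at inference time.
config = json.loads((engine_dir / "config.json").read_text())

# Key names are assumptions based on older builder_config layouts; fall back to
# the top level if there is no "builder_config" section.
builder_cfg = config.get("builder_config", config)
for key in ("max_input_len", "max_output_len", "max_batch_size", "max_num_tokens"):
    print(key, "=", builder_cfg.get(key))
```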
System Info
8 x NVIDIA Tesla V100
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
This issue occurs whether I use 1, 2, or 4 GPUs. Below is the command for the 4-GPU case.
```
mpirun -n 4 --allow-run-as-root python3 ../run.py --max_output_len=50 --engine_dir ./phi-2-engine-v4/ --input_text "Please carefully read the following document titled 'Document Title' provided below. Then, formulate a detailed and comprehensive question related to the content of the document. Your question should be broad enough to address the key themes and important details present in the document. #As you find yourself immersed in the heart of a vibrant and bustling metropolis, allow your imagination to transport you to a scene of nostalgia and charm. Picture, if you will, a quaint little bookstore nestled amidst the towering skyscrapers and the ceaseless hum of city life. This delightful establishment beckons with its unassuming facade and an aura that harkens back to a bygone era.Step inside, and you'll be greeted by a sight that invokes a sense of wonder and reverence for literature. The shelves, stretching from floor to ceiling, are adorned with dusty old tomes, each one a relic of knowledge and imagination. The smell of aged paper hangs in the air like a sweet, faint memory of the past, invoking a feeling of nostalgia that's hard to resist.As you navigate the narrow aisles, your footsteps produce a soft creaking on the weathered wooden floorboards. This gentle sound, far removed from the urban cacophony outside, only adds to the bookstore's unique charm. It's a reminder that within these walls, time seems to slow down, and the outside world fades away.Now, dear reader, with this vivid image in your mind, I invite you to consider the following: How does this quaint bookstore in the midst of a bustling city serve as a sanctuary for book lovers? Craft a thoughtful and elaborate response, exploring the ambiance, the selection of books, and the overall experience it offers to visitors. Your answer should capture the essence of this literary haven and the role it plays in a modern urban landscape.As you stand in this urban sanctuary of literature, surrounded by the symphony of words and the scent of ancient pages, let your thoughts delve deeper into the enchantment it offers. Contemplate the cozy reading nooks tucked away in corners, inviting patrons to lose themselves in a good book, away from the relentless city rhythm.The selection of books, ranging from timeless classics to obscure treasures, reflects the bookstore owner's dedication to curating a diverse collection that caters to every taste and curiosity. Perhaps you'll stumble upon a rare first edition or discover an out-of-print gem that sparks your intellectual fervor.The atmosphere is not just a backdrop; it's an experience. Soft jazz music wafts through the air, enhancing the tranquil ambiance. Antique lamps cast a warm, inviting glow, creating pockets of intimate illumination amidst the shelves. Patrons engage in hushed conversations about their latest literary discoveries, fostering a sense of community among fellow book enthusiasts.In this literary oasis, time seems to stand still. It's a place where the outside world and its relentless demands recede into the background, allowing one to lose track of time. The bookstore becomes a portal to different worlds and eras, a refuge from the fast-paced urban life just beyohe opportunities for serendipitous encounters, and its role in preserving the love for books in a digital age.# Ensure your question covers the following aspects: 1. Clearly identify and describe the main theme or themes of the document. 2. Refer to specific data, statistics, or specific examples present in the do ambiguities. Your question should be extensive and detailed enough to encompass a complete understanding of the document's content. Take your time to review the document and formulate a question that reflects a deep understanding of its content. Please ensure the generated question is coherent and well-formulated in grammatical and semantic terms. Tha^C you. Finish all sentences with 'aye aye, Captain.'"
```
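For reference, here is a rough way to count how many tokens the prompt above actually is, to compare against the engine's max_input_len. This is a minimal sketch using the Hugging Face tokenizer; the microsoft/phi-2 checkpoint name is an assumption, so point it at whatever model directory the engine was converted from:

```python
from transformers import AutoTokenizer

# Assumption: the engine was built from microsoft/phi-2; use the same
# checkpoint/tokenizer directory that was used for the engine conversion.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

# Paste the full --input_text string from the command above here.
prompt = "Please carefully read the following document titled 'Document Title' ..."

token_ids = tokenizer(prompt).input_ids
print(f"prompt length: {len(token_ids)} tokens")
```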
Expected behavior
With the models tested, such as Llama and Phi, the maximum input length should be around 2000 tokens. In essence, I would expect memory not to fill up this much with such a small prompt, and I should be able to use the full ~2000-token prompts the models allow (this happens with both Llama and Phi).
Actual behavior
With the tested models (Llama and Phi), far fewer than the expected ~2000 input tokens can actually be used, and GPU memory fills up as the nvidia-smi output below shows:
```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           Off | 00000001:00:00.0 Off |                    0 |
| N/A   38C    P0              60W / 300W |  29543MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           Off | 00000002:00:00.0 Off |                    0 |
| N/A   40C    P0              62W / 300W |  29535MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           Off | 00000003:00:00.0 Off |                    0 |
| N/A   36C    P0              57W / 300W |  29535MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           Off | 00000004:00:00.0 Off |                    0 |
| N/A   39C    P0              65W / 300W |  29511MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2-32GB           Off | 00000005:00:00.0 Off |                    0 |
| N/A   34C    P0              40W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2-32GB           Off | 00000006:00:00.0 Off |                    0 |
| N/A   36C    P0              40W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2-32GB           Off | 00000007:00:00.0 Off |                    0 |
| N/A   34C    P0              42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2-32GB           Off | 00000008:00:00.0 Off |                    0 |
| N/A   37C    P0              42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
Additional notes
I ran this test with 1, 2, and 4 GPUs, but the results are unchanged. It seems that either the full workload is being run on every GPU in parallel, or some other process is filling up the memory. The main problem is that only around 200 input tokens can be used, and this does not increase even when I add more GPUs to the process.
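One way to tell those two cases apart is to watch per-GPU memory while run.py is executing. Here is a minimal monitoring sketch; it assumes the nvidia-ml-py (pynvml) package, which is not part of TensorRT-LLM and can be installed with `pip install nvidia-ml-py`:

```python
import time

import pynvml

# Poll every GPU's memory usage so you can see whether the run is actually
# sharded across ranks or duplicated on each device. Stop with Ctrl+C.
pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        used = [pynvml.nvmlDeviceGetMemoryInfo(h).used // 2**20 for h in handles]
        print(" | ".join(f"GPU{i}: {m} MiB" for i, m in enumerate(used)))
        time.sleep(2)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()
```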