NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

MPI Runtime error when running llama3 70B tp_size=8 #1700

Closed WDONG66 closed 3 months ago

WDONG66 commented 3 months ago

System Info

TensorRT-LLM: 0.11.0.dev2024052800
TensorRT: 10.0.1
Device: A800
Code for TensorRT-LLM: latest version on the main branch

Who can help?

@byshiue

Reproduction

trtllm-build --checkpoint_dir {checkpoint_dir} --output_dir {output_dir} --gemm_plugin float16 --gpt_attention_plugin float16 --paged_kv_cache enable --max_batch_size 32 --max_input_len 1024 --max_output_len 1024 --tp_size 8 --pp_size 1

mpirun -n 8 --allow-run-as-root python3 run.py --max_output_len=2048 --tokenizer_dir {tokenizer_dir} --engine_dir {engine_dir} --streaming --log_level info

Expected behavior

Runs successfully.

Actual behavior


Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 6 with PID 0 on node xxxxxxxx exited on signal 9 (Killed).

I am also a little confused about a parameter in config.json: "auto_parallel_config": { "world_size": 1, "gpus_per_node": 8, ... }

Is that correct for tp_size=8, pp_size=1? (To my understanding, world_size should equal tp_size * pp_size.)
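
For tp_size=8 and pp_size=1, I would expect the mapping in the checkpoint's config.json to look something like this (my assumption of the layout):

{
  "mapping": {
    "world_size": 8,
    "tp_size": 8,
    "pp_size": 1
  }
}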

Additional notes

None

nv-guomingz commented 3 months ago

Hi @WDONG66, could you please paste the full log of running the command mpirun -n 8 --allow-run-as-root python3 run.py --max_output_len=2048 --tokenizer_dir {tokenizer_dir} --engine_dir {engine_dir} --streaming --log_level info?

For your second question, that part of the config is related to the auto pipeline-parallel feature, which has nothing to do with TP.

Please refer to the parallel-configuration part of the documentation for your setup.
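
For reference, the TP/PP sizes are typically specified when converting the checkpoint rather than at build time. A rough sketch based on the llama example's convert_checkpoint.py (paths are placeholders, not your actual directories):

# Shard the model into 8 tensor-parallel ranks (world_size = tp_size * pp_size = 8)
python3 convert_checkpoint.py --model_dir ./Meta-Llama-3-70B \
                              --output_dir ./ckpt_tp8 \
                              --dtype float16 \
                              --tp_size 8 \
                              --pp_size 1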

[screenshot: parallel-configuration section of the documentation]

nv-guomingz commented 3 months ago

I managed to run llama3-70b on 8x L40S with the output below.

mpirun -n 8 --allow-run-as-root python3 ../run.py --max_output_len=2048 --tokenizer_dir ./llama-3-70b-inst --engine_dir ./engine_outputs  --max_output_len=2048
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Input [Text 0]: "<|begin_of_text|>Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: " chef in Paris before moving to England in 1830. He became one of the most celebrated chefs of his time, known for his innovative and elaborate dishes. Soyer was also a prolific writer and published several cookbooks, including 'The Gastronomic Regenerator' (1846) and 'Soyer's Culinary Campaign' (1857). He was a pioneer of modern French cuisine in England and was chef de cuisine at the Reform Club in London from 1837 to 1850. Soyer was also a philanthropist and was involved in several charitable projects, including providing food for the poor during the Irish Famine. He died in 1858, aged 48, and was buried in Kensal Green Cemetery, London. This portrait, by the French artist François Gérard, dates from around 1845. It shows Soyer in his chef's uniform, with a white hat and apron, and a proud expression on his face. The painting is a testament to Soyer's reputation as a master chef and his contribution to the culinary world. It is now held in the collection of the Victoria and Albert Museum in London. François Gérard was a French painter who was active in Paris during the early 19th century. He was known for his portraits of prominent figures, including Napoleon Bonaparte and several members of the French aristocracy. Gérard's style was characterized by his use of rich colors and his ability to capture the personality and character of his subjects. This portrait of Soyer is a fine example of Gérard's work and provides a fascinating glimpse into the life of one of the most influential chefs of the 19th century."
WDONG66 commented 3 months ago

Hello, thanks for your answer. The full log is as follows:

root@szzj-isa-ai-peking-poc13:/workspace/trt_llm/TensorRT-LLM/examples# mpirun -n 8 --allow-run-as-root --merge-stderr-to-stdout python3 run.py --max_output_len=2048 --tokenizer_dir /workspace/models/Meta-Llama-3-70B --engine_dir /workspace/llama3_70B_engine/fp16 --loglevel info --streaming
[05/30/2024-08:54:43] [TRT-LLM] [W] Found pynvml==11.5.0 and cuda driver version 525.125.06. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
(the warning above is printed once per rank, 8 times in total)
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
(printed once per rank, 8 times in total)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(printed once per rank, 8 times in total)
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 0 on node szzj-isa-ai-peking-poc13 exited on signal 9 (Killed).
--------------------------------------------------------------------------

And for the second question: I have checked the parallel configuration, and it is the same as yours.

Could it be that my MPI version is different from yours, or that the RAM size or other resources are limited inside my Docker container?

nv-guomingz commented 3 months ago

My mpirun version is as below, and my RAM size is 1 TB.

mpirun --version
mpirun (Open MPI) 4.1.7a1

My suggestion is to try another model, such as Llama v2, to see whether the issue still exists. If it does, the problem has nothing to do with the specific model but with your environment.
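
Also, signal 9 means the process was SIGKILLed, which on Linux is very often the kernel OOM killer firing when a container hits its memory limit. A rough way to check (a sketch; it needs access to the host kernel log, and the cgroup path differs between v1 and v2):

# Look for OOM-killer activity in the kernel log around the crash time
dmesg -T | grep -i -E "out of memory|killed process"

# Inside the container, inspect the memory limit (cgroup v1 path first, then v2)
cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null || cat /sys/fs/cgroup/memory.max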

nv-guomingz commented 3 months ago

[05/30/2024-08:54:43] [TRT-LLM] [W] Found pynvml==11.5.0 and cuda driver version 525.125.06. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.

My CUDA driver is 550.54.14.

WDONG66 commented 3 months ago

OK, thanks a lot. I updated the memory and memory-swap limits of my Docker container and it works now~
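
For anyone hitting the same problem, the limits can be raised roughly like this (the values and container name are illustrative, not the exact ones I used):

# Raise the limits of an existing container
docker update --memory 512g --memory-swap 512g my_trtllm_container

# Or start a fresh container with higher limits and unlimited swap
docker run --gpus all --memory 512g --memory-swap -1 --shm-size 64g ...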