NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Llama 2 Execution Bug #1510

Open · Hudayday opened this issue 4 months ago

Hudayday commented 4 months ago

System Info

CPU: x86_64; memory: 1024 GB; GPU: 8x A6000 (48 GB each); TensorRT-LLM version: 0.9.0.dev20240226; NVIDIA driver version: 535.171.04; CUDA version: 12.2; OS: Ubuntu 22.04

Who can help?

No response

Reproduction

This bug appears when I follow examples/llama to build the engine with high TP values (4 or 8) and int4 quantization.

First step: convert the checkpoint at examples/llama:

python convert_checkpoint.py --model_dir ./tmp/llama/7B/ \
    --output_dir ./tllm_checkpoint_8gpu_tp4_pp2 \
    --dtype float16 \
    --tp_size 4 \
    --pp_size 2 \
    --use_weight_only \
    --weight_only_precision int4

Second step: build the engine:

trtllm-build --checkpoint_dir ./tllm_checkpoint_8gpu_tp4_pp2 \
    --output_dir ./tmp/llama/7B/trt_engines/fp16/8-gpu/ \
    --gemm_plugin float16 \
    --weight_only_precision \
    --max_batch_size 1

Third step: run inference with the Python session:

mpirun -n 8 python3 ../run.py --max_output_len 40 \
    --input_file 2048.txt \
    --engine_dir ./tmp/llama/7B/trt_engines/fp16/8-gpu/ \
    --tokenizer_dir ./tmp/llama/7B/ \
    --use_py_session

Here, 2048.txt, which ../run.py reads, contains inputs collected from the "theblackcat102/sharegpt-english" dataset.
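
For reference, a minimal sketch of how such an input file could be assembled, assuming one prompt per line and a conversations/text schema; the actual columns of theblackcat102/sharegpt-english may differ and the split name is an assumption:

# Sketch only: builds a 2048.txt-style file with one prompt per line.
# The field names below ("conversations", "text") are assumptions about the
# dataset schema; adjust them to the dataset's actual columns.
from datasets import load_dataset

ds = load_dataset("theblackcat102/sharegpt-english", split="train")

with open("2048.txt", "w") as f:
    for sample in ds:
        turns = sample.get("conversations", [])
        if turns:
            # Use the first turn of each conversation as the prompt (assumption).
            prompt = turns[0].get("text", "").replace("\n", " ").strip()
            if prompt:
                f.write(prompt + "\n")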

Expected behavior

run.py should process all inputs from the file one by one.

Actual behavior

In most cases, it generates the output correctly. However, it randomly gets stuck on a particular input: runner.generate never returns, and GPU utilization stays pinned at 100% (normally it is around 60%). The input that triggers the bug is different every time, so I don't know exactly what causes it.

[Screenshot 2024-04-27 123552]

outputs = runner.generate(
    batch_input_ids,
    max_new_tokens=args.max_output_len,
    max_attention_window_size=args.max_attention_window_size,
    end_id=end_id,
    pad_id=pad_id,
    temperature=args.temperature,
    top_k=args.top_k,
    top_p=args.top_p,
    num_beams=args.num_beams,
    length_penalty=args.length_penalty,
    repetition_penalty=args.repetition_penalty,
    presence_penalty=args.presence_penalty,
    frequency_penalty=args.frequency_penalty,
    stop_words_list=stop_words_list,
    bad_words_list=bad_words_list,
    lora_uids=args.lora_task_uids,
    prompt_table_path=args.prompt_table_path,
    prompt_tasks=args.prompt_tasks,
    streaming=args.streaming,
    output_sequence_lengths=True,
    return_dict=True)
torch.cuda.synchronize()
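
To narrow down which input triggers the hang, a debugging sketch along these lines could wrap the runner.generate call above in a watchdog thread and dump the offending batch. The 300-second timeout is an arbitrary choice, only a subset of the original keyword arguments is shown, and under mpirun each rank would log independently:

# Debugging sketch: run generate() in a worker thread and save the input batch
# if it does not finish within a timeout. Assumes it replaces the
# runner.generate(...) call in run.py; names (runner, batch_input_ids, args,
# end_id, pad_id) come from that script.
import threading
import torch

result = {}

def _generate():
    result["outputs"] = runner.generate(
        batch_input_ids,
        max_new_tokens=args.max_output_len,
        end_id=end_id,
        pad_id=pad_id,
        output_sequence_lengths=True,
        return_dict=True)

worker = threading.Thread(target=_generate, daemon=True)
worker.start()
worker.join(timeout=300)  # arbitrary 5-minute timeout

if worker.is_alive():
    # Hang detected: persist the input ids so the case can be replayed or attached
    # to the report. The stuck CUDA call itself may still require killing the process.
    torch.save(batch_input_ids, "hanging_batch_input_ids.pt")
    raise RuntimeError("runner.generate() appears to be stuck on this batch")

outputs = result["outputs"]
torch.cuda.synchronize()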

Additional notes

This usually happens for Llama 7B and Llama 13B with int4 and tp > 4, but I cannot reproduce the error exactly by rerunning the same input.

Hudayday commented 4 months ago

Falcon and other models also suffer from the same issue when the TP level is greater than 4 with int4 quantization.

Hudayday commented 4 months ago

The bug occurs with various models and different types of quantization (including float16) when using tp = 4 or tp = 8. Occasionally, SM utilization spikes to 100% and the system completely freezes: runner.generate() never produces any output, and GPU utilization remains at 100%.
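
To capture the spike while reproducing, per-GPU utilization can be logged from a second terminal, for example:

nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used --format=csv -l 1 > gpu_util.log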

byshiue commented 4 months ago

Could you try adding --use_custom_all_reduce disable when building the engine?
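
For reference, a rebuild with that flag could look like this, reusing the checkpoint and output paths from step two above:

trtllm-build --checkpoint_dir ./tllm_checkpoint_8gpu_tp4_pp2 \
    --output_dir ./tmp/llama/7B/trt_engines/fp16/8-gpu/ \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --use_custom_all_reduce disable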

Hudayday commented 4 months ago

> Could you try adding --use_custom_all_reduce disable when building the engine?

The issue still happens when use_custom_all_reduce is disabled. It occurs randomly after running hundreds of batch = 1 requests. Each request runs independently, so there is no concurrency.

With use_custom_all_reduce disabled, the issue happens less frequently, but it does not disappear.

The A6000s are connected via PCIe.