NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
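
For context, recent TensorRT-LLM releases also ship a high-level Python LLM API (newer than the 0.5/0.6-era quantize.py/build.py workflow discussed in this thread). A minimal sketch, assuming a recent release where the tensorrt_llm.LLM entry point is available:

from tensorrt_llm import LLM, SamplingParams

# Build (or load) an engine for a Hugging Face model, then generate.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(max_tokens=64)
for output in llm.generate(["Hello, my name is"], params):
    print(output.outputs[0].text)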

LLaMA 7B AWQ Quantization #460

Closed: kamalkraj closed this issue 6 months ago

kamalkraj commented 11 months ago

The instructions for quantization seem incorrect: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#awq

Using the command below:

# Quantize HF LLaMA 7B checkpoint into INT4 AWQ format
python quantize.py --model_dir ./tmp/llama/7B \
                --dtype float16 \
                --qformat int4_awq \
                --export_path ./llama-7b-4bit-gs128-awq.pt \
                --calib_size 32

This results in a folder rather than a single file:

tmp/llama-7b-4bit-gs128-awq.pt/
├── llama_tp1.json
└── llama_tp1_rank0.npz
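
Despite the .pt suffix, the export path is a directory containing a JSON config and a NumPy .npz weight archive. A minimal sketch for inspecting the archive (the path is taken from the tree above):

import numpy as np

# Load the rank-0 weight archive produced by quantize.py (TP=1 export).
ckpt = np.load("./llama-7b-4bit-gs128-awq.pt/llama_tp1_rank0.npz")
for name in ckpt.files:
    print(name, ckpt[name].shape)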

Using the next command:

python build.py --model_dir ./tmp/llama/7B/ \
                --quant_ckpt_path ./llama-7b-4bit-gs128-awq.pt \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --use_weight_only \
                --weight_only_precision int4_awq \
                --per_group \
                --output_dir ./tmp/llama/7B/trt_engines/int4_AWQ/1-gpu/

results in an error:

 /code/tensorrt_llm/examples/llama/weight.py:1188 in load_from_awq_llama                          │
│                                                                                                  │
│   1185 │   tik = time.time()                                                                     │
│   1186 │                                                                                         │
│   1187 │   if quant_ckpt_path.endswith(".pt"):                                                   │
│ ❱ 1188 │   │   awq_llama = torch.load(quant_ckpt_path)                                           │
│   1189 │   │   awq_prefix = "model."                                                             │
│   1190 │   │   awq_suffix_list = [                                                               │
│   1191 │   │   │   ".weight",                                                                    │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/serialization.py:988 in load                       │
│                                                                                                  │
│    985 │   if 'encoding' not in pickle_load_args.keys():                                         │
│    986 │   │   pickle_load_args['encoding'] = 'utf-8'                                            │
│    987 │                                                                                         │
│ ❱  988 │   with _open_file_like(f, 'rb') as opened_file:                                         │
│    989 │   │   if _is_zipfile(opened_file):                                                      │
│    990 │   │   │   # The zipfile reader is going to advance the current file position.           │
│    991 │   │   │   # If we want to actually tail call to torch.jit.load, we need to              │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/serialization.py:437 in _open_file_like            │
│                                                                                                  │
│    434                                                                                           │
│    435 def _open_file_like(name_or_buffer, mode):                                                │
│    436 │   if _is_path(name_or_buffer):                                                          │
│ ❱  437 │   │   return _open_file(name_or_buffer, mode)                                           │
│    438 │   else:                                                                                 │
│    439 │   │   if 'w' in mode:                                                                   │
│    440 │   │   │   return _open_buffer_writer(name_or_buffer)                                    │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/serialization.py:418 in __init__                   │
│                                                                                                  │
│    415                                                                                           │
│    416 class _open_file(_opener):                                                                │
│    417 │   def __init__(self, name, mode):                                                       │
│ ❱  418 │   │   super().__init__(open(name, mode))                                                │
│    419 │                                                                                         │
│    420 │   def __exit__(self, *args):                                                            │
│    421 │   │   self.file_like.close()                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
IsADirectoryError: [Errno 21] Is a directory: 'llama-7b-4bit-gs128-awq.pt'
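
The traceback makes the cause clear: load_from_awq_llama dispatches on the file extension, and a path ending in .pt goes straight to torch.load, which cannot open a directory. A hedged sketch of a defensive loader (the directory-resolution step is my assumption, not TensorRT-LLM code; the .pt branch mirrors the snippet above):

import os
import numpy as np
import torch

def load_awq_checkpoint(quant_ckpt_path):
    # quantize.py writes a directory despite the .pt suffix, so pick
    # the rank-0 .npz inside it before dispatching on the extension.
    if os.path.isdir(quant_ckpt_path):
        quant_ckpt_path = os.path.join(quant_ckpt_path, "llama_tp1_rank0.npz")
    if quant_ckpt_path.endswith(".npz"):
        return np.load(quant_ckpt_path)
    # Mirrors the .pt branch shown in the traceback above.
    return torch.load(quant_ckpt_path)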

After updating quant_ckpt_path to point at the .npz file inside that directory, the build command runs successfully without any error:

python build.py --model_dir ./tmp/llama/7B/ \
                --quant_ckpt_path ./llama-7b-4bit-gs128-awq.pt/llama_tp1_rank0.npz \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --use_weight_only \
                --weight_only_precision int4_awq \
                --per_group \
                --output_dir ./tmp/llama/7B/trt_engines/int4_AWQ/1-gpu/

But running the summarization example produces ROUGE scores of 0:

 [11/23/2023-15:06:57] [TRT-LLM] [W] Found pynvml==11.4.1. Please use pynvml>=11.5.0 to get accurate memory usage
[11/23/2023-15:06:57] [TRT-LLM] [I] Load tokenizer takes: 0.05814027786254883 sec
[11/23/2023-15:06:59] [TRT] [I] Loaded engine size: 3506 MiB
[11/23/2023-15:06:59] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +6, GPU +10, now: CPU 3938, GPU 3770 (MiB)
[11/23/2023-15:06:59] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 3940, GPU 3780 (MiB)
[11/23/2023-15:07:00] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +3501, now: CPU 0, GPU 3501 (MiB)
[11/23/2023-15:07:00] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 3956, GPU 5240 (MiB)
[11/23/2023-15:07:00] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3956, GPU 5248 (MiB)
[11/23/2023-15:07:01] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3501 (MiB)
[11/23/2023-15:07:01] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 3990, GPU 5268 (MiB)
[11/23/2023-15:07:01] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 3990, GPU 5278 (MiB)
[11/23/2023-15:07:01] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3501 (MiB)
[11/23/2023-15:07:01] [TRT-LLM] [I] Load engine takes: 3.5130395889282227 sec
[11/23/2023-15:07:02] [TRT-LLM] [I] ---------------------------------------------------------
[11/23/2023-15:07:02] [TRT-LLM] [I] TensorRT-LLM Generated : 
[11/23/2023-15:07:02] [TRT-LLM] [I]  Input : ['(CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV\'s "The Dukes of Hazzard," died Monday after a brief illness. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia, said Steve Latshaw, a longtime friend and Hollywood colleague. Although he\'d been a busy actor for decades in theater and in Hollywood, Best didn\'t become famous until 1979, when "The Dukes of Hazzard\'s" cornpone charms began beaming into millions of American homes almost every Friday night. For seven seasons, Best\'s Rosco P. Coltrane chased the moonshine-running Duke boys back and forth across the back roads of fictitious Hazzard County, Georgia, although his "hot pursuit" usually ended with him crashing his patrol car. Although Rosco was slow-witted and corrupt, Best gave him a childlike enthusiasm that got laughs and made him endearing. His character became known for his distinctive "kew-kew-kew" chuckle and for goofy catchphrases such as "cuff \'em and stuff \'em!" upon making an arrest. Among the most popular shows on TV in the early \'80s, "The Dukes of Hazzard" ran until 1985 and spawned TV movies, an animated series and video games. Several of Best\'s "Hazzard" co-stars paid tribute to the late actor on social media. "I laughed and learned more from Jimmie in one hour than from anyone else in a whole year," co-star John Schneider, who played Bo Duke, said on Twitter. "Give Uncle Jesse my love when you see him dear friend." "Jimmy Best was the most constantly creative person I have ever known," said Ben Jones, who played mechanic Cooter on the show, in a Facebook post. "Every minute of his long life was spent acting, writing, producing, painting, teaching, fishing, or involved in another of his life\'s many passions." Born Jewel Guy on July 26, 1926, in Powderly, Kentucky, Best was orphaned at 3 and adopted by Armen and Essa Best, who renamed him James and raised him in rural Indiana. Best served in the Army during World War II before launching his acting career. In the 1950s and 1960s, he accumulated scores of credits, playing a range of colorful supporting characters in such TV shows as "The Twilight Zone," "Bonanza," "The Andy Griffith Show" and "Gunsmoke." He later appeared in a handful of Burt Reynolds\' movies, including "Hooper" and "The End." But Best will always be best known for his "Hazzard" role, which lives on in reruns. "Jimmie was my teacher, mentor, close friend and collaborator for 26 years," Latshaw said. "I directed two of his feature films, including the recent \'Return of the Killer Shrews,\' a sequel he co-wrote and was quite proud of as he had made the first one more than 50 years earlier." People we\'ve lost in 2015 . CNN\'s Stella Chan contributed to this story.']
[11/23/2023-15:07:02] [TRT-LLM] [I] 
 Reference : ['James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .\n"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .']
[11/23/2023-15:07:02] [TRT-LLM] [I] 
 Output : [['\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n']]
[11/23/2023-15:07:02] [TRT-LLM] [I] ---------------------------------------------------------
[11/23/2023-15:07:26] [TRT-LLM] [I] TensorRT-LLM (total latency: 22.981457471847534 sec)
[11/23/2023-15:07:26] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[11/23/2023-15:07:26] [TRT-LLM] [I]   rouge1 : 0.0
[11/23/2023-15:07:26] [TRT-LLM] [I]   rouge2 : 0.0
[11/23/2023-15:07:26] [TRT-LLM] [I]   rougeL : 0.0
[11/23/2023-15:07:26] [TRT-LLM] [I]   rougeLsum : 0.0
eycheung commented 10 months ago

@kamalkraj I found that this works if you use the release/0.5.0 branch instead of the main branch.

The issue seems to be that this line, https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/tensorrt_llm/models/quantized/ammo.py#L84, was removed on the main branch. That line saved the state dict directly with torch when using int4_awq.

I tried adding this line back to the main branch and it worked again for me.
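
For reference, a rough sketch of the kind of export step the linked 0.5.0 line performed (the variable names here are assumptions, not the verbatim ammo.py code):

import torch

def export_quantized(model, qformat: str, export_path: str):
    # Hypothetical reconstruction of the removed 0.5.0 behavior: for
    # int4_awq, save the quantized state dict as a single .pt file so
    # that build.py's torch.load(quant_ckpt_path) branch can read it.
    if qformat == "int4_awq":
        torch.save(model.state_dict(), export_path)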

kamalkraj commented 10 months ago

Thanks @eycheung

byshiue commented 6 months ago

The issue is fixed in the latest main branch. Closing this bug.