NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[Feature] quantize_by_modelopt.py get_tokenizer is not suitable for CodeQwen1.5 7B Chat #1953

Closed: Yuchen-Cao closed this issue 1 month ago

Yuchen-Cao commented 1 month ago

System Info

GPU: NVIDIA L20

Who can help?

No response

Reproduction

I am trying to quantize CodeQwen1.5 7B Chat to FP8 using a modified version of the example quantization script:

python quantization/quantize.py --model_dir /mnt/models/CodeQwen1.5-7B-Chat \
                                --dtype float16 \
                                --qformat fp8 \
                                --kv_cache_dtype fp8 \
                                --output_dir /mnt/trt_models/codeqwen1.5_7b_checkpoint_1gpu_fp8_fp8kv \
                                --calib_size 512 \
                                --calib_dataset /mnt/dataset/cnn_dailymail

Expected behavior

The top-level quantize.py calls quantize_and_export() to run the quantization; that function is defined in https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/quantization/quantize_by_modelopt.py

Inside it, get_tokenizer should automatically read the tokenizer from my model_dir and set the pad_token as well as the eos_token.
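
Concretely, I would expect something equivalent to the following minimal sketch to succeed (it uses the same arguments that get_tokenizer() passes; the eos-token fallback is my reading of the intent, not the exact upstream code):

from transformers import AutoTokenizer

# Minimal sketch of the expected outcome (my assumption of the intent, not the
# upstream implementation): the tokenizer loads from model_dir and ends up
# with a usable pad_token, falling back to eos_token if none is defined.
tokenizer = AutoTokenizer.from_pretrained(
    "/mnt/models/CodeQwen1.5-7B-Chat",
    model_max_length=2048,
    padding_side="left",
    trust_remote_code=True,
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
assert tokenizer.pad_token is not None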

Actual behavior

However, it fails to set the pad_token:

[07/16/2024-13:46:30] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
[07/16/2024-13:46:30] [TRT-LLM] [I] Starting TensorRT-LLM init.
[TensorRT-LLM][INFO] Set logger level by INFO
[07/16/2024-13:46:30] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061100
Initializing model from /mnt/models/CodeQwen1.5-7B-Chat
[07/16/2024-13:47:14] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:28<00:00,  7.20s/it]
[TensorRT-LLM][WARNING] The manually set model data type is torch.float16, but the data type of the HuggingFace model is torch.bfloat16.
Initializing tokenizer from /mnt/models/CodeQwen1.5-7B-Chat
Traceback (most recent call last):
  File "quantization/quantize.py", line 90, in <module>
    quantize_and_export(
  File "/opt/conda/lib/python3.8/site-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 289, in quantize_and_export
    tokenizer = get_tokenizer(model_dir,
  File "/opt/conda/lib/python3.8/site-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 147, in get_tokenizer
    assert tokenizer.pad_token is not None, f"Pad token for {model_type} cannot be set!"
AssertionError: Pad token for qwen cannot be set!

Additional notes

To get this case to work, I commented out all of the lines in get_tokenizer() except for the AutoTokenizer.from_pretrained() call:

def get_tokenizer(ckpt_path, max_seq_length=2048, model_type=None):
    print(f"Initializing tokenizer from {ckpt_path}")
    tokenizer = AutoTokenizer.from_pretrained(
        ckpt_path,
        model_max_length=max_seq_length,
        padding_side="left",
        trust_remote_code=True,
    )
    # if model_type and model_type == "qwen":
    #     # qwen use token id 151643 as pad and eos tokens
    #     tokenizer.pad_token = tokenizer.convert_ids_to_tokens(151643)
    #     tokenizer.eos_token = tokenizer.convert_ids_to_tokens(151643)

    # # can't set attribute 'pad_token' for "<unk>"
    # if tokenizer.pad_token != "<unk>":  # nosec B105
    #     tokenizer.pad_token = tokenizer.eos_token
    # if tokenizer.pad_token is None:
    #     tokenizer.pad_token = tokenizer.eos_token
    # assert tokenizer.pad_token is not None, f"Pad token for {model_type} cannot be set!"

    return tokenizer

I know that commenting out these lines will certainly affect the conversion of other models, so get_tokenizer() seems to need a proper fix to support CodeQwen1.5.
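
A less invasive fix might be to apply the hard-coded Qwen token id only when the tokenizer can actually resolve it, and otherwise fall back to the eos token. A sketch of what I have in mind (my suggestion, not the upstream patch):

from transformers import AutoTokenizer


def get_tokenizer(ckpt_path, max_seq_length=2048, model_type=None):
    """Sketch of a possible fix (my suggestion, not the upstream implementation)."""
    print(f"Initializing tokenizer from {ckpt_path}")
    tokenizer = AutoTokenizer.from_pretrained(
        ckpt_path,
        model_max_length=max_seq_length,
        padding_side="left",
        trust_remote_code=True,
    )

    if model_type and model_type == "qwen":
        # Qwen models use token id 151643 as pad and eos, but CodeQwen1.5 ships
        # a different tokenizer in which this id may not resolve to a token.
        try:
            token = tokenizer.convert_ids_to_tokens(151643)
        except (IndexError, OverflowError):
            token = None
        if isinstance(token, str):
            tokenizer.pad_token = token
            tokenizer.eos_token = token

    # Generic fallback: reuse the eos token for padding when no pad token is set.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    assert tokenizer.pad_token is not None, f"Pad token for {model_type} cannot be set!"
    return tokenizer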

QiJune commented 1 month ago

@Tracin Could you please have a look? Thanks

Tracin commented 1 month ago

@Yuchen-Cao Thanks! We have fixed this.