NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

quantize.py fails to export important data to config.json (eg rotary scaling) #1676

Open janpetrov opened 1 month ago

janpetrov commented 1 month ago

System Info

4x NVIDIA H100, TensorRT-LLM backend 0.9.0

Who can help?

@Tracin


Reproduction

(1) Have a HF transformers model with linear rope scaling.

(2) Edit is_linear in /usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py as follows (adding the and ("Rotary" not in ...) condition):

def is_linear(module: nn.Module) -> bool:
    """Returns whether the module is a linear layer."""
    # The added "Rotary" clause skips rotary-embedding modules whose class name
    # also contains "Linear" (e.g. the HF linear-scaling rotary embedding).
    return any([k in type(module).__name__ for k in ["Linear", "Conv1D", "NormHead"]]) and (
        "Rotary" not in type(module).__name__
    )

so that the rope-scaling model is exported without crashing on an error that weights cannot be exported from the rotary scaling layer (see this issue).
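
A quick way to confirm the patch behaves as intended (a sketch; the class below is only a stand-in mirroring the HF class name that contains both 'Linear' and 'Rotary'):

import torch.nn as nn
from ammo.torch.export.layer_utils import is_linear  # the patched helper above

class LlamaLinearScalingRotaryEmbedding(nn.Module):
    """Stand-in named like the HF rotary-embedding class that tripped the export."""
    pass

print(is_linear(nn.Linear(8, 8)))                      # True: real linear layers are still exported
print(is_linear(LlamaLinearScalingRotaryEmbedding()))  # False: rotary modules are now skipped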

(3) Then run, as recommended here:

python examples/quantization/quantize.py \
    --model_dir "$MODEL_DIR" \
    --dtype bfloat16 \
    --output_dir "$TMP_DIR" \
    --tp_size 2 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --calib_size 512
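
After the run, the exported checkpoint config can be inspected for the rope settings (a sketch; assumes $TMP_DIR from the command above is exported in the environment):

import json
import os

with open(os.path.join(os.environ["TMP_DIR"], "config.json")) as f:
    cfg = json.load(f)

# Expected {"factor": 4.0, "type": "linear"}; with the affected version this prints None.
print(cfg.get("rotary_scaling"))
print(cfg.get("position_embedding_type"))  # "rope_gpt_neox"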

Expected behavior

quantize.py should generate a detailed config.json file in the output dir. The subsequent run of

trtllm-build \
    --checkpoint_dir "$TMP_DIR" \
    --gpt_attention_plugin bfloat16 \
    --gemm_plugin bfloat16 \
    --max_input_len 16384 \
    --max_output_len 16384 \
    --max_batch_size 8 \
    --strongly_typed \
    --workers 2 \
    --output_dir "$OUTPUT_DIR" \
    --multi_block_mode enable

should build a working engine.

Actual behavior

The config.json generated by quantize.py contains only the following (note, for example, that the rotary scaling is missing). The engine built by trtllm-build generates nonsense.

{
    "producer": {
        "name": "ammo",
        "version": "0.7.4"
    },
    "architecture": "LlamaForCausalLM",
    "dtype": "bfloat16",
    "num_hidden_layers": 80,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "hidden_size": 8192,
    "norm_epsilon": 1e-05,
    "vocab_size": 32000,
    "max_position_embeddings": 4096,
    "hidden_act": "silu",
    "use_parallel_embedding": true,
    "embedding_sharding_dim": 0,
    "quantization": {
        "quant_algo": "FP8",
        "kv_cache_quant_algo": "FP8"
    },
    "mapping": {
        "world_size": 2,
        "tp_size": 2,
        "pp_size": 1
    },
    "head_size": 128,
    "intermediate_size": 28672,
    "position_embedding_type": "rope_gpt_neox",
    "rotary_base": 10000.0
}

Additional notes

When I edit the config.json to have the following contents and then re-run trtllm-build, the resulting engine generates correct text (a small script that applies the key part of this edit is sketched after the JSON below).

{
    "producer": {
        "name": "ammo",
        "version": "0.7.4"
    },
    "architecture": "LlamaForCausalLM",
    "dtype": "bfloat16",
    "logits_dtype": "float32",
    "vocab_size": 32000,
    "max_position_embeddings": 4096,
    "hidden_size": 8192,
    "num_hidden_layers": 80,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "head_size": 128,
    "hidden_act": "silu",
    "intermediate_size": 28672,
    "norm_epsilon": 1e-05,
    "position_embedding_type": "rope_gpt_neox",
    "use_parallel_embedding": true,
    "embedding_sharding_dim": 0,
    "mapping": {
        "world_size": 2,
        "tp_size": 2,
        "pp_size": 1
    },
    "quantization": {
        "quant_algo": "FP8",
        "kv_cache_quant_algo": "FP8"
    },
    "rotary_scaling": {
        "factor": 4.0,
        "type": "linear"
    },
    "moe_normalization_mode": null,
    "rotary_base": 10000.0,
    "moe_num_experts": 0,
    "moe_top_k": 0,
    "moe_tp_mode": 2,
    "attn_bias": false,
    "disable_weight_only_quant_plugin": false,
    "mlp_bias": false
}
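
Instead of editing the file by hand, the missing rope settings can also be copied from the original HF config with a short script (a sketch; it restores only rotary_scaling, since the other fields added above are defaults):

import json
import os

hf_cfg_path = os.path.join(os.environ["MODEL_DIR"], "config.json")   # HF model config
trt_cfg_path = os.path.join(os.environ["TMP_DIR"], "config.json")    # quantize.py output

with open(hf_cfg_path) as f:
    hf_cfg = json.load(f)
with open(trt_cfg_path) as f:
    trt_cfg = json.load(f)

# HF calls the field "rope_scaling"; the TRT-LLM checkpoint config expects "rotary_scaling".
rope = hf_cfg.get("rope_scaling")
if rope is not None:
    trt_cfg["rotary_scaling"] = {"factor": rope["factor"], "type": rope["type"]}

with open(trt_cfg_path, "w") as f:
    json.dump(trt_cfg, f, indent=4)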

Please note that when the input to trtllm-build is generated by examples/llama/convert_checkpoint.py (and not by examples/quantization/quantize.py), the config.json looks as follows. This is for the same model but without quantization. Note the much richer data, including the rotary scaling.

 {
    "architecture": "LlamaForCausalLM",
    "dtype": "bfloat16",
    "logits_dtype": "float32",
    "vocab_size": 32000,
    "max_position_embeddings": 4096,
    "hidden_size": 8192,
    "num_hidden_layers": 80,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "head_size": 128,
    "hidden_act": "silu",
    "intermediate_size": 28672,
    "norm_epsilon": 1e-05,
    "position_embedding_type": "rope_gpt_neox",
    "use_parallel_embedding": false,
    "embedding_sharding_dim": 0,
    "share_embedding_table": false,
    "mapping": {
        "world_size": 4,
        "tp_size": 4,
        "pp_size": 1
    },
    "quantization": {
        "quant_algo": null,
        "kv_cache_quant_algo": null,
        "group_size": 128,
        "smoothquant_val": null,
        "has_zero_point": false,
        "pre_quant_scale": false,
        "exclude_modules": [
            "lm_head"
        ]
    },
    "kv_dtype": "bfloat16",
    "rotary_scaling": {
        "factor": 4.0,
        "type": "linear"
    },
    "moe_normalization_mode": null,
    "rotary_base": 10000.0,
    "moe_num_experts": 0,
    "moe_top_k": 0,
    "moe_tp_mode": 2,
    "attn_bias": false,
    "disable_weight_only_quant_plugin": false,
    "mlp_bias": false
}
byshiue commented 1 month ago

Could you share which model you use?

janpetrov commented 1 month ago

Thank you. It is https://huggingface.co/meta-llama/Llama-2-70b-hf, fine-tuned (without any change in architecture) and exported in bfloat16.

byshiue commented 1 month ago

It looks like the rope_scaling of llama-2-70b-hf is null:

{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 32000
}
janpetrov commented 1 month ago

Please excuse me for not mentioning this explicitly earlier: we fine-tuned the model with a changed rope scaling. Below is the config.json of our fine-tuned model saved in the Hugging Face format (this is what sits in the $MODEL_DIR directory referred to above, see the

python examples/quantization/quantize.py \
    --model_dir "$MODEL_DIR"

part.

{
  "_name_or_path": "OUR_PATH_HERE",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 4.0,
    "type": "linear"
  },
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.39.1",
  "use_cache": false,
  "vocab_size": 32000
}
byshiue commented 1 month ago

Thank you for the reply. I tried changing the config.json of an existing HF model, but that leads to a failure during conversion. So it looks like I cannot change the config directly to reproduce this issue; I need a model that was actually fine-tuned with rope scaling. Do you know of any model with a non-null rope_scaling that would help reproduce the issue?

janpetrov commented 1 month ago

Thank you for your reply. Please give me a few days; I will prepare (simple instructions on how to obtain) a model with rope_scaling in config.json that converts.
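
In the meantime, a tiny randomly initialized Llama checkpoint with a non-null rope_scaling may already be enough to exercise the export path (a minimal sketch; the sizes and output path are arbitrary assumptions, and a tokenizer is still needed for calibration):

import torch
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    hidden_size=256,
    intermediate_size=512,
    num_hidden_layers=2,
    num_attention_heads=8,
    num_key_value_heads=8,
    max_position_embeddings=4096,
    vocab_size=32000,
    rope_scaling={"type": "linear", "factor": 4.0},  # the field that quantize.py drops
)
model = LlamaForCausalLM(config).to(torch.bfloat16)
model.save_pretrained("tiny-llama-rope-scaling")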

wxsms commented 4 weeks ago

The deepseek-coder 33b model uses rope scaling and the llama architecture, and it has the same problem described here. Maybe you can try this model directly: https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct/blob/main/config.json

byshiue commented 3 weeks ago

Thank you for sharing. I will give it a try.

byshiue commented 3 weeks ago

Unfortunately, TRT-LLM does not support deepseek yet, and hence I cannot reproduce the issue on the checkpoint.

wxsms commented 3 weeks ago

Unfortunately, TRT-LLM does not support deepseek yet, and hence I cannot reproduce the issue on the checkpoint.

You may use the Llama workflow for Deepseek models. It works for int8 weight-only quant (engine build + inference), as provided by llama/convert_checkpoint.py (you have to specify the RoPE params).

However, the FP8 quant provided by quantization/quantize.py has the same problem described here, i.e. the engine build works but inference generates nonsense.

chenxu2048 commented 2 weeks ago

Thank you for the reply. I tried changing the config.json of an existing HF model, but that leads to a failure during conversion. So it looks like I cannot change the config directly to reproduce this issue; I need a model that was actually fine-tuned with rope scaling. Do you know of any model with a non-null rope_scaling that would help reproduce the issue?

Hi @byshiue. FYI you can also try this model: https://huggingface.co/Yukang/LongAlpaca-70B. It is mentioned at https://github.com/NVIDIA/TensorRT-LLM/tree/db4edea/examples/llama#long-context-length.

byshiue commented 2 weeks ago

Thank you for sharing. We can now reproduce the issue and are investigating it.