Open janpetrov opened 1 month ago
Could you share which model you are using?
Thank you. https://huggingface.co/meta-llama/Llama-2-70b-hf , finetuned (without any change in architecture) and exported in bfloat16.
It looks like the rope_scaling of llama-2-70b-hf is null:
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 32000
}
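For reference, this setting can also be checked programmatically with Hugging Face transformers (a minimal sketch, assuming access to the gated meta-llama repo):

import transformers

# Check rope_scaling without opening config.json by hand
# (requires access to the gated meta-llama repo).
cfg = transformers.AutoConfig.from_pretrained("meta-llama/Llama-2-70b-hf")
print(cfg.rope_scaling)  # prints None for the base Llama-2-70b checkpoint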
Please excuse that I did not mention this explicitly earlier. We finetuned the model with the rope scaling changed. Please see below the config.json for our finetuned model, saved in the Hugging Face format (this is the model in the $MODEL_DIR directory referred to above in the
python examples/quantization/quantize.py \
--model_dir "$MODEL_DIR"
part).
{
  "_name_or_path": "OUR_PATH_HERE",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 4.0,
    "type": "linear"
  },
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.39.1",
  "use_cache": false,
  "vocab_size": 32000
}
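For context, a minimal sketch of what this rope_scaling block means, assuming the standard linear RoPE scaling formula rather than the exact HF implementation: position indices are divided by the factor before the rotary angles are computed, so the finetuned model can address sequences roughly factor times longer than its pretraining context.

import torch

# Minimal sketch of linear RoPE scaling (assumed formula, not the exact HF code):
# with {"type": "linear", "factor": 4.0}, positions are divided by 4 before the
# rotary angles are computed.
def rope_angles(positions, head_dim, theta=10000.0, factor=1.0):
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    scaled = positions.float() / factor           # the "linear" scaling step
    return torch.outer(scaled, inv_freq)          # shape: (seq_len, head_dim / 2)

# With factor 4.0, position 8188 gets the same angles position 2047 would get unscaled.
angles = rope_angles(torch.arange(8192), head_dim=128, factor=4.0)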
Thank you for the reply. I tried to change the config.json of an existing HF model, but that fails during conversion. So it looks like I cannot change the config directly to reproduce this issue; I need a finetuned model that was actually tuned with rope scaling. Do you know of any model with a non-null rope_scaling that would help reproduce the issue?
Thank you for your reply. Please give me a few days; I will prepare for you (simple instructions on how to obtain) a model with rope_scaling in its config.json that converts.
The deepseek-coder 33b model uses rope scaling and also the llama architecture, and it has the same problem described here; maybe you can try this model directly: https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct/blob/main/config.json
Thank you for sharing. I will give it a try.
Unfortunately, TRT-LLM does not support deepseek yet, and hence I cannot reproduce the issue on the checkpoint.
You may use the Llama workflow for Deepseek models. It works for int8 weight-only quant (engine build + inference), which is provided by llama/convert_checkpoint.py (you have to specify the RoPE params). However, the FP8 quant provided by quantization/quantize.py has the same problem described here, i.e. the engine build works but inference generates nonsense.
Hi @byshiue. FYI you can also try this model: https://huggingface.co/Yukang/LongAlpaca-70B. It is mentioned at https://github.com/NVIDIA/TensorRT-LLM/tree/db4edea/examples/llama#long-context-length.
Thank you for sharing. We can reproduce the issue now and are investigating it.
System Info
4x NVIDIA H100, TensorRT-LLM backend 0.9.0
Who can help?
@Tracin
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
(1) Have a HF transformers model with linear rope scaling.
(2) Edit /usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py, is_linear, adding the and ("Rotary"... part, so that the rope-scaling model is exported without crashing on an error that weights cannot be exported from the Rotary scaling layer (see this issue); a hedged sketch of that edit follows after these steps.
(3) Then run quantize.py, as recommended here.
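For step (2), a hedged sketch of the kind of edit meant; the original body of is_linear in ammo's layer_utils.py may differ, but the point of the extra check is that LlamaLinearScalingRotaryEmbedding has "Linear" in its class name and would otherwise be treated as a linear weight layer during export:

import torch.nn as nn

# Hedged sketch of the is_linear tweak from step (2); the original function body
# in ammo/torch/export/layer_utils.py may differ.
def is_linear(module: nn.Module) -> bool:
    name = type(module).__name__
    # The extra "Rotary" check keeps LlamaLinearScalingRotaryEmbedding (whose class
    # name contains "Linear") from being exported as a linear weight layer.
    return ("Linear" in name) and ("Rotary" not in name)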
Expected behavior
quantize.py should generate a detailed config.json file in the output dir. The subsequent run of trtllm-build should build a well-working engine.
actual behavior
The config.json generated by quantize.py contains just the following (note, e.g., that the rope scaling is missing). The engine built by trtllm-build generates nonsense.
additional notes
When I edit the config.json to have the following contents and then re-run trtllm-build, the resulting engine starts to generate fine text.
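A hedged sketch of that manual edit, scripted; the key names "rotary_base" and "rotary_scaling" and both paths are assumptions, so compare them against a config.json produced by examples/llama/convert_checkpoint.py (see the note below) before relying on them:

import json

hf_config_path = "hf_model/config.json"            # hypothetical path to the HF model config
trt_config_path = "quantized_ckpt/config.json"     # hypothetical path to the quantize.py output

with open(hf_config_path) as f:
    hf_cfg = json.load(f)
with open(trt_config_path) as f:
    trt_cfg = json.load(f)

# Copy the RoPE settings from the HF config into the quantized checkpoint config;
# the target key names are assumptions, not confirmed TRT-LLM schema.
trt_cfg["rotary_base"] = hf_cfg.get("rope_theta", 10000.0)
trt_cfg["rotary_scaling"] = hf_cfg.get("rope_scaling")    # e.g. {"factor": 4.0, "type": "linear"}

with open(trt_config_path, "w") as f:
    json.dump(trt_cfg, f, indent=2)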
Please note that when the input to trtllm-build is generated by examples/llama/convert_checkpoint.py (and not by examples/quantization/quantize.py), the config.json looks as follows. This is for the same model but without quantization. Note the much richer data, including the rotary scaling.