NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Support for Falcon 7B: HF to TRT weight Conversion fails #902

Open amir1m opened 8 months ago

amir1m commented 8 months ago

System Info

CPU: x86_64
GPU: A10
OS: Ubuntu 22.04

Who can help?

@Tracin @byshiue please help.

Reproduction

  1. I have a PEFT fine-tuned Falcon instruct model, loaded in 4-bit as follows:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model_name = "tiiuae/falcon-7b-instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
    model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quant_config, device_map="auto", cache_dir="./model_cache/falcon7B_1", trust_remote_code=True)
    model.gradient_checkpointing_enable()
  2. Then get the PEFT model and fine-tune it on the GPU server:

from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["dense_4h_to_h", "dense", "query_key_value", "dense_h_to_4h"]
)

model = get_peft_model(model, config)
  3. This model is then fine-tuned on a custom dataset.
  4. Merge the base model with the PEFT adapters to get the entire model in one directory.
  5. Use the convert utility from TensorRT-LLM to convert the HF weights to TRT-LLM tensors:

python3 convert_checkpoint.py --model_dir ./merged/ --dtype bfloat16 --output_dir ./trt_ckpt/bf16/1-gpu/

0.7.1
[01/17/2024-13:10:40] WARNING: You are currently loading Falcon using legacy code contained in the model repository. Falcon has now been fully ported into the Hugging Face transformers library. For the most up-to-date and high-performance version of the Falcon model code, please update to the latest version of transformers and then load the model without the trust_remote_code=True argument.

Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.98it/s]
Traceback (most recent call last):
  File "/code/tensorrt_llm/examples/falcon/convert_checkpoint.py", line 1120, in <module>
    covert_and_save(rank)
  File "/code/tensorrt_llm/examples/falcon/convert_checkpoint.py", line 1097, in covert_and_save
    weights = convert_hf_falcon(
  File "/code/tensorrt_llm/examples/falcon/convert_checkpoint.py", line 379, in convert_hf_falcon
    qkv_w = split_qkv_weight(qkv_weight,
  File "/code/tensorrt_llm/examples/falcon/convert_checkpoint.py", line 265, in split_qkv_weight
    weight = reorder_qkv_weight_or_bias(weight,
  File "/code/tensorrt_llm/examples/falcon/convert_checkpoint.py", line 214, in reorder_qkv_weight_or_bias
    assert weight.shape[0] == num_kv_heads * num_group_heads * head_dim, \
AssertionError: 4672 != 71 * 3 * 64

6. After changing num_kv_heads=1 at line 214 of /code/tensorrt_llm/examples/falcon/convert_checkpoint.py (just to test), the conversion proceeds and completes.
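
For reference, here is the shape arithmetic behind the failing assert (my own back-of-the-envelope check, not output from the converter): Falcon-7B uses multi-query attention, so the fused QKV weight holds 71 query heads plus a single shared K and V head.

    # Back-of-the-envelope check of the fused QKV shape for Falcon-7B (multi-query attention).
    num_heads = 71    # query heads
    num_kv_heads = 1  # single shared KV head
    head_dim = 64     # hidden_size 4544 / 71 heads

    # Rows of the fused QKV weight in the HF checkpoint: Q for every head, plus one K and one V.
    mqa_rows = (num_heads + 2 * num_kv_heads) * head_dim
    print(mqa_rows)   # 4672 -> matches the actual weight

    # What the converter asserted, with num_kv_heads taken as 71 (i.e. full multi-head attention):
    mha_rows = 71 * 3 * head_dim
    print(mha_rows)   # 13632

The (4672, 4544) vs (13632, 4544) mismatch in the build error below is the same MQA-vs-MHA discrepancy.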

**7. Build the TRT Engine:**

root@myhostname-release:/code/tensorrt_llm/examples/falcon# trtllm-build --checkpoint_dir ./trt_ckpt/bf16/1-gpu/ --use_gemm_plugin bfloat16 --remove_input_padding --use_gpt_attention_plugin bfloat16 --enable_context_fmha --output_dir ./trt_engines/bf16/1-gpu/
[01/17/2024-13:29:18] [TRT-LLM] [I] Context FMHA Enabled
[01/17/2024-13:29:18] [TRT-LLM] [I] Remove Padding Enabled
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 354, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 284, in parallel_build
    build_and_save(rank, rank % workers, ckpt_dir, build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 258, in build_and_save
    engine = build(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 243, in build
    model = model_cls.from_checkpoint(ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 347, in from_checkpoint
    model.load(weights)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 363, in load
    param.value = weights[name]
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 113, in value
    assert v.shape == self._shape, \
AssertionError: The value updated is not the same shape as the original. Updated: (4672, 4544), original: (13632, 4544)

8. After commenting out the assert at line 113 of /usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py (again, just to test), the engine builds.

9. Now, when I run run.py on this engine with my tokenizer, it just generates junk characters.

These same steps worked without any issue when I tried them last month on another machine. At that time, there was a build.py script that had to be used to build the engine.

Is this a regression?

### Expected behavior

For a PEFT fine-tuned Falcon-7B model:
1. Conversion from HF weights to a TRT-LLM checkpoint should work.
2. Building the TRT engine should work.

### Actual behavior

While converting from HF to TRT, the error thrown is:

assert weight.shape[0] == num_kv_heads * num_group_heads * head_dim, \
AssertionError: 4672 != 71 * 3 * 64



### additional notes

This is a possible regression as I was able to build the Falcon 7B TRT engine successfully on another machine last month.
syuoni commented 8 months ago

Hi, could you give a script to reproduce the model in "./merged/"? The training step 3 can be skipped.

amir1m commented 8 months ago

> Hi, could you give a script to reproduce the model in "./merged/"? The training step 3 can be skipped.

Hi @syuoni, thanks for your response! I am unable to share the model. However, what I meant to say in step 3 is that we fine-tuned the model on our own dataset.

I tried again today and still got the same result.

syuoni commented 8 months ago

Hi @amir1m, sorry, it seems my question was a little unclear. I meant that you could give a script for steps 1, 2 and 4, which should produce a model in "./merged/", so that I can quickly reproduce the issue you encountered.

You don't need to share the model, so I said "the fine-tuning step 3 can be skipped" (as it relates to your custom dataset). But the model in "./merged/" should have exactly the same structure as the model you have.
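
For example, something roughly like this (an untested sketch that just mirrors the snippets in the issue description, with the training loop omitted; I am assuming merge_and_unload() was used for the merge step):

    # Untested sketch of steps 1, 2 and 4 only (no fine-tuning), ending with a model in "./merged/".
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    model_name = "tiiuae/falcon-7b-instruct"
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
    model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=quant_config, device_map="auto", trust_remote_code=True
    )

    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
        target_modules=["dense_4h_to_h", "dense", "query_key_value", "dense_h_to_4h"],
    )
    model = get_peft_model(model, lora_config)

    # Step 4: merge the (here untrained) LoRA adapters back into the base model and save.
    # Note that the base model is still the 4-bit quantized one at this point.
    merged = model.merge_and_unload()
    merged.save_pretrained("./merged/")
    tokenizer.save_pretrained("./merged/")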

syuoni commented 8 months ago

I see. The base model in "./merged/" is quantized.

You can reload falcon-7b in bf16 precision, merge the LoRA weights into it, and then save the merged model to "./merged-bf16/".
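
For instance, something along these lines (an untested sketch; "./falcon7b-lora-adapters/" is a placeholder for wherever the adapters were saved):

    # Untested sketch: merge the LoRA adapters into an unquantized bf16 base model.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(
        "tiiuae/falcon-7b-instruct", torch_dtype=torch.bfloat16
    )
    model = PeftModel.from_pretrained(base, "./falcon7b-lora-adapters/")  # placeholder adapter dir
    model = model.merge_and_unload()  # fold the LoRA deltas into the bf16 base weights
    model.save_pretrained("./merged-bf16/")

    tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")
    tokenizer.save_pretrained("./merged-bf16/")

Then, run: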

python3 convert_checkpoint.py --model_dir ./merged-bf16/ --dtype bfloat16 --output_dir ./trt_ckpt/bf16/1-gpu/