NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Model Performance Degraded when using BFLOAT16 LoRA Adapters #1957

Open TheCodeWrangler opened 1 month ago

TheCodeWrangler commented 1 month ago

System Info

2X L4 GPUs

Docker Image: nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3

Who can help?

@juney-nvidia @kaiyux

Information

Tasks

Reproduction

I have a fine-tuned set of weights trained using Hugging Face.

I have updated the base model's config.json:

   "rope_scaling": {
        "type": "linear",
        "factor": 1.75
    },
    "rope_theta": 875000,  

I then compiled the base model from within nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3 by running:

python3 convert_checkpoint.py \
--model_dir ${BASE_MODEL_DIR} \
--output_dir /converted_base_model \
--rotary_base 875000 \
--dtype bfloat16 \
--tp_size 2

trtllm-build \
--checkpoint_dir /converted_base_model \
--max_input_len=13568 \
--max_num_tokens=14336 \
--max_output_len=768 \
--tp_size 2 \
--max_batch_size 4 \
--max_beam_width 3 \
--lora_plugin bfloat16 \
--gemm_plugin bfloat16 \
--lora_target_modules attn_q attn_k attn_v attn_dense mlp_h_to_4h mlp_gate mlp_4h_to_h \
--max_lora_rank 32 \
--gpt_attention_plugin bfloat16 \
--paged_kv_cache enable \
--use_paged_context_fmha enable \
--multi_block_mode enable \
--remove_input_padding enable \
--use_custom_all_reduce disable \
--cluster_key L4 \
--workers=2 \
--context_fmha enable \
--lookup_plugin bfloat16 \
--enable_xqa enable \
--output_dir ${ENGINE_DIR}

I am then performing generations using Triton Inference Server with the warmups described above.

Generated outputs differ significantly from those produced by the same model in Hugging Face.

If the same process is repeated, but the model is first "merged and unloaded" before compilation and then served without LoRA weights, I get the exact same output from Triton/TensorRT-LLM.
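
For context, the "merged and unloaded" variant refers to folding the LoRA delta into the base weights with PEFT before compilation; a minimal sketch, assuming PEFT and Transformers are available and with placeholder paths:

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Sketch of the "merge and unload" variant: bake the LoRA delta into the base
# weights before running convert_checkpoint.py / trtllm-build.
# All paths are placeholders.
base = AutoModelForCausalLM.from_pretrained("/base_model", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "/lora_adapter")
merged = model.merge_and_unload()  # applies W + (alpha / r) * B @ A in place
merged.save_pretrained("/merged_model")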

Expected behavior

Outputs of the model are the same with LoRA weights as they are with a merged-and-unloaded model. These results are also expected to nearly match the results when run in Hugging Face.
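
For completeness, the Hugging Face reference outputs can be produced along these lines (a sketch; paths, prompt, and generation settings are placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Sketch of producing the Hugging Face reference generations that the
# TensorRT-LLM outputs are compared against. Paths and prompt are placeholders.
tokenizer = AutoTokenizer.from_pretrained("/base_model")
base = AutoModelForCausalLM.from_pretrained(
    "/base_model", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "/lora_adapter")

inputs = tokenizer("example prompt", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=768, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))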

Actual behavior

ROUGE-2 scores between the Hugging Face outputs and the LoRA-weight-served model are below 0.6 (other metrics would also demonstrate the large shift in outputs that is occurring).
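
A sketch of that comparison using the rouge_score package; the strings below are placeholders for the actual generations:

from rouge_score import rouge_scorer

# Placeholders for the real Hugging Face and Triton/TensorRT-LLM generations.
hf_outputs = ["reference output generated with Hugging Face"]
trt_outputs = ["output generated by the LoRA-enabled TensorRT-LLM engine"]

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
scores = [
    scorer.score(ref, hyp)["rouge2"].fmeasure
    for ref, hyp in zip(hf_outputs, trt_outputs)
]
print(sum(scores) / len(scores))  # observed average in my runs is below 0.6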

Additional notes

I noticed the scale being applied to the "out" weights in the hf_lora_convert.py script. It appears that the "A" and "B" matrices (Hugging Face weight notation) correspond to "in" and "out" in TensorRT-LLM notation.

From looking at the RS scaling LoRA paper [equation 2], it seems that I should be able to get the same results by applying the scaling to either "A/in" or "B/out". In practice, applying the scaling to B gives results similar to the fine-tuning objective (but still significantly shifted), while applying the scaling only to "A/in" results in seemingly random token generation.
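
To illustrate why either placement should be equivalent in theory, a small NumPy sketch (shapes and values are made up for illustration):

import numpy as np

# Sketch: the LoRA delta is (alpha / r) * B @ A, so folding the scalar into
# either factor ("B/out" or "A/in") is mathematically equivalent.
rng = np.random.default_rng(0)
r, d_in, d_out, alpha = 32, 4096, 4096, 16.0
A = rng.standard_normal((r, d_in))   # Hugging Face lora_A -> TensorRT-LLM "in"
B = rng.standard_normal((d_out, r))  # Hugging Face lora_B -> TensorRT-LLM "out"
scale = alpha / r

delta_scale_on_out = (scale * B) @ A
delta_scale_on_in = B @ (scale * A)
print(np.allclose(delta_scale_on_out, delta_scale_on_in))  # True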

fan-niu commented 1 month ago

@kaiyux I am seeing the same issue with llama-3-8b with rope scaling added. Can you help solve this problem?

juney-nvidia commented 1 month ago

Thanks for reporting this; our engineer will start looking into this issue soon.

TheCodeWrangler commented 1 month ago

Any updates? I see a new issue that looks the same as well. In my case I have now tried with the 24.07 tag, and the results are the same.

TheCodeWrangler commented 4 weeks ago

Wondering if there is any progress?