Medusa with Mixtral 8x7B

v-dicicco commented 1 week ago

Hello! Does TensorRT-LLM supports Medusa with Mixtral 8x7B?

My understanding is that right now the Medusa convert_checkpoint.py doesn't support Mixtral (e.g: it lacks the moe config and also other MoE related arguments contained in the LLama conversion script) but I have the feeling it should (in theory) work since MedusaForCausalLm is based on LLaMAForCausalLM, and that the convert_checkpoint.py of Medusa can be aligned to the one used by LLama (for some specific configurations).

Would be helpful any hints in this direction :)

Thanks!

nv-guomingz commented 1 week ago

If the mixtral 8x7B has its own mesuda_model likemedusa-vicuna-7b-v1.3 for vicuna-7b-v1.3, then we can have a try on enabling meduas for MoE model.

v-dicicco commented 1 week ago

Thanks for the answer! There is a medusa model already trained for Mixtral-Instruct v0.1 here https://huggingface.co/text-generation-inference/Mixtral-8x7B-Instruct-v0.1-medusa for TGI

The format is the same as the original vicuna heads, since I was able to use the heads for miStral from the same project with TRT-LLM (https://huggingface.co/text-generation-inference/Mistral-7B-Instruct-v0.2-medusa).

Do you already see other challenges other than fixing the convert_checkpoint.py script of Medusa? I'm trying to work on it right now

nv-guomingz commented 1 week ago

Thanks for the answer! There is a medusa model already trained for Mixtral-Instruct v0.1 here https://huggingface.co/text-generation-inference/Mixtral-8x7B-Instruct-v0.1-medusa for TGI

The format is the same as the original vicuna heads, since I was able to use the heads for miStral from the same project with TRT-LLM (https://huggingface.co/text-generation-inference/Mistral-7B-Instruct-v0.2-medusa).

Do you already see other challenges other than fixing the convert_checkpoint.py script of Medusa? I'm trying to work on it right now

No, We don't investigate the MoE + medusa yet. But my gut tells me that it could be done once we add the missing MoE supporting in convert_checkpoint.py.

skyCreateXian commented 1 week ago

The construction of baichuan2-7b medusa engine has been completed. Based on experience, the following suggestions are made:

You can refer to it https://github.com/NVIDIA/TensorRT-LLM/tree/2a115dae84f13daaa54727534daa837c534eceb4/examples/mixtral Checkpoint of llama called
According to medusa/checkpoint 2.1 Add functions such as medusa head load 2.2 Modify the content of medusa config in the code. Required options: 'architecture': 'MedusaForCausalLM'. Modify other configurations as needed

v-dicicco commented 5 days ago

Thanks for the reply @nv-guomingz @skyCreateXian.

I was able to modify LLaMa's convert_checkpoint.py adding Medusa weights, but I'm obtaining very poor inference performance, in stark contrast to my experience with Mistral 7B.

I proceeded in this way:

Created a custom MedusaForCausalLm.from_hugging_face() really similar to the one of LLaMAForCausalLM, that also updates the dict weights of the base model (Mixtral 8x7B) with the Medusa weights provided by load_medusa_hf() and instantiates and loads the weights into a MedusaForCausalLm object (taking also cares to create the right MedusaConf with the MedusaForCausalLM architecture field).
During engine creation with trtllm-build I needed to force silu as hidden_act of the medusa heads here, rather than taking its value from the model conf. This was needed since during the conversion of Mixtral in the previous step, TRT-LLM put swiglu as hidden_act rather than keeping the original silu value, that is incompatible with the Medusa heads.

To be sure the model is behaving in the expected way, I followed the Debug on E2E Models doc, marking as debugging output the medusa logits printing their prediction at each step, e.g running:

mpirun -np 2 --allow-run-as-root --oversubscribe \
    python run.py --engine_dir mixtral_instruct_v1_trt11_medusa \
                     --tokenizer_dir mixtral_tokenizer \
                     --max_output_len=14 \
                     --temperature 1.0 \
                     --input_text "[INST] Hello! [/INST]" \
                     --medusa_choices="[[0], [0, 0], [0, 0, 0]]" \
                     --use_py_session \
                     --debug_mode

Output

``` [TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061100 [TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061100 [06/23/2024-17:11:58] [TRT-LLM] [W] Implicitly setting MedusaConfig.skip_loading_weights = True [06/23/2024-17:11:58] [TRT-LLM] [W] Implicitly setting MedusaConfig.mup_width_multiplier = 1.0 [06/23/2024-17:11:58] [TRT-LLM] [I] Set dtype to float16. [06/23/2024-17:11:58] [TRT-LLM] [I] Set bert_attention_plugin to auto. [06/23/2024-17:11:58] [TRT-LLM] [I] Set gpt_attention_plugin to auto. [06/23/2024-17:11:58] [TRT-LLM] [I] Set gemm_plugin to float16. [06/23/2024-17:11:58] [TRT-LLM] [I] Set gemm_swiglu_plugin to None. [06/23/2024-17:11:58] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None. [06/23/2024-17:11:58] [TRT-LLM] [I] Set identity_plugin to None. [06/23/2024-17:11:58] [TRT-LLM] [I] Set layernorm_quantization_plugin to None. [06/23/2024-17:11:58] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None. [06/23/2024-17:11:58] [TRT-LLM] [I] Set nccl_plugin to float16. [06/23/2024-17:11:58] [TRT-LLM] [I] Set lookup_plugin to None. [06/23/2024-17:11:58] [TRT-LLM] [I] Set lora_plugin to None. [06/23/2024-17:11:58] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to None. [06/23/2024-17:11:58] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to float16. [06/23/2024-17:11:58] [TRT-LLM] [I] Set quantize_per_token_plugin to False. [06/23/2024-17:11:58] [TRT-LLM] [I] Set quantize_tensor_plugin to False. [06/23/2024-17:11:58] [TRT-LLM] [I] Set moe_plugin to auto. [06/23/2024-17:11:58] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto. [06/23/2024-17:11:58] [TRT-LLM] [I] Set context_fmha to True. [71/1686] [06/23/2024-17:11:58] [TRT-LLM] [I] Set context_fmha_fp32_acc to False. [06/23/2024-17:11:58] [TRT-LLM] [I] Set paged_kv_cache to True. [06/23/2024-17:11:58] [TRT-LLM] [I] Set remove_input_padding to True. [06/23/2024-17:11:58] [TRT-LLM] [I] Set use_custom_all_reduce to False. [06/23/2024-17:11:58] [TRT-LLM] [I] Set reduce_fusion to False. [06/23/2024-17:11:58] [TRT-LLM] [I] Set multi_block_mode to False. [06/23/2024-17:11:58] [TRT-LLM] [I] Set enable_xqa to True. [06/23/2024-17:11:58] [TRT-LLM] [I] Set attention_qk_half_accumulation to False. [06/23/2024-17:11:58] [TRT-LLM] [I] Set tokens_per_block to 64. [06/23/2024-17:11:58] [TRT-LLM] [I] Set use_paged_context_fmha to False. [06/23/2024-17:11:58] [TRT-LLM] [I] Set use_fp8_context_fmha to False. [06/23/2024-17:11:58] [TRT-LLM] [I] Set multiple_profiles to False. [06/23/2024-17:11:58] [TRT-LLM] [I] Set paged_state to True. [06/23/2024-17:11:58] [TRT-LLM] [I] Set streamingllm to False. [06/23/2024-17:11:58] [TRT-LLM] [W] Implicitly setting MedusaConfig.skip_loading_weights = True [06/23/2024-17:11:58] [TRT-LLM] [W] Implicitly setting MedusaConfig.mup_width_multiplier = 1.0 [06/23/2024-17:11:58] [TRT-LLM] [I] Set dtype to float16. [06/23/2024-17:11:58] [TRT-LLM] [I] Set bert_attention_plugin to auto. [06/23/2024-17:11:58] [TRT-LLM] [I] Set gpt_attention_plugin to auto. [06/23/2024-17:11:58] [TRT-LLM] [I] Set gemm_plugin to float16. [06/23/2024-17:11:58] [TRT-LLM] [I] Set gemm_swiglu_plugin to None. [06/23/2024-17:11:58] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None. [06/23/2024-17:11:58] [TRT-LLM] [I] Set identity_plugin to None. [06/23/2024-17:11:58] [TRT-LLM] [I] Set layernorm_quantization_plugin to None. [06/23/2024-17:11:58] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None. [06/23/2024-17:11:58] [TRT-LLM] [I] Set nccl_plugin to float16. [06/23/2024-17:11:58] [TRT-LLM] [I] Set lookup_plugin to None. [06/23/2024-17:11:58] [TRT-LLM] [I] Set lora_plugin to None. [06/23/2024-17:11:58] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to None. [06/23/2024-17:11:58] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to float16. [06/23/2024-17:11:58] [TRT-LLM] [I] Set quantize_per_token_plugin to False. [06/23/2024-17:11:58] [TRT-LLM] [I] Set quantize_tensor_plugin to False. [06/23/2024-17:11:58] [TRT-LLM] [I] Set moe_plugin to auto. [06/23/2024-17:11:58] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto. [06/23/2024-17:11:58] [TRT-LLM] [I] Set context_fmha to True. [06/23/2024-17:11:58] [TRT-LLM] [I] Set context_fmha_fp32_acc to False. [06/23/2024-17:11:58] [TRT-LLM] [I] Set paged_kv_cache to True. [06/23/2024-17:11:58] [TRT-LLM] [I] Set remove_input_padding to True. [06/23/2024-17:11:58] [TRT-LLM] [I] Set use_custom_all_reduce to False. [06/23/2024-17:11:58] [TRT-LLM] [I] Set reduce_fusion to False. [06/23/2024-17:11:58] [TRT-LLM] [I] Set multi_block_mode to False. [06/23/2024-17:11:58] [TRT-LLM] [I] Set enable_xqa to True. [06/23/2024-17:11:58] [TRT-LLM] [I] Set attention_qk_half_accumulation to False. [06/23/2024-17:11:58] [TRT-LLM] [I] Set tokens_per_block to 64. [06/23/2024-17:11:58] [TRT-LLM] [I] Set use_paged_context_fmha to False. [06/23/2024-17:11:58] [TRT-LLM] [I] Set use_fp8_context_fmha to False. [06/23/2024-17:11:58] [TRT-LLM] [I] Set multiple_profiles to False. [06/23/2024-17:11:58] [TRT-LLM] [I] Set paged_state to True. [06/23/2024-17:11:58] [TRT-LLM] [I] Set streamingllm to False. [06/23/2024-17:12:03] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 22948 (MiB) [06/23/2024-17:12:03] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 22948 (MiB) [06/23/2024-17:12:03] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 22948 (MiB) [06/23/2024-17:12:03] [TRT-LLM] [W] The paged KV cache in Python runtime is experimental. For performance and correctness, please, use C++ runtime. [06/23/2024-17:12:03] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 22948 (MiB) [06/23/2024-17:12:03] [TRT-LLM] [W] The paged KV cache in Python runtime is experimental. For performance and correctness, please, use C++ runtime. [06/23/2024-17:12:03] [TRT-LLM] [I] Load engine takes: 15.598329305648804 sec [06/23/2024-17:12:03] [TRT-LLM] [I] Load engine takes: 15.83333158493042 sec Step: 0 ==================== logits: 0 [22557, 0, 0, 0] - ['▁Hello', '', '', ''] medusa_heads: 0 [28808, 661, 28742, 0] - ['!', '▁It', "'", ''] 1 [0, 0, 0, 0] - ['', '', '', ''] 2 [0, 0, 0, 0] - ['', '', '', ''] Step: 1 ==================== logits: 0 [28808, 661, 28742, 28713] - ['!', '▁It', "'", 's'] medusa_heads: 0 [661, 28742, 28713, 5171] - ['▁It', "'", 's', '▁nice'] 1 [28742, 315, 5171, 298] - ["'", '▁I', '▁nice', '▁to'] 2 [28713, 5171, 298, 2647] - ['s', '▁nice', '▁to', '▁meet'] new_tokens: tensor([[28808, 661, 28742, 28713]], device='cuda:0', dtype=torch.int32) - ['!', '▁It', "'", 's'] Step: 2 ==================== logits: 0 [5171, 298, 2647, 368] - ['▁nice', '▁to', '▁meet', '▁you'] medusa_heads: 0 [298, 2647, 368, 28723] - ['▁to', '▁meet', '▁you', '.'] 1 [2647, 368, 28723, 1602] - ['▁meet', '▁you', '.', '▁How'] 2 [368, 28723, 1602, 736] - ['▁you', '.', '▁How', '▁there'] new_tokens: tensor([[5171, 298, 2647, 368]], device='cuda:0', dtype=torch.int32) - ['▁nice', '▁to', '▁meet', '▁you'] Step: 3 ==================== logits: 0 [28723, 1602, 541, 541] - ['.', '▁How', '▁can', '▁can'] medusa_heads: 0 [1602, 736, 315, 541] - ['▁How', '▁there', '▁I', '▁can'] 1 [736, 315, 1316, 541] - ['▁there', '▁I', '▁help', '▁can'] 2 [1545, 1316, 368, 6926] - ['▁something', '▁help', '▁you', '▁Where'] new_tokens: tensor([[28723, 1602, 541]], device='cuda:0', dtype=torch.int32) - ['.', '▁How', '▁can'] Step: 4 ==================== logits: 0 [315, 1316, 368, 3154] - ['▁I', '▁help', '▁you', '▁today'] medusa_heads: 0 [1316, 368, 3154, 28804] - ['▁help', '▁you', '▁today', '?'] 1 [368, 3154, 28804, 1691] - ['▁you', '▁today', '?', '▁Is'] 2 [3154, 28804, 1691, 736] - ['▁today', '?', '▁Is', '▁there'] new_tokens: tensor([[ 315, 1316, 368, 3154]], device='cuda:0', dtype=torch.int32) - ['▁I', '▁help', '▁you', '▁today'] Input [Text 0]: " ~~[INST] Hello! [/INST]" Output [Text 0 Beam 0]: "Hello! It's nice to meet you. How can I help" ```~~

Note:

At each generation step, I'm printing the argmax of both the Mixtral logits and the 3 medusa_heads, with also the new_tokens produced in the step

We requested a max_output_len=14 of tokens, and the request was fulfilled in "just" 5 generation steps (so the heads are working correctly in this example)

Despite this, if we try to benchmark the model with and without medusa, I'm obtaining very poor performance:

Benchmark C++ session w/ Medusa: latency 0.245 sec
``` mpirun -np 2 --allow-run-as-root --oversubscribe \ python run.py --engine_dir mixtral_instruct_v1_trt11_medusa \ --tokenizer_dir mixtral_tokenizer \ --max_output_len=14 \ --temperature 1.0 \ --input_text "[INST] Hello! [/INST]" \ --medusa_choices="[[0], [0, 0], [0, 0, 0]]" \ --run_profiling [TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061100 [TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061100 [06/23/2024-17:35:22] [TRT-LLM] [I] Load engine takes: 19.492470264434814 sec [06/23/2024-17:35:22] [TRT-LLM] [I] Load engine takes: 19.49226713180542 sec batch_size: 1, avg latency of 10 iterations: : 1.5497207641601562e-05 sec Input [Text 0]: " ~~[INST] Hello! [/INST]" Output [Text 0 Beam 0]: "Hello! It's nice to meet you. How can How can" batch_size: 1, avg latency of 10 iterations: : 0.2458343505859375 sec ```~~

Benchmark C++ session w/o Medusa: latency 0.226 sec
``` mpirun -np 2 --allow-run-as-root --oversubscribe \ python run.py --engine_dir mixtral_instruct_v1_trt11 \ --tokenizer_dir mixtral_tokenizer \ --max_output_len=14 \ --temperature 1.0 \ --input_text "[INST] Hello! [/INST]" \ --run_profiling [TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061100 [TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061100 [06/23/2024-17:36:32] [TRT-LLM] [I] Load engine takes: 18.678980827331543 sec [06/23/2024-17:36:32] [TRT-LLM] [I] Load engine takes: 18.678552389144897 sec batch_size: 1, avg latency of 10 iterations: : 1.5282630920410155e-05 sec Input [Text 0]: " ~~[INST] Hello! [/INST]" Output [Text 0 Beam 0]: "Hello! It's nice to meet you. How can I help" batch_size: 1, avg latency of 10 iterations: : 0.22646000385284423 sec ```~~

The latency of the model without Medusa is lower despite the one with Medusa requires just 5 generation steps rather than 14.

What can be the cause? I'm using 2xA40 (48GB) with Mixtral int8 and TP 2.

NVIDIA / TensorRT-LLM

Medusa with Mixtral 8x7B #1798