SoundProvider opened this issue 1 day ago
@SoundProvider, could you also show the command to convert the checkpoint?
```shell
DEVICES=0,1,2,3
TP_SIZE=4
BATCH_SIZE=4

CUDA_VISIBLE_DEVICES=${DEVICES} \
python /app/tensorrt_llm/examples/medusa/convert_checkpoint.py \
    --model_dir /app/models/vicuna-33b-v1.3 \
    --medusa_model_dir /app/models/medusa-vicuna-33b-v1.3 \
    --output_dir /app/models/medusa_test/tensorrt/${TP_SIZE}-gpu \
    --dtype float16 \
    --num_medusa_heads 4 \
    --tp_size ${TP_SIZE}

CUDA_VISIBLE_DEVICES=${DEVICES} \
trtllm-build --checkpoint_dir /app/models/medusa_test/tensorrt/${TP_SIZE}-gpu \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --context_fmha enable \
    --output_dir /app/models/medusa_test/tensorrt_llm/${TP_SIZE}-gpu \
    --speculative_decoding_mode medusa \
    --max_batch_size ${BATCH_SIZE} \
    --workers ${TP_SIZE}
```
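After the build succeeds, the engine would be exercised with the example's run script. A hedged sketch, assuming the standard `examples/run.py` entry point of the repository checkout above and its `--medusa_choices` flag; the choices tree shown is a small illustrative placeholder, not one tuned for these heads:

```shell
# Run the TP=4 Medusa engine built above. Paths follow the convert/build
# commands; the medusa_choices tree is an illustrative placeholder -- use
# the tree that matches your trained heads.
CUDA_VISIBLE_DEVICES=${DEVICES} \
mpirun -np ${TP_SIZE} --allow-run-as-root \
  python /app/tensorrt_llm/examples/run.py \
    --engine_dir /app/models/medusa_test/tensorrt_llm/${TP_SIZE}-gpu \
    --tokenizer_dir /app/models/vicuna-33b-v1.3 \
    --max_output_len 100 \
    --medusa_choices "[[0], [0, 0], [1], [0, 1], [2]]" \
    --input_text "Once upon a time,"
```

With tensor parallelism greater than 1, the engine must be launched under `mpirun` with one rank per GPU, which is why `-np` matches `TP_SIZE`.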
@hello-11 I used the Medusa example here.
Thank you for developing trt-llm; it's helping me a lot. I'm trying to use Medusa with trt-llm, referencing this page.
It works fine with Vicuna 7B and its Medusa heads, with no errors at all.
However, with Vicuna 33B and its trained heads, the following error occurs when executing `trtllm-build`.
Converting the checkpoint with Medusa completed with the following result