TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

Error occurred when running medusa inference. #1575

Open littletomatodonkey opened 2 months ago

littletomatodonkey commented 2 months ago

Hi, when I use Medusa decoding on trtllm-090 with profiling, the following error occurs. Could you please take a look? Thanks!

If I do not use --run_profiling, inference runs normally.

  File "/opt/tiger/miniconda3/lib/python3.10/site-packages/tensorrt_llm/runtime/generation.py", line 2431, in handle_per_step
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[05/10/2024-21:45:33] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::55] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:345] Error 700 destroying stream '0x56052ab0a0f0'.)
[05/10/2024-21:45:33] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::55] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:345] Error 700 destroying stream '0x56052ab69e70'.)
[05/10/2024-21:45:33] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::55] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:345] Error 700 destroying stream '0x56052abb95f0'.)
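
The error message above notes that the stack trace may be inaccurate and suggests passing CUDA_LAUNCH_BLOCKING=1. A minimal sketch of rerunning a command with that variable set is shown here; the helper name `run_with_blocking_launches` is hypothetical and not part of TensorRT-LLM, and the child command is a trivial stand-in for the real `run.py` invocation.

```python
import os
import subprocess
import sys


def run_with_blocking_launches(cmd):
    """Run cmd with CUDA_LAUNCH_BLOCKING=1 added to the current environment.

    Hypothetical helper for illustration only. With blocking launches,
    a CUDA kernel error is reported at the call that triggered it
    instead of at some later, unrelated API call.
    """
    env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Stand-in child process that only echoes the variable back.
# For the real reproduction, pass the run.py command as a list instead,
# e.g. ["python", "../run.py", "--engine_dir", trt_model_dir, ...].
result = run_with_blocking_launches(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])"]
)
print(result.stdout.strip())
```

The same effect can be had by prefixing the shell command with the variable assignment; the wrapper is only convenient when the reproduction is driven from a Python script.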

tmp_dir=$(mktemp -d)


python convert_checkpoint.py \
--model_dir "${model_dir}" \
--medusa_model_dir "${medusa_model_dir}" \
--output_dir "${tmp_dir}" \
--dtype float16 \
--fixed_num_medusa_heads 4

trtllm-build \
--checkpoint_dir ${tmp_dir} \
--output_dir ${trt_model_dir} \
--gemm_plugin float16 \
--remove_input_padding enable \
--context_fmha enable \
--gpt_attention_plugin float16 \
--max_batch_size 16 \
--max_input_len 4096 \
--max_output_len 1024 \
--paged_kv_cache enable \
--use_paged_context_fmha enable

cp -r ${model_dir}/*token* ${trt_model_dir}/


python ../run.py \
--engine_dir ${trt_model_dir} \
--tokenizer_dir ${trt_model_dir} \
--max_output_len=100 \
--medusa_choices="[[0], [0, 0], [1], [0, 1], [2], [0, 0, 0], [1, 0], [0, 2], [3], [0, 3], [4], [0, 4], [2, 0], [0, 5], [0, 0, 1], [5], [0, 6], [6], [0, 7], [0, 1, 0], [1, 1], [7], [0, 8], [0, 0, 2], [3, 0], [0, 9], [8], [9], [1, 0, 0], [0, 2, 0], [1, 2], [0, 0, 3], [4, 0], [2, 1], [0, 0, 4], [0, 0, 5], [0, 0, 0, 0], [0, 1, 1], [0, 0, 6], [0, 3, 0], [5, 0], [1, 3], [0, 0, 7], [0, 0, 8], [0, 0, 9], [6, 0], [0, 4, 0], [1, 4], [7, 0], [0, 1, 2], [2, 0, 0], [3, 1], [2, 2], [8, 0], [0, 5, 0], [1, 5], [1, 0, 1], [0, 2, 1], [9, 0], [0, 6, 0], [0, 0, 0, 1], [1, 6], [0, 7, 0]]" \
--use_py_session \
--temperature 1.0 \
--input_text "Once upon" \
--run_profiling
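
The run command passes a long --medusa_choices tree to an engine built with --fixed_num_medusa_heads 4. A small sanity check on such a string can rule out a malformed tree before debugging the runtime. This is a sketch, not TensorRT-LLM code: the function name `check_medusa_choices` is hypothetical, and it assumes that each path may be at most as long as the number of Medusa heads and that every path's parent prefix must also be listed.

```python
import ast


def check_medusa_choices(choices_text, num_medusa_heads):
    """Return a list of problems found in a --medusa_choices string.

    Hypothetical helper for illustration. An empty list means no
    problem was found under the assumptions stated in the comments.
    """
    choices = [tuple(path) for path in ast.literal_eval(choices_text)]
    known = set(choices)
    problems = []
    for path in choices:
        if len(path) == 0:
            problems.append("empty path")
        # Assumption: position i in a path indexes head i, so a path
        # cannot be longer than the number of Medusa heads.
        if len(path) > num_medusa_heads:
            problems.append(f"{list(path)} is deeper than {num_medusa_heads} heads")
        # Assumption: a node is only reachable if its parent is listed too.
        if len(path) > 1 and path[:-1] not in known:
            problems.append(f"{list(path)} has no parent {list(path[:-1])}")
    return problems


# Short sample in the same format as the --medusa_choices argument.
sample = "[[0], [0, 0], [1], [0, 1], [0, 0, 0], [0, 0, 0, 0]]"
print(check_medusa_choices(sample, num_medusa_heads=4))
```

Passing the full string from the command above with `num_medusa_heads=4` applies the same checks to the reported configuration.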

dongxuy04 commented 2 months ago

Hi, I tried with the latest main and it seems OK. Could you please try with that? Thanks! BTW, with the latest main, the C++ runtime can also be used by removing --use_py_session.

nv-guomingz commented 1 month ago

Hi @littletomatodonkey, do you still encounter this issue with @dongxuy04's suggestion? If not, I'll close this ticket.