Open · SoundProvider opened this issue 7 hours ago
@SoundProvider could you tell me the method of your performance evaluations?
@hello-11 Hello. I used the run script in the Medusa example folder:
python /app/tensorrt_llm/examples/run.py --engine_dir /app/models/medusa_test_3b/tensorrt_llm/4-gpu \
--tokenizer_dir /app/models/vicuna-33b-v1.3 \
--max_output_len=500 \
--medusa_choices="[[0], [0, 0], [1], [0, 1], [2], [0, 0, 0], [1, 0], [0, 2], [3], [0, 3], [4], [0, 4], [2, 0], [0, 5], [0, 0, 1], [5], [0, 6], [6], [0, 7], [0, 1, 0], [1, 1], [7], [0, 8], [0, 0, 2], [3, 0], [0, 9], [8], [9], [1, 0, 0], [0, 2, 0], [1, 2], [0, 0, 3], [4, 0], [2, 1], [0, 0, 4], [0, 0, 5], [0, 0, 0, 0], [0, 1, 1], [0, 0, 6], [0, 3, 0], [5, 0], [1, 3], [0, 0, 7], [0, 0, 8], [0, 0, 9], [6, 0], [0, 4, 0], [1, 4], [7, 0], [0, 1, 2], [2, 0, 0], [3, 1], [2, 2], [8, 0], [0, 5, 0], [1, 5], [1, 0, 1], [0, 2, 1], [9, 0], [0, 6, 0], [0, 0, 0, 1], [1, 6], [0, 7, 0]]" \
--temperature 1.0 \
--input_text "Once upon" \
--run_profiling
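For the 4-batch test mentioned further down, I repeated the same prompt to form the batch, roughly like this. This is just a sketch: it assumes run.py accepts multiple --input_text values (recent TensorRT-LLM releases do; older ones may need --input_file instead), and the medusa_choices list is shortened to a placeholder since it is identical to the one above.

```sh
# Same engine, tokenizer, and medusa_choices as the single-prompt command above;
# the only change is passing the prompt four times so the batch size becomes 4.
python /app/tensorrt_llm/examples/run.py --engine_dir /app/models/medusa_test_3b/tensorrt_llm/4-gpu \
    --tokenizer_dir /app/models/vicuna-33b-v1.3 \
    --max_output_len=500 \
    --medusa_choices="<same medusa_choices list as above>" \
    --temperature 1.0 \
    --input_text "Once upon" "Once upon" "Once upon" "Once upon" \
    --run_profiling
```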
I'm trying to use Medusa with TRT-LLM, referencing this page.
It works fine with Vicuna 7B and its Medusa heads, as described on the example page.
In the example, it is stated that:
Note: Increasing the batch size may have a negative impact on performance
My understanding is that when the batch size increases, each sequence has to wait for the other sequences in the batch to catch up to its position, so performance degrades. But when I tested with Vicuna 7B, performance still dropped with a batch of 4 even though every sequence used the same input, which contradicts my understanding.
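To make that hypothesis concrete, here is a toy model of it (plain Python I wrote for illustration, not TRT-LLM code, and the acceptance counts are made up): if the whole batch could only advance by the smallest number of Medusa tokens accepted by any sequence in a step, the expected tokens per step would shrink as the batch grows. But with identical inputs every sequence should accept the same tokens each step, so under this model the slowdown shouldn't appear.

```python
import random

# Toy model of the "wait for the slowest sequence" assumption, NOT how TRT-LLM
# actually schedules Medusa: each sequence accepts a random number of draft
# tokens per step, and the batch only advances by the minimum acceptance.
def avg_tokens_per_step(batch_size, steps=10000, max_accept=4, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(steps):
        accepted = [rng.randint(0, max_accept) for _ in range(batch_size)]
        total += 1 + min(accepted)  # 1 regular token + jointly accepted drafts
    return total / steps

for bs in (1, 2, 4, 8):
    print(f"batch_size={bs}: ~{avg_tokens_per_step(bs):.2f} tokens/step")
```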
To be specific, I varied only the batch size and kept the inputs identical (batch of 4, same prompt for every sequence).
What would be the reason? It would be really nice if someone could explain.
Thank you