Open poedator opened 2 months ago
Question 1: In the EAGLE-1 paper, Table 7 reports a 1.97x throughput figure for Vicuna-7B. How exactly was this measured (GPU, batch size, testing code)? How would this differ when generating with temperature=1?
The specific settings can be found in Section 4.4 of our paper, and the code is on the v1 branch. As with other speculative sampling methods, performance at temperature=1 will be slightly worse than at temperature=0.
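To see why temperature=1 costs a little speed, consider the standard speculative-sampling acceptance rule (a draft token is kept with probability min(1, p/q)). The sketch below is illustrative only, not EAGLE's actual code; the distributions and seed are made up for the example. At temperature=0 the target distribution is a point mass, so a draft token is either always accepted or always rejected; at temperature=1 acceptance becomes probabilistic, which slightly lowers the average accepted length per step.

```python
import random

def accept_draft_token(p_target, q_draft, token):
    """Standard speculative-sampling acceptance rule: keep the draft
    token with probability min(1, p_target[token] / q_draft[token]).
    This preserves the target model's output distribution exactly."""
    ratio = p_target[token] / max(q_draft[token], 1e-12)
    return random.random() < min(1.0, ratio)

# Toy distributions over a 3-token vocabulary (hypothetical numbers):
p = [0.7, 0.2, 0.1]   # target model probabilities
q = [0.5, 0.4, 0.1]   # draft model probabilities

random.seed(0)
hits = sum(accept_draft_token(p, q, 0) for _ in range(10_000))
print(hits)  # token 0 is always accepted here, since p[0] >= q[0]
```

On rejection, the usual scheme resamples from the normalized residual max(0, p - q), so correctness is preserved either way; only the expected number of accepted draft tokens drops.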
Question 2: Is EAGLE-2 compatible with batch generation, specifically including tree attention and temperature=1? If so, please share a code example like the one for EAGLE-1, and benchmarks like those in the EAGLE-1 paper (Section 4.4, Table 7).
EAGLE-2 currently does not support batch generation.
Question 3: Were there any tests of EAGLE/EAGLE-2 in setups with compilation or faster serving frameworks, such as TensorRT? How would the speedup figures change versus those reported in the papers, especially in the batched setup?
Integration with other frameworks is a significant amount of work, and it is part of our future plans.
Hello, I want to ask whether the EAGLE-2 method can generate in batches. Is the method itself unsuitable for batch generation?