Plans for paper or technical report

apoorvumang / prompt-lookup-decoding

Plans for paper or technical report #3

Open shermansiu opened 10 months ago

shermansiu commented 10 months ago

Apoorv, do you have plans for a paper or a technical report for prompt lookup decoding?

I know you've indicated that people should cite your GitHub repo, but it would be nice to have something out there with more extensive experiments across a variety of datasets, models, model sizes, and hardware types (e.g. CPU/GPU, various types of GPUs). Moreover, it would be nice to have a side-by-side comparison between prompt lookup decoding and other similar methods.

apoorvumang commented 10 months ago

Yes, we have something in the works :)

Are there certain experiments/comparisons that you would be most interested in?

shermansiu commented 10 months ago

In terms of methods to compare:

PLD (yours)
EAGLE
LADE
Medusa
Speculative Decoding

Metrics to compare:

Compression ratio
Impact on VRAM (I hate OOMs. PLD is VRAM-friendly)
The real (wall-clock) speedup, as opposed to user time, because that's what users will actually experience. It varies with each user's hardware, but we need to show that the speedup is close to the compression ratio.
Whether quantization affects performance (e.g. OPTQ, AWQ)
Impact on creative writing tasks at temperature ~0.7-0.8, where the output may repeat fewer n-grams from the prompt
Latency impact on long prompts, which are already bottlenecked by the quadratic FLOP cost of attention
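
For readers following along, here is a minimal Python sketch of two ideas from the list above: why repeated n-grams in the prompt matter, and how the compression ratio and wall-clock speedup could be computed. It is an illustration under stated assumptions, not the repository's implementation. The function names and default parameters are hypothetical. "Compression ratio" is assumed to mean tokens generated per model forward pass, and the speedup is assumed to be baseline wall-clock time divided by the method's wall-clock time.

```python
from typing import List, Sequence


def find_candidate_tokens(
    tokens: Sequence[int], ngram_size: int = 3, num_candidates: int = 10
) -> List[int]:
    """Propose draft tokens by looking up the trailing n-gram earlier in `tokens`.

    Hypothetical helper: returns the tokens that followed the most recent
    earlier occurrence of the last `ngram_size` tokens, or [] if there is none.
    """
    if len(tokens) <= ngram_size:
        return []
    tail = list(tokens[-ngram_size:])
    # Scan right to left, starting just before the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            return list(tokens[follow:follow + num_candidates])
    return []


def compression_ratio(tokens_generated: int, forward_passes: int) -> float:
    """Assumed definition: average number of tokens produced per forward pass."""
    return tokens_generated / forward_passes


def wall_clock_speedup(baseline_seconds: float, method_seconds: float) -> float:
    """Assumed definition: baseline wall-clock time over the method's wall-clock time."""
    return baseline_seconds / method_seconds
```

If the trailing n-gram never repeats, as may happen more often in creative writing, the lookup returns no candidates and decoding falls back to one token per forward pass, so the compression ratio approaches 1. Comparing `compression_ratio` with `wall_clock_speedup` on the same run shows how much of the step reduction is lost to per-step overhead.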

Hardware:

CPU (e.g. using the llama.cpp implementation of PLD)
Consumer GPUs (e.g. RTX 2060 through 4090)
A100s or H100s, if you have access to some

apoorvumang commented 10 months ago

Thanks! We will definitely look into some of these.

shermansiu commented 10 months ago

And FYI, LADE actually causes a slowdown under its default settings on an RTX 3090; its parameters need to be made less aggressive to get even a mild speedup.

shermansiu commented 10 months ago

(The LADE authors know about this: it was brought up by Joao Gante from the Hugging Face staff and, independently, by another user on their GitHub repo.)

iofu728 commented 3 weeks ago

Thank you for your excellent idea. However, I'd like to point out that this concept may be very similar to "Aggressive Decoding" (https://arxiv.org/pdf/2106.04970).