FMInference / FlexLLMGen

Running large language models on a single GPU for throughput-oriented scenarios.
Apache License 2.0
9.21k stars 549 forks

How to split execution of prefill and decode for Flexgen? #138

Open sunchaesk opened 4 months ago

sunchaesk commented 4 months ago

I was wondering what would be the best way to split FlexGen's execution into separate prefill-only and decode-only runs.

How should I save the values produced by prefill, and how should I load them when running FlexGen again for decode only?
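For context, here is the general pattern I have in mind. This is only a toy sketch, not FlexGen's actual API: it uses a made-up per-layer key/value cache (plain Python lists) persisted with `pickle` to show the prefill-then-decode split across two runs.

```python
# Hypothetical sketch of splitting prefill and decode into two runs.
# NOT FlexGen's internal API: the model, cache layout, and token handling
# here are toy stand-ins for illustration only.
import os
import pickle
import tempfile

NUM_LAYERS = 2  # assumed toy depth

def prefill(prompt_tokens):
    """Run the prompt once, returning a per-layer key/value cache.

    A real model would store attention keys and values per layer; here we
    just copy the token ids to keep the sketch self-contained.
    """
    return {layer: {"keys": list(prompt_tokens), "values": list(prompt_tokens)}
            for layer in range(NUM_LAYERS)}

def save_cache(cache, path):
    """Persist the cache so a later decode-only run can resume from it."""
    with open(path, "wb") as f:
        pickle.dump(cache, f)

def load_cache(path):
    """Reload the cache saved by the prefill-only run."""
    with open(path, "rb") as f:
        return pickle.load(f)

def decode_step(cache, new_token):
    """Append one generated token's entries to every layer's cache."""
    for layer_cache in cache.values():
        layer_cache["keys"].append(new_token)
        layer_cache["values"].append(new_token)
    return new_token  # a real model would sample the next token here

# First run: prefill only, then persist the cache to disk.
path = os.path.join(tempfile.mkdtemp(), "kv_cache.pkl")
cache = prefill([1, 2, 3])
save_cache(cache, path)

# Second run: reload the cache and continue with decode only.
cache = load_cache(path)
decode_step(cache, 4)
print(len(cache[0]["keys"]))  # → 4
```

My question is what the FlexGen-specific equivalent of `save_cache`/`load_cache` would be, given how it offloads the KV cache across GPU, CPU, and disk.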

Thanks