FMInference / FlexLLMGen

Running large language models on a single GPU for throughput-oriented scenarios.
Apache License 2.0
9.21k stars 549 forks

How to split execution of prefill and decode for Flexgen? #138

Open sunchaesk opened 4 months ago

sunchaesk commented 4 months ago

I was wondering what would be the best way to split FlexGen's execution into separate prefill-only and decode-only runs.

How should I save the values produced by prefill, and how should I load them when running FlexGen again for decode only?
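For context, here is the general pattern I have in mind. This is only a toy sketch, not FlexGen's actual API: it uses a made-up per-layer key/value cache (plain Python lists) persisted with `pickle` to show the prefill-then-decode split across two runs.

```python
# Hypothetical sketch of splitting prefill and decode into two runs.
# NOT FlexGen's internal API: the model, cache layout, and token handling
# here are toy stand-ins for illustration only.
import os
import pickle
import tempfile

NUM_LAYERS = 2  # assumed toy depth

def prefill(prompt_tokens):
    """Run the prompt once, returning a per-layer key/value cache.

    A real model would store attention keys and values per layer; here we
    just copy the token ids to keep the sketch self-contained.
    """
    return {layer: {"keys": list(prompt_tokens), "values": list(prompt_tokens)}
            for layer in range(NUM_LAYERS)}

def save_cache(cache, path):
    """Persist the cache so a later decode-only run can resume from it."""
    with open(path, "wb") as f:
        pickle.dump(cache, f)

def load_cache(path):
    """Reload the cache saved by the prefill-only run."""
    with open(path, "rb") as f:
        return pickle.load(f)

def decode_step(cache, new_token):
    """Append one generated token's entries to every layer's cache."""
    for layer_cache in cache.values():
        layer_cache["keys"].append(new_token)
        layer_cache["values"].append(new_token)
    return new_token  # a real model would sample the next token here

# First run: prefill only, then persist the cache to disk.
path = os.path.join(tempfile.mkdtemp(), "kv_cache.pkl")
cache = prefill([1, 2, 3])
save_cache(cache, path)

# Second run: reload the cache and continue with decode only.
cache = load_cache(path)
decode_step(cache, 4)
print(len(cache[0]["keys"]))  # → 4
```

My question is what the FlexGen-specific equivalent of `save_cache`/`load_cache` would be, given how it offloads the KV cache across GPU, CPU, and disk.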

Thanks