FMInference / FlexLLMGen

Running large language models on a single GPU for throughput-oriented scenarios.
Apache License 2.0

3090 #4

Closed · random452 closed this issue 1 year ago

random452 commented 1 year ago

Hello, I have a 3090. How fast can I run Erebus 30B if I use FlexGen with compression?

Ying1123 commented 1 year ago

I haven't tested that, but as a rough estimate, a 3090 should be about 1.5~2x faster than an NVIDIA T4. Note that CPU memory also plays an important role: you can expect a slowdown with less CPU memory. You are welcome to post the statistics if you try it out. :)
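For a concrete starting point, something like the following should work (a sketch based on the FlexGen README examples; using facebook/opt-30b as a stand-in for Erebus 30B, which is an OPT fine-tune, and the exact --percent offloading split are assumptions you may need to tune for a 24 GB 3090):

```bash
# --percent takes six numbers: weight GPU/CPU %, KV cache GPU/CPU %,
# activation GPU/CPU %. Here: weights fully offloaded to CPU RAM,
# cache and activations kept on the GPU, with 4-bit weight compression.
python3 -m flexgen.flex_opt --model facebook/opt-30b \
  --percent 0 100 100 0 100 0 \
  --compress-weight
```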

random452 commented 1 year ago

Thanks, I will try. Is 64 GB enough for 30B, or should I get 128?

Ying1123 commented 1 year ago

It is better not to be that tight; you will need additional space for the KV cache. There is also an option called --pin-weight. Turning it on makes offloading faster, but it causes CPU memory usage of about 2x the size of the model weights. So if your CPU memory can only hold 1x the model weights, turn off --pin-weight. A rough sketch of the arithmetic is below.
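To make the sizing concrete (back-of-the-envelope approximations, not measured figures): 30B parameters at fp16 take roughly 60 GB, so with pinned weights you would want on the order of 120 GB of CPU RAM, which overflows a 64 GB machine. A hedged sketch of the corresponding invocation:

```bash
# ~30B params x 2 bytes (fp16) ≈ 60 GB of weights offloaded to CPU RAM.
# With weight pinning enabled (the default), pinned buffers roughly double
# that to ~120 GB, which does not fit in 64 GB — so disable pinning:
python3 -m flexgen.flex_opt --model facebook/opt-30b \
  --percent 0 100 100 0 100 0 \
  --compress-weight --pin-weight 0
```

Note that --compress-weight (group-wise 4-bit quantization) also shrinks the CPU-side weight footprint considerably, which leaves more headroom for the KV cache on a 64 GB machine.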