FMInference / FlexGen

Running large language models on a single GPU for throughput-oriented scenarios.
Apache License 2.0

Please do not abandon this project! #126

Open oobabooga opened 10 months ago

oobabooga commented 10 months ago

Earlier this year I was impressed with FlexGen's offloading performance, and I wonder how it would compare with what llama.cpp currently provides for Llama and Llama-2 models in a CPU offloading scenario.
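For context, the kind of head-to-head I have in mind is roughly the sketch below. Treat it as an assumption rather than a recipe: the model paths are placeholders, and the flags are taken from each project's README as I remember them.

```python
import subprocess
import time

# Rough harness for the comparison above. FlexGen's --percent takes six
# numbers (weight GPU/CPU %, KV-cache GPU/CPU %, activation GPU/CPU %),
# so "0 100 0 100 0 100" pushes everything to the CPU. FlexGen only
# supports OPT models today, hence this issue. Model paths below are
# placeholders.
commands = {
    "FlexGen, fully offloaded to CPU": [
        "python3", "-m", "flexgen.flex_opt",
        "--model", "facebook/opt-6.7b",
        "--percent", "0", "100", "0", "100", "0", "100",
    ],
    "llama.cpp, CPU only (no layers on GPU)": [
        "./main", "-m", "models/llama-2-7b.q4_0.bin",
        "-p", "Hello", "-n", "128", "--n-gpu-layers", "0",
    ],
}

for name, cmd in commands.items():
    start = time.time()
    subprocess.run(cmd, check=True)
    print(f"{name}: {time.time() - start:.1f}s wall clock")
    # Wall clock includes model loading; for a fair throughput number,
    # use the decode tokens/s that each tool reports itself.
```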

Any chance Llama support could be added to FlexGen @Ying1123 @keroro824?

BinhangYuan commented 10 months ago

We are working on a refactoring of the current implementation to support most HF models. We will release it soon as a fork of this repo and keep you informed.

oobabooga commented 10 months ago

That's exciting news @BinhangYuan! I look forward to testing the new release and incorporating it into my text-generation-webui project. Cheers :)

arnfaldur commented 4 months ago

Is there any news on this fork?