FMInference / FlexLLMGen

Running large language models on a single GPU for throughput-oriented scenarios.
Apache License 2.0

Can I use FlexGen's offloading and compression without caching? #27

Closed yonikremer closed 1 year ago

yonikremer commented 1 year ago

I am researching a method to generate texts with a single call to a decoder-only CLM (such as BLOOM, OPT, or GPT-3), so I will not need the KV cache. I would still like to benefit from FlexGen's offloading and compression. Can I do that, and if so, how?

yonikremer commented 1 year ago

I also have limited disk space

Ying1123 commented 1 year ago

Sure, you can do that! FlexGen's generation interface is similar to Hugging Face's, so you can just feed your prompts into it. See the example here:

https://github.com/FMInference/FlexGen/blob/0342e2a0e93593b2c11f84be0e9f5d5bcb73e598/apps/chatbot.py#L60-L66
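For reference, here is a minimal sketch of that pattern, adapted from apps/chatbot.py. The model name, offload directory, weight path, policy percentages, and compression settings below are illustrative assumptions, not required values; check flexgen.flex_opt for the exact Policy fields in your version.

```python
# Minimal sketch adapted from apps/chatbot.py. Names and numbers here are
# illustrative assumptions; adjust them to your hardware and model.
from transformers import AutoTokenizer
from flexgen.flex_opt import Policy, OptLM, ExecutionEnv, CompressionConfig

# Offloading environment: tensors that do not fit on GPU/CPU can be
# offloaded to this directory on disk.
env = ExecutionEnv.create("~/flexgen_offload_dir")

# Policy: batch size 1, weights split 50/50 between GPU and CPU (nothing
# on disk, since disk space is limited), KV cache and activations kept on
# GPU, and weights compressed with 4-bit group-wise quantization.
policy = Policy(
    1, 1,            # gpu_batch_size, num_gpu_batches
    50, 50,          # weight placement percents: GPU, CPU
    100, 0,          # KV-cache placement percents: GPU, CPU
    100, 0,          # activation placement percents: GPU, CPU
    overlap=True, sep_layer=True, pin_weight=True,
    cpu_cache_compute=False, attn_sparsity=1.0,
    compress_weight=True,
    comp_weight_config=CompressionConfig(
        num_bits=4, group_size=64, group_dim=0, symmetric=False),
    compress_cache=False,
    comp_cache_config=CompressionConfig(
        num_bits=4, group_size=64, group_dim=2, symmetric=False))

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b",
                                          padding_side="left")
model = OptLM("facebook/opt-1.3b", env, "~/opt_weights", policy)

# Feed prompts in, much like a Hugging Face generate() call.
inputs = tokenizer(["The capital of France is"])
output_ids = model.generate(
    inputs.input_ids,
    do_sample=True,
    temperature=0.7,
    max_new_tokens=32)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])

env.close_copy_threads()
```

Since you only need a single forward pass per prompt, you can keep the cache percentages on GPU and leave compress_cache off; the offloading and weight compression in the Policy apply regardless of how the cache is placed.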