I am researching a method to generate texts with a single call to a decoder-only CLM (like BLOOM, OPT, GPT-3...), so I will not need the KV cache. However, I still want to benefit from FlexGen's offloading and compression.
Can I do that? If so, how?
Sure, you can do that! FlexGen's generation interface is similar to Hugging Face's: you can just feed your prompts into it. See the examples here.
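As a rough sketch of what "feed your prompts into it" could look like: the helper below assumes a model object exposing a Hugging Face-style `generate` method (as FlexGen's OPT wrapper does in its completion example) and a tokenizer with `batch_decode`. The function name `generate_single_call` and the exact keyword arguments shown are illustrative assumptions, not FlexGen's documented API.

```python
def generate_single_call(model, tokenizer, prompts, max_new_tokens=32):
    """Batch the prompts and produce completions with one generate call.

    `model` is assumed to expose a Hugging Face-style `generate` method;
    `tokenizer` is assumed to be callable on a list of prompts and to
    provide `batch_decode`, as Hugging Face tokenizers do.
    """
    # Tokenize all prompts together so a single batched call suffices.
    inputs = tokenizer(prompts, padding=True)
    # One call to generate; greedy decoding keeps the example deterministic.
    output_ids = model.generate(
        inputs["input_ids"],
        do_sample=False,
        max_new_tokens=max_new_tokens,
    )
    # Decode the generated token ids back into text.
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

With a FlexGen model in place of `model`, offloading and compression are configured at model-construction time (via its policy), so nothing about this call pattern changes.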