elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)
Apache License 2.0

Allow text completion streaming true/false option on a per-call basis #295

Closed brainlid closed 7 months ago

brainlid commented 7 months ago

Currently, the stream: true or stream: false option is set when the serving is created.

Example:

Bumblebee.Text.generation(model_info, tokenizer, generation_config,
  compile: [batch_size: 1, sequence_length: 1028],
  # stream: true,
  stream: false,
  defn_options: [compiler: EXLA, lazy_transfers: :never]
)

If possible, it would be nice to set this option on a per-call basis. Some calls are displayed to the user, and streaming is preferred for those.

Other calls may be executed behind the scenes with no UI, for example extracting data, summarizing text, or classifying text into one of several categories. For these cases, we don't want streaming.

Streaming sends data between processes, possibly even across nodes. We can reduce unnecessary chatter by not streaming and instead waiting for the final result.

josevalim commented 7 months ago

The model needs to be compiled with the flag because streaming may compile a slightly different model. In any case, you could immediately consume the stream if you don't want to stream it? You could handle it yourself after you call the serving, no?
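The workaround José describes can be sketched as follows. This is a hedged illustration, not code from the thread: it assumes a serving built with `stream: true` has been started under the (hypothetical) name `MyApp.Serving`. Since the serving returns an enumerable of text chunks, consuming it eagerly with `Enum.join/1` yields the complete completion, effectively giving `stream: false` behavior for that one call:

```elixir
# Assumed setup (names are illustrative): a serving created with
# Bumblebee.Text.generation(..., stream: true) and started as MyApp.Serving.
#
# Streaming call: forward each chunk to the UI as it arrives.
MyApp.Serving
|> Nx.Serving.batched_run("Tell me about Elixir.")
|> Enum.each(fn chunk -> IO.write(chunk) end)

# Non-streaming behavior on the same serving: consume the whole stream
# eagerly and join the chunks into the final text before using it.
full_text =
  MyApp.Serving
  |> Nx.Serving.batched_run("Classify this support ticket.")
  |> Enum.join()
```

The design trade-off is that chunks still cross process (and possibly node) boundaries one at a time, so this does not remove the inter-process chatter mentioned above; it only restores the "wait for the finished result" calling convention.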

brainlid commented 7 months ago

Thanks @josevalim. Didn't realize the flag may cause the model to compile differently. That makes sense though. And yes, I can consume the full stream when that's the desired behavior. Thanks!