Reduce the output of generation loop when streaming

elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)

Apache License 2.0


Reduce the output of generation loop when streaming #337

Closed jonatanklosko closed 4 months ago

jonatanklosko commented 4 months ago

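When streaming tokens, we don't need the final output of the generation loop, because the tokens are already emitted as they are produced. This PR changes the result in that case to a zeroed tensor of shape {batch_size}, so the serving can still split it along the batch dimension.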
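To illustrate why a tensor of shape {batch_size} is enough, here is a minimal sketch using plain Nx. It is not Bumblebee's actual code: the module `StreamingPlaceholder`, its two functions, and the per-request sizes are hypothetical names chosen for illustration, and the splitting logic only assumes that the serving slices results along the first axis.

```elixir
Mix.install([{:nx, "~> 0.7"}])

defmodule StreamingPlaceholder do
  # Hypothetical helper: a zeroed result with one entry per batch item,
  # standing in for the full generation output that streaming does not need.
  def placeholder_output(batch_size) do
    Nx.broadcast(Nx.tensor(0, type: :s64), {batch_size})
  end

  # Hypothetical helper: split a batched result into per-request slices
  # along axis 0, given how many batch items each request contributed.
  def split_by_request(output, sizes) do
    {slices, _offset} =
      Enum.map_reduce(sizes, 0, fn size, offset ->
        {Nx.slice_along_axis(output, offset, size, axis: 0), offset + size}
      end)

    slices
  end
end

# A batch of 4 items that came from two requests (1 item and 3 items).
output = StreamingPlaceholder.placeholder_output(4)
slices = StreamingPlaceholder.split_by_request(output, [1, 3])

# Each request gets back a slice matching the number of items it sent.
IO.inspect(Enum.map(slices, &Nx.shape/1))
```

The point of the sketch is that the placeholder keeps a leading batch axis, so any code that divides results per request keeps working, while the tensor itself stays tiny compared to a full batch of generated sequences.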