elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)
Apache License 2.0
1.33k stars 96 forks source link

Improve tensor allocations in servings #233

Closed jonatanklosko closed 1 year ago

jonatanklosko commented 1 year ago

Closes #217.

  1. We want to always allocate tokenization input using binary backend, because it's zero copy, and there is no reason to involve XLA too early.

  2. A new :preallocate_params option that moves params to the device as defined by :defn_options. This can be useful with multiple GPUs, where we could load params into CPU and then use :preallocate_params so each serving partition allocates params on the corresponding device.