Improve tensor allocations in servings

Closes #217.

Tokenization inputs are now always allocated using the binary backend, because that is zero-copy and there is no reason to involve XLA that early.
This also adds a new :preallocate_params option that moves params to the device defined by :defn_options. It can be useful with multiple GPUs: we can load params onto the CPU and then use :preallocate_params so that each serving partition allocates params on its corresponding device.
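
A minimal sketch of the multi-GPU setup described above. The model repository, the `MyApp.Serving` name, the batch and sequence sizes, and the `:cuda` client are illustrative assumptions, and the exact option names may differ between Bumblebee and Nx versions:

```elixir
# Load params onto the host (CPU) so that no single GPU holds them up front.
# The model repository here is only an example.
{:ok, model_info} =
  Bumblebee.load_model({:hf, "bert-base-uncased"},
    backend: {EXLA.Backend, client: :host}
  )

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "bert-base-uncased"})

serving =
  Bumblebee.Text.fill_mask(model_info, tokenizer,
    compile: [batch_size: 4, sequence_length: 128],
    # Assumes an EXLA build with a CUDA client available.
    defn_options: [compiler: EXLA, client: :cuda],
    # Move params to the device selected by :defn_options
    # when the serving is initialized.
    preallocate_params: true
  )

# With partitions: true, Nx.Serving starts one partition per device,
# and each partition allocates its own copy of the params.
children = [
  {Nx.Serving, serving: serving, name: MyApp.Serving, batch_size: 4, partitions: true}
]

{:ok, _supervisor} = Supervisor.start_link(children, strategy: :one_for_one)

Nx.Serving.batched_run(MyApp.Serving, "The capital of France is [MASK].")
```

Without :preallocate_params, the params would stay on the host backend they were loaded with until the compiled computation first needs them.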

elixir-nx / bumblebee