elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)
Apache License 2.0

Update LLM docs #352

Closed jonatanklosko closed 4 months ago

jonatanklosko commented 4 months ago

I did another iteration of this. Currently, running LLaMa 7B with params on the GPU requires 16GiB of memory. Params on the CPU + lazy transfers requires 15.12GiB, which is an almost negligible saving, and given that it increases inference time by roughly 4x, I think it's no longer worth mentioning. Sidenote: lazy transfers don't really change anything here, and that's what I would expect, since generation loops over the model and therefore all params need to be on the GPU. I'm not sure how not having params on the GPU makes a difference at all, since they can't be garbage collected early either, but the difference is very tiny anyway.
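For reference, a minimal sketch of the two configurations being compared (repo name and exact options are assumptions, not necessarily the snippet in the docs):

```elixir
repo = {:hf, "meta-llama/Llama-2-7b-hf"}

# Option 1: params placed directly on the GPU (default EXLA backend).
{:ok, model_info} = Bumblebee.load_model(repo, type: :bf16)

# Option 2: params kept on the CPU; they are copied to the GPU lazily
# during inference via the :lazy_transfers compiler option.
{:ok, model_info} =
  Bumblebee.load_model(repo, type: :bf16, backend: {EXLA.Backend, client: :host})

{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1024],
    defn_options: [compiler: EXLA, lazy_transfers: :always]
  )
```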

Note that for Stable Diffusion, params on the CPU + lazy transfers have more impact, because it uses several models: once one model finishes, its params can be garbage collected and the next model's params can be loaded lazily, so there it does make sense.
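A rough sketch of that setup (model repos and option values are assumptions): each of the Stable Diffusion models keeps its params on the host and transfers them to the GPU only while that model runs.

```elixir
repo_id = "CompVis/stable-diffusion-v1-4"
# Keep each model's params on the CPU until they are actually needed.
opts = [backend: {EXLA.Backend, client: :host}]

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-large-patch14"})
{:ok, clip} = Bumblebee.load_model({:hf, repo_id, subdir: "text_encoder"}, opts)
{:ok, unet} = Bumblebee.load_model({:hf, repo_id, subdir: "unet"}, opts)
{:ok, vae} = Bumblebee.load_model({:hf, repo_id, subdir: "vae"}, [architecture: :decoder] ++ opts)
{:ok, scheduler} = Bumblebee.load_scheduler({:hf, repo_id, subdir: "scheduler"})

serving =
  Bumblebee.Diffusion.StableDiffusion.text_to_image(clip, unet, vae, tokenizer, scheduler,
    num_steps: 20,
    compile: [batch_size: 1, sequence_length: 60],
    # Transfer each model's params to the GPU lazily, right before use.
    defn_options: [compiler: EXLA, lazy_transfers: :always]
  )

Nx.Serving.run(serving, "numbat, forest, high quality, detailed, digital art")
```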

I also added an example with Mistral.
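Roughly along these lines (a sketch, assuming the instruct variant and default generation options; the actual example in the docs may differ):

```elixir
repo = {:hf, "mistralai/Mistral-7B-Instruct-v0.1"}

{:ok, model_info} = Bumblebee.load_model(repo, type: :bf16)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

generation_config = Bumblebee.configure(generation_config, max_new_tokens: 256)

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1024],
    stream: true,
    defn_options: [compiler: EXLA]
  )

Nx.Serving.run(serving, "[INST] What is the capital of France? [/INST]")
```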