Closed: Neyri closed this issue 1 month ago
Hello @Neyri
Thanks for the feedback!
The `model_required_memory` function computes the estimated VRAM footprint of a model once it is loaded on the GPU. We assume models are quantized to 4 bits (int4), which is likely not the case for every provider, but it balances out the lack of support for other optimizations on our side.
We base our calculation on this blog post. You are right that models are usually represented in float32, i.e. 4 bytes (32 bits) per parameter; in that case, the memory footprint at inference is approximated by:
$$ 4\ \text{bytes} \times \text{(No. Params)} \times 1.2 = \frac{32\ \text{bits}}{8\ \text{bits/byte}} \times \text{(No. Params)} \times 1.2 $$
So for a model quantized to 4 bits (represented as int4), the memory footprint is approximated by:
$$ \frac{4\ \text{bits}}{8\ \text{bits/byte}} \times \text{(No. Params)} \times 1.2 $$
The result is in Gigabytes (GB) because we count the number of parameters in billions.
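To make the arithmetic concrete: a 70B-parameter model quantized to int4 needs roughly 0.5 bytes × 70 × 1.2 = 42 GB, versus 4 bytes × 70 × 1.2 = 336 GB in float32. Below is a minimal Python sketch of that estimate; the constant names mirror the ones discussed in this issue, while the overhead factor name and the exact function signature are assumptions and may differ from the package's actual code.

```python
# Minimal sketch of the estimate above; not the package's exact implementation.
MODEL_QUANTIZATION_BITS = 4   # assumed int4 quantization (constant discussed in this issue)
MEMORY_OVERHEAD_FACTOR = 1.2  # ~20% inference overhead (hypothetical name)


def model_required_memory(n_params_billion: float) -> float:
    """Approximate VRAM needed to serve a model, in GB.

    Bits are converted to bytes per parameter (bits / 8), and since the
    parameter count is expressed in billions, the result is directly in GB.
    """
    bytes_per_param = MODEL_QUANTIZATION_BITS / 8  # 0.5 bytes per parameter for int4
    return bytes_per_param * n_params_billion * MEMORY_OVERHEAD_FACTOR


print(model_required_memory(70))  # -> 42.0 GB for a 70B-parameter model in int4
```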
Hi,
Thank you for the great work on this package. While reading through the code and constants, I was wondering whether there is an issue with the `MODEL_QUANTIZATION_BITS = 4` constant, which is then divided by 8 in the `model_required_memory` function. As I understand it from the docs, it should represent the encoding size of the LLM parameters, which is compared with the available RAM on the GPU. The GPU RAM constant seems to be expressed in bytes (8 bits each), and from what I've read the encoding size is usually 32 bits (4 bytes). I think there is a mistake in the units there. Can you confirm?