genai-impact / ecologits

🌱 EcoLogits tracks the energy consumption and environmental footprint of using generative AI models through APIs.
https://ecologits.ai/
Mozilla Public License 2.0

Quantization bit (or byte) value #84

Closed · Neyri closed this issue 1 month ago

Neyri commented 1 month ago

Hi,

Thank you for the great work on this package. While reading through the code and constants, I was wondering whether there is an issue with the MODEL_QUANTIZATION_BITS = 4 constant, which is then divided by 8 (in the model_required_memory function). As I understand it from the docs, it should represent the number of bits used to encode each LLM parameter, which is then compared with the available RAM on the GPU. The GPU RAM constants seem to be expressed in bytes (8 bits each). From what I've read, parameters are usually encoded with 32 bits (4 bytes), so I think there may be a mistake in the units there. Can you confirm?

samuelrince commented 1 month ago

Hello @Neyri

Thanks for the feedback!

The model_required_memory function computes the estimated VRAM needed when the model is loaded on the GPU. We assume models are quantized to 4 bits (int4), which is likely not the case for every provider, but it balances out the lack of support for other optimizations on our side.

We base our calculation on this blog post. You are right that models are usually represented in float32, i.e. 4 bytes (32 bits) per parameter; in that case, the memory footprint at inference is approximated as:

$$ 4\ \text{bytes} \times \text{(No. Params)} \times 1.2 = \frac{32\ \text{bits}}{8\ \text{bits/byte}} \times \text{(No. Params)} \times 1.2 $$
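
As a quick illustration (the parameter count here is arbitrary, just to make the units concrete): a 7B-parameter model in float32 would need roughly $4 \times 7 \times 1.2 \approx 33.6$ GB of VRAM.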

So for a model quantized at 4 bits (represented with int4), the memory footprint is approximated with:

$$ \frac{4\ \text{bits}}{8\ \text{bits/byte}} \times \text{(No. Params)} \times 1.2 $$

The result is in gigabytes (GB) because the per-parameter size is in bytes and we count the number of parameters in billions, so the product comes out in billions of bytes.
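
To make the units explicit, here is a minimal Python sketch of that estimate (the function name, signature, and the example parameter counts are mine for illustration, not the exact model_required_memory implementation):

```python
def estimated_memory_gb(num_params_billion: float,
                        quantization_bits: int = 4,
                        overhead: float = 1.2) -> float:
    """Rough VRAM needed to serve a model, in gigabytes (GB).

    quantization_bits / 8 converts the per-parameter encoding to bytes;
    multiplying by the parameter count in billions gives billions of bytes
    (~GB); the 1.2 factor accounts for inference overhead.
    """
    return quantization_bits / 8 * num_params_billion * overhead


# Hypothetical 70B-parameter model:
print(estimated_memory_gb(70))       # int4    -> 42.0 GB
print(estimated_memory_gb(70, 32))   # float32 -> 336.0 GB
```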