bitsandbytes-foundation / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index
MIT License

Huge difference in llama2 parameter count after 4bit loading #834

Closed: akshayiyer2610 closed this issue 9 months ago

akshayiyer2610 commented 11 months ago

After loading the llama2-7b-hf model using 4-bit quantization, the total parameter count is reduced to ~3.5B. Is this a bug or the expected behavior?

Packages: bitsandbytes 0.41.1, transformers 4.33.2, torch 2.0.1

Code:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Local copy of the llama2 model downloaded from Hugging Face
model_name = "local_llama2/llama2-7b-hf"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

all_param = 0
for _, param in model.named_parameters():
    num_params = param.numel()
    all_param += num_params

print(f"all params: {all_param}")

The output of the print statement is 3,540,389,888.

NPap0 commented 11 months ago

@akshayiyer2610 In the source code, when picking 4-bit, the parameter count is divided by 2, so this is intended. (I don't know why, though; I just remember it behaving that way.)
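
The halving itself makes sense once you look at how the 4-bit weights are stored: bitsandbytes packs two 4-bit values into each uint8 element, so numel() on a packed tensor reports half the logical count. A quick way to see this against the model loaded above (a minimal sketch; the layer path assumes the standard transformers llama architecture):

# Inspect one quantized projection weight. With load_in_4bit=True,
# transformers swaps nn.Linear for bnb.nn.Linear4bit, whose weight is
# a Params4bit tensor: two 4-bit values packed per uint8 element.
w = model.model.layers[0].self_attn.q_proj.weight
print(type(w).__name__)  # Params4bit
print(w.dtype)           # torch.uint8
print(w.numel())         # 8388608, half of the 4096*4096 logical weights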

akshayiyer2610 commented 11 months ago

@OneCodeToRuleThemAll Can you point me to the source where it's divided by 2? Does that imply the other ~3.5B parameters are frozen during the quantization process?

NPap0 commented 11 months ago

> @OneCodeToRuleThemAll Can you point me to the source where it's divided by 2? Does that imply the other ~3.5B parameters are frozen during the quantization process?

@akshayiyer2610 Okay, so I had to do a little digging to find the code snippet again. It's from the qlora repo:

https://github.com/artidoro/qlora/blob/7f4e95a68dc076bea9b3a413d2b512eca6d004e5/qlora.py#L408C1-L423C6

And here is an issue from that repo asking the same question (it remains unanswered): https://github.com/artidoro/qlora/issues/260
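
If you want the logical count back, you can correct for the packing yourself. Here is a minimal sketch (assuming bitsandbytes 0.41.x, where 4-bit weights are bitsandbytes.nn.Params4bit tensors). Note the corrected total won't be exactly twice the reported ~3.5B: as far as I can tell, the embedding, norm, and (by default in transformers) lm_head weights are left unquantized, so their numel() is never halved in the first place.

import bitsandbytes as bnb

def count_logical_params(model):
    total = 0
    for _, param in model.named_parameters():
        n = param.numel()
        # Params4bit packs two 4-bit weights into each uint8 element,
        # so numel() reports half the logical parameter count.
        if isinstance(param, bnb.nn.Params4bit):
            n *= 2
        total += n
    return total

print(f"all params: {count_logical_params(model)}")  # back to ~6.7B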

akshayiyer2610 commented 11 months ago

Thanks for the link to the source code, @OneCodeToRuleThemAll. Much appreciated.

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

arvindpdmn commented 4 months ago

The bot closed this issue even though it hasn't been resolved. Quantization reduces memory requirements but has no effect on the parameter count. I see this as a bug.
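
To make the distinction concrete, here is the back-of-the-envelope arithmetic (6,738,415,616 is the published llama2-7b parameter count; quantization overhead such as absmax scales is ignored):

n_params = 6_738_415_616          # llama2-7b logical parameter count
fp16_gib = n_params * 2 / 2**30   # ~12.6 GiB of weights at 16 bits each
nf4_gib = n_params * 0.5 / 2**30  # ~3.1 GiB packed at 4 bits each
print(f"fp16: {fp16_gib:.1f} GiB, nf4: {nf4_gib:.1f} GiB")
# The parameter count itself is unchanged; only the bytes per parameter shrink.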