OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
MIT License

‼️Llama2-70b not working #16

Closed · zhiwei-dong closed this 11 months ago

zhiwei-dong commented 11 months ago

There seems to be an issue with OmniQuant on the Llama2-70b model. The problem is a shape mismatch between the learned scales and the weights under grouped-query attention (GQA).
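For context, here is a minimal numpy sketch of why a per-channel scale sized for the query projection cannot be applied to the key projection under GQA. The dimensions are Llama-2-70B's published config (64 query heads, 8 KV heads, head_dim 128); the variable names `Wq`, `Wk`, `scale` are illustrative, not OmniQuant's actual identifiers:

```python
import numpy as np

# Llama-2-70B attention shapes (GQA): 64 query heads share 8 KV heads.
hidden, n_heads, n_kv_heads, head_dim = 8192, 64, 8, 128

Wq = np.zeros((n_heads * head_dim, hidden), dtype=np.float16)     # (8192, 8192)
Wk = np.zeros((n_kv_heads * head_dim, hidden), dtype=np.float16)  # (1024, 8192)

# A per-output-channel scale sized for Wq, as in standard multi-head attention.
scale = np.ones((n_heads * head_dim, 1), dtype=np.float16)        # (8192, 1)

Wq_scaled = Wq / scale  # fine: (8192, 8192) / (8192, 1)

try:
    Wk_scaled = Wk * scale  # Wk has only 1024 output channels, not 8192
except ValueError as e:
    print("shape mismatch:", e)
```

Under plain multi-head attention the query and key projections have the same number of output channels, so one scale vector fits both; with GQA the key/value projections are 8x narrower, which is presumably why the LET scales mismatch on 70B.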

ChenMnZ commented 11 months ago

Yeah, we do not yet support LET (learnable equivalent transformation) for LLaMA-2-70B.

zhiwei-dong commented 11 months ago

Got it.

zhiwei-dong commented 11 months ago

Here's another question: I tried to turn off some of the smoothing, but it seems multi-GPU setups are not taken into account when quantizing the 70b model.

ChenMnZ commented 11 months ago

Thanks for your feedback. I have fixed this bug.

zhiwei-dong commented 11 months ago

Got it.

hsb1995 commented 4 months ago

I set batch size = 1.

hsb1995 commented 4 months ago

Why is 40GB of GPU memory used during the training phase? Does it really need to consume that much memory in the end?

hsb1995 commented 4 months ago

Why does compressing llama65B on an A800 (80GB) overflow GPU memory? The paper says 40GB of GPU memory is enough to compress the 65B model. @zhiwei-dong @ChenMnZ
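For what it's worth, a back-of-the-envelope estimate of the memory involved. The LLaMA-65B shapes below (hidden 8192, intermediate 22016, 80 decoder layers) are my assumption from the model's published config, not from this repo; block-wise quantizers like OmniQuant only need one decoder block on the GPU at a time, so the per-block footprint, not the full model, should dominate:

```python
# Rough memory arithmetic for LLaMA-65B (assumed config: hidden=8192,
# intermediate=22016, 80 decoder layers).
hidden, intermediate, n_layers = 8192, 22016, 80

attn_params = 4 * hidden * hidden        # q/k/v/o projections
mlp_params = 3 * hidden * intermediate   # gate/up/down projections
block_params = attn_params + mlp_params  # weights in one decoder block

GiB = 1024 ** 3
block_fp16_gib = block_params * 2 / GiB  # fp16 = 2 bytes per parameter
model_fp16_gib = 65e9 * 2 / GiB          # the whole model in fp16

print(f"one block  : {block_fp16_gib:.2f} GiB")  # ~1.5 GiB
print(f"full model : {model_fp16_gib:.1f} GiB")  # ~121 GiB
```

A single block (plus cached inputs, gradients for the learnable parameters, and optimizer state) is small next to an 80GB card, but the full fp16 checkpoint is ~121 GiB. If the whole model is being materialized on one GPU instead of being kept on CPU and moved block by block, that alone would explain the OOM.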