Question: how does the example's quantize() method differ from quantizing with BitsAndBytesConfig?

Songjw133 commented 6 months ago

Quantizing with the method from the Baichuan example:

import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-7B-Chat", torch_dtype=torch.float16, trust_remote_code=True)
model = model.quantize(4).cuda()

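I traced the loading process: the model is first loaded into CPU memory, which takes a long time, then quantized, and only then moved to the GPU. If the machine does not have enough RAM, it can easily run out of memory.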
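To put a rough number on the RAM pressure described above, here is a back-of-the-envelope sketch. It assumes a nominal 7 billion parameters (the real parameter count of the checkpoint differs somewhat) and counts weights only, ignoring quantization metadata such as scales, activations, and loader overhead.

```python
def weight_bytes(n_params: int, bits_per_param: int) -> int:
    """Bytes needed to store n_params weights at the given bit width (weights only)."""
    return n_params * bits_per_param // 8


# Assumption: nominal "7B" parameter count, not the exact size of the checkpoint.
N_PARAMS = 7_000_000_000

fp16_bytes = weight_bytes(N_PARAMS, 16)  # what the float16 load must hold in RAM
int4_bytes = weight_bytes(N_PARAMS, 4)   # what remains after 4-bit quantization

print(f"float16 weights: {fp16_bytes / 2**30:.1f} GiB")
print(f"4-bit weights:   {int4_bytes / 2**30:.1f} GiB")
```

Under these assumptions the float16 copy alone needs roughly 13 GiB of host memory before quantization starts, about four times the size of the 4-bit weights that finally reach the GPU.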

I then tried this code:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
model = AutoModelForCausalLM.from_pretrained("Baichuan2-7B-Base", quantization_config=BitsAndBytesConfig(load_in_4bit=True), trust_remote_code=True)

Loading directly with load_in_4bit also works. It quantizes while loading straight onto the GPU, and it is much faster than the first method.

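But why do the two quantization methods produce different models? For example, I have set do_sample=False and temperature=0, yet for the same input the two models give different outputs. I also checked GPU memory usage: the second method uses a few hundred MB more than the first. What is the difference between these two quantization methods? I'm a beginner and would appreciate an explanation.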
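When comparing two greedy-decoded outputs like this, it can help to find where they first disagree, since with do_sample=False a single differing token changes everything generated after it. The following is a minimal sketch; `first_divergence` is a hypothetical helper, and the token ID lists are made-up placeholders rather than real output from either model.

```python
from typing import Optional, Sequence


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the index of the first position where a and b differ, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other: they diverge where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Made-up token IDs standing in for the two models' outputs on the same prompt.
out_quantize = [101, 2023, 2003, 1037, 3231]
out_bnb = [101, 2023, 2003, 2019, 3231, 102]

idx = first_divergence(out_quantize, out_bnb)
print("first differing position:", idx)
```

A mismatch at the very first generated token would point to substantially different weights, while a late mismatch is more consistent with small numerical differences accumulating.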

bc-gpd commented 6 months ago

Try aligning the models first. One of the snippets you posted uses the Baichuan2-7B-Chat model and the other uses the Baichuan2-7B-Base model.

baichuan-assistant commented 6 months ago

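A reply from our side, for reference: "Online quantization has to load the model onto the CPU as float16 first, then quantize it to 4-bit and send it to the GPU. The second method loads the model directly onto the GPU. Our model is bfloat16, so the results may differ slightly."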
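As a small illustration of why float16 and bfloat16 can give slightly different numbers, the sketch below rounds the same value into both formats using only the standard library. It is not a reproduction of either loading path; `to_float16` and `to_bfloat16` are hypothetical helpers, the sample value is arbitrary, and the bfloat16 routine assumes round-to-nearest-even on finite inputs.

```python
import struct


def to_float16(x: float) -> float:
    """Round x to the nearest IEEE 754 float16 value (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value (8 exponent bits, 7 mantissa bits).

    bfloat16 is the top 16 bits of a float32, so we round the float32 bit
    pattern to nearest-even and clear the low 16 bits. Finite inputs only.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties go to even
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


w = 0.1234567  # arbitrary value, chosen only for illustration
print("float16 :", to_float16(w))
print("bfloat16:", to_bfloat16(w))
```

float16 keeps more mantissa bits and bfloat16 keeps more exponent range, so the same real number lands on different grid points in the two formats. Arithmetic carried out in one format rather than the other can therefore drift apart slightly, which is consistent with the small output differences mentioned in the reply.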

baichuan-inc / Baichuan2, issue #321