[quantizer] add Odyssey-style symmetric quantization

xingchensong commented 6 months ago

What does this PR do?

Implements OdysseyLLM-style symmetric quantization, which disables the zero_point and is more hardware-efficient than the current version.

ref: https://arxiv.org/pdf/2311.09550v1.pdf
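
For readers unfamiliar with the distinction, here is a minimal pure-Python sketch of the two schemes in their textbook form: min/max quantization with a zero point on an unsigned grid, and symmetric quantization on a signed grid with no zero point. The function names `quantize_asymmetric` and `quantize_symmetric` and the sample values are illustrative assumptions, not code from this PR, and the PR's quantizer may differ in details such as clipping or learnable bounds.

```python
def quantize_asymmetric(values, n_bits=4):
    """Min/max quantization with a zero point (unsigned grid 0 .. 2^n - 1)."""
    qmax = 2 ** n_bits - 1
    lo, hi = min(values), max(values)
    scale = max(hi - lo, 1e-8) / qmax
    zero_point = round(-lo / scale)
    q = [min(max(round(v / scale) + zero_point, 0), qmax) for v in values]
    # Dequantization has to subtract the zero point before rescaling.
    dequant = [(qi - zero_point) * scale for qi in q]
    return q, dequant, scale, zero_point


def quantize_symmetric(values, n_bits=4):
    """Symmetric quantization without a zero point (signed grid)."""
    qmax = 2 ** (n_bits - 1) - 1
    qmin = -(2 ** (n_bits - 1))
    absmax = max(abs(v) for v in values)
    scale = max(absmax, 1e-8) / qmax
    q = [min(max(round(v / scale), qmin), qmax) for v in values]
    # Dequantization is a single multiply: no offset term.
    dequant = [qi * scale for qi in q]
    return q, dequant, scale


# Hypothetical weight values, chosen only for illustration.
weights = [-0.8, -0.1, 0.0, 0.3, 0.5]
q_asym, deq_asym, s_asym, zp = quantize_asymmetric(weights)
q_sym, deq_sym, s_sym = quantize_symmetric(weights)
```

In the symmetric case the integer codes map back to real values with one multiplication, which is the property that makes the scheme attractive for integer kernels.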

current version:

Odyssey version:

Benchmark (W4A8, W per-channel, A per-token)

Model: Llama-2-7b-chat

| calibration dataset | PPL (wiki2) | PPL (ptb) | PPL (c4) | additional args |
| --- | --- | --- | --- | --- |
| NONE (fp16) | 7.076 | 28.138 | - | - |
| wiki2 | 7.456 | 51.077 | - | - |
| ptb | 7.638 | 30.797 | 9.648 | - |
| mix (wiki2 + ptb + c4) | 7.485 | 33.096 | 9.487 | - |
| mix (wiki2 + ptb + c4) | 7.575 | 33.673 | 9.550 | --symmetric |
| mix (wiki2 + ptb + c4) | 7.577 | 32.644 | 9.522 | --symmetric --disable_zero_point |
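
The benchmark setting (W4A8, weights per-channel, activations per-token) can be illustrated with a small pure-Python sketch. It assumes symmetric quantization without a zero point on both operands; the helpers `quantize_rows` and `w4a8_linear` and the sample matrices are hypothetical and are not taken from this repository. The sketch shows that, with no zero point, the inner product accumulates plain integers and needs one rescale per output element.

```python
def quantize_rows(matrix, n_bits):
    """Symmetric quantization with one scale per row; returns (int rows, scales)."""
    qmax = 2 ** (n_bits - 1) - 1
    qmin = -(2 ** (n_bits - 1))
    q_rows, scales = [], []
    for row in matrix:
        scale = max(max(abs(v) for v in row), 1e-8) / qmax
        q_rows.append([min(max(round(v / scale), qmin), qmax) for v in row])
        scales.append(scale)
    return q_rows, scales


def w4a8_linear(x, w):
    """Approximate x @ w.T with x as [tokens][in] and w as [out][in]."""
    qx, sx = quantize_rows(x, 8)  # one scale per token (activation row)
    qw, sw = quantize_rows(w, 4)  # one scale per output channel (weight row)
    out = []
    for t, x_row in enumerate(qx):
        out_row = []
        for c, w_row in enumerate(qw):
            acc = sum(a * b for a, b in zip(x_row, w_row))  # integer accumulation
            out_row.append(acc * sx[t] * sw[c])  # single rescale, no offset terms
        out.append(out_row)
    return out


# Hypothetical activations (2 tokens) and weights (2 output channels).
x = [[0.5, -1.0], [2.0, 0.25]]
w = [[0.7, -0.7], [0.1, 0.35]]
approx = w4a8_linear(x, w)
exact = [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]
```

With a nonzero zero point on either operand, the product would pick up extra cross terms that must be computed and subtracted, which is the overhead the symmetric scheme avoids.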

Reproduce

# https://github.com/OpenGVLab/OmniQuant/issues/37
CUDA_VISIBLE_DEVICES=0 python main.py \
  --model /jfs-hdfs/user/xingchen.song/share/LLM/Llama-2-7b-chat --eval_ppl \
  --epochs 60 --output_dir ./log/Llama-2-7b-chat-w4a8-ep60-mix-sym-odyssey \
  --wbits 4 --abits 8 --lwc --aug_loss --deactive_amp \
  --let --let_lr 1e-3 --alpha 0.75 \
  --calib_dataset mix --symmetric --disable_zero_point

xingchensong commented 6 months ago

cc @ChenMnZ. BTW, I am eager to replicate OdysseyLLM (OmniQuant + GPTQ; see Sections 5.1 and 5.2 of the paper) using this repository and will submit a PR once it is done.

ChenMnZ commented 6 months ago

@xingchensong Thanks for contributing symmetric quantization.

Also looking forward to your reproduction of OdysseyLLM.
