bytedance / ABQ-LLM

An acceleration library that supports arbitrary bit-width combinatorial quantization operations
Apache License 2.0

No reduction in model size #15

Open Sekri0 opened 1 month ago

Sekri0 commented 1 month ago

I use this command to quantize the llama2-7b-chat model, but the model size doesn't change:

```bash
CUDA_VISIBLE_DEVICES=0 python3 main.py \
    --model /mnt/home/model/llama2-7b-chat-hf \
    --epochs 20 --output_dir ./log/llama2-7b-w2a8 \
    --eval_ppl --wbits 2 --abits 8 --lwc --let \
    --tasks piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande \
    --real_quant \
    --save_dir /mnt/home/model/abq-llm/llama2-7b-w2a8
```

zengchao0424 commented 1 month ago

Hello, the weights obtained this way are the calibrated fake-quantized weights. To achieve actual weight compression, a packing step is required when storing the weights. For example, with INT2 quantization, the packing operation stores 16 INT2 values in a single INT32.
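
For reference, here is a minimal sketch of such a packing step in PyTorch. The little-endian bit layout and the function name `pack_int2` are assumptions for illustration; the actual layout expected by ABQ-LLM's kernels may differ.

```python
import torch

def pack_int2(codes: torch.Tensor) -> torch.Tensor:
    """Pack a flat tensor of 2-bit codes (integer values in [0, 3])
    into int32 words, 16 codes per word. Illustrative layout only."""
    assert codes.numel() % 16 == 0
    codes = codes.to(torch.int32).reshape(-1, 16)
    packed = torch.zeros(codes.shape[0], dtype=torch.int32)
    for i in range(16):
        # Code i occupies bits [2i, 2i+1]; the last slot lands in the
        # sign bit, which is harmless for pure storage.
        packed |= (codes[:, i] & 0x3) << (2 * i)
    return packed

# Example: 32 two-bit codes -> 2 int32 words, i.e. 2 bits per element
# instead of 16 bits for fp16 storage (8x smaller on disk).
w = torch.randint(0, 4, (32,))
print(pack_int2(w).shape)  # torch.Size([2])
```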

lswzjuer commented 1 month ago

We are developing a complete pipeline that goes from fake-quantized models to real packed weights and runs WxAy quantized inference directly in Torch; it is expected to be released within a week after the National Day holiday. We did not release this pipeline earlier because the engine inside ByteDance is a pure C++ solution.
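
To make the idea concrete, below is a reference-level sketch of what such Torch-side inference over packed weights amounts to: unpack, dequantize, then matmul. The names `unpack_int2` and `w2_linear`, the bit layout (matching the earlier packing sketch), and the per-tensor scale/zero-point are all assumptions, not ABQ-LLM's actual API; a production kernel would keep the data packed and compute in the integer domain.

```python
import torch

def unpack_int2(packed: torch.Tensor, n: int) -> torch.Tensor:
    """Recover n 2-bit codes from int32 words (little-endian 2-bit slots)."""
    idx = torch.arange(16, dtype=torch.int32)
    codes = (packed.unsqueeze(1) >> (2 * idx)) & 0x3  # mask fixes sign-extension
    return codes.reshape(-1)[:n]

def w2_linear(x, packed, scale, zero, out_feat, in_feat):
    """Reference W2 linear layer: unpack, dequantize, matmul in float.
    Illustrative only; real W2A8 kernels avoid materializing w_deq."""
    w = unpack_int2(packed, out_feat * in_feat).reshape(out_feat, in_feat)
    w_deq = (w.float() - zero) * scale  # per-tensor dequant for simplicity
    return x @ w_deq.t()

# Toy usage: a 4x32 weight matrix packed into 8 int32 words (random bits,
# hypothetical scale/zero-point just to show the shapes).
x = torch.randn(2, 32)  # activations; int8 in a real W2A8 setup
packed = torch.randint(0, 2**31 - 1, (8,), dtype=torch.int32)
y = w2_linear(x, packed, scale=0.05, zero=2.0, out_feat=4, in_feat=32)
print(y.shape)  # torch.Size([2, 4])
```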

Sekri0 commented 1 month ago

> We are developing a complete pipeline that goes from fake-quantized models to real packed weights and runs WxAy quantized inference directly in Torch; it is expected to be released within a week after the National Day holiday. We did not release this pipeline earlier because the engine inside ByteDance is a pure C++ solution.

Thank you for your work. May I ask when you plan to release the code for real weight packing and Torch inference?

lswzjuer commented 1 month ago

It is expected within this quarter.
