[Open] Sekri0 opened this issue 1 month ago
Hello, the weights obtained this way are the calibrated fake-quantized weights. To achieve actual weight compression, a packing step is required when storing the weights. For example, with INT2 quantization, the packing operation stores 16 INT2 values in a single INT32.
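For illustration, here is a minimal sketch of such a packing step in PyTorch. This is not the ABQ-LLM packing layout or kernel, just the general idea, assuming unsigned 2-bit codes in [0, 3]:

```python
import torch

def pack_int2_to_int32(codes: torch.Tensor) -> torch.Tensor:
    """Pack a flat tensor of 2-bit codes (values 0..3) into int32 words, 16 per word."""
    assert codes.numel() % 16 == 0, "pad to a multiple of 16 before packing"
    codes = codes.to(torch.int32).reshape(-1, 16)
    packed = torch.zeros(codes.shape[0], dtype=torch.int32)
    for i in range(16):
        # Each code occupies its own 2-bit slot; the top slot may set the sign bit,
        # which is fine for bitwise round-tripping.
        packed |= codes[:, i] << (2 * i)
    return packed

def unpack_int32_to_int2(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_int2_to_int32: recover the 16 codes stored in each word."""
    return torch.stack([(packed >> (2 * i)) & 0x3 for i in range(16)], dim=1).reshape(-1)

# Round trip. With this storage, fp16 weights shrink by roughly 8x for the packed
# payload (ignoring per-group scales and zero-points).
codes = torch.randint(0, 4, (32,))
assert torch.equal(unpack_int32_to_int2(pack_int2_to_int32(codes)), codes.to(torch.int32))
```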
We are developing a complete pipeline from pseudo-quantized models to real packed weights, with WxAy quantized inference executed directly in Torch. It is expected to be released within a week after the National Day holiday. We did not release this pipeline earlier because the engine used inside ByteDance is a pure C++ solution.
Thank you for your work. May I ask when you plan to release the code for real weight packing and Torch inference?
It is expected this quarter.
I used this command to quantize the llama2-7b-chat model, but the model size doesn't change:

CUDA_VISIBLE_DEVICES=0 python3 main.py \
  --model /mnt/home/model/llama2-7b-chat-hf \
  --epochs 20 --output_dir ./log/llama2-7b-w2a8 \
  --eval_ppl --wbits 2 --abits 8 --lwc --let \
  --tasks piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande \
  --real_quant \
  --save_dir /mnt/home/model/abq-llm/llama2-7b-w2a8
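As explained in the maintainers' reply above, the checkpoint saved at this stage still stores fp16 fake-quantized weights, which is why the size on disk does not shrink. A quick way to confirm the stored dtype and size, assuming the checkpoint is written in standard Hugging Face format under --save_dir (the file name below is a guess, not taken from the repo):

```python
import torch

# Assumed file name; adjust to whatever main.py actually writes under --save_dir.
ckpt = "/mnt/home/model/abq-llm/llama2-7b-w2a8/pytorch_model.bin"
state = torch.load(ckpt, map_location="cpu")

total_bytes = sum(t.numel() * t.element_size() for t in state.values())
dtypes = {str(t.dtype) for t in state.values()}
print("dtypes in checkpoint:", dtypes)            # expect torch.float16 for fake-quant weights
print(f"total size: {total_bytes / 1e9:.2f} GB")  # roughly 13 GB for a 7B model in fp16
```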