Aaronhuang-778 / BiLLM

(ICML 2024) BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
https://arxiv.org/abs/2402.04291
MIT License

Issue with code replication #14

Open · Devy99 opened 5 months ago

Devy99 commented 5 months ago

Hello, first and foremost, I want to thank you for your incredible work!

I'd like some further information on how to reproduce your results. I followed the instructions in your README, but I am unable to obtain the quantized models.

The following are the steps I took to replicate your work:

  1. Installing dependencies from the requirements.txt file: to successfully execute the main script, I also needed to install the following additional packages: torch, exceptiongroup, pyparsing, sentencepiece
  2. Modifying the run.py script: it appears that the c4 dataset is not available in the current configuration, so I removed it from the list of evaluation datasets (a sketch of this change is shown after the list).
  3. Running the bash script: I launched the run.sh with the following settings.
    python3 run.py meta-llama/Llama-2-7b-hf wikitext2 braq --blocksize 128 --salient_metric hessian --device "cuda:0" --save

    Departing from the README instructions, I switched the calibration dataset to wikitext2 (because c4 is not accessible) and added the --save option to obtain the quantized model in the output folder.
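
For clarity, the edit to run.py mentioned in step 2 was along these lines (a sketch only; the actual variable name in the script may differ):

    # Sketch of the edit to run.py (the real variable name in the repo may differ):
    # drop "c4" from the evaluation datasets because it could not be downloaded
    # in my environment.
    eval_datasets = ["wikitext2", "ptb"]  # previously included "c4" as well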

However, the final model does not match the expected one: its size is identical to the original model's, and the weights do not appear to have been quantized.
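
For reference, this is the kind of sanity check that shows it (assuming the output is a plain PyTorch state dict, which may not match the actual save format; the path below is a placeholder):

    import torch

    # Placeholder path for whatever --save wrote to the output folder.
    state_dict = torch.load("output/checkpoint.pt", map_location="cpu")
    for name, tensor in list(state_dict.items())[:5]:
        if tensor.dim() == 2:
            # A truly binarized matrix would hold only a few distinct values;
            # these tensors still look like ordinary floating-point weights.
            print(name, tensor.dtype, tensor.shape, tensor.unique().numel())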

Did I miss something? Also, should I run the inference using a specific procedure?

Aaronhuang-778 commented 5 months ago

Hi, in this version we only release fake quantization, to validate the theoretical compression performance bound of LLMs. This code does not support saving a quantized model.
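
To illustrate what fake quantization means here (a minimal sketch of the general idea, not the actual BiLLM code): the weights are replaced by their binarized values but are still stored in floating point, so the saved checkpoint is the same size as the original model. Real memory savings would require packing the binary weights into a dedicated low-bit storage format, which this release does not include.

    import torch

    def fake_binarize(weight: torch.Tensor) -> torch.Tensor:
        # Illustrative simulated binarization (not the BiLLM algorithm itself):
        # each weight becomes sign(w) times the per-row mean of |w|, but the
        # result keeps the original floating-point dtype, so the storage cost
        # is unchanged.
        scale = weight.abs().mean(dim=1, keepdim=True)  # per-row scaling factor
        return torch.sign(weight) * scale

    w = torch.randn(4096, 4096)   # stand-in for a weight matrix
    w_q = fake_binarize(w)
    assert w_q.dtype == w.dtype and w_q.shape == w.shape  # same size on disk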

Devy99 commented 5 months ago

Thanks for the prompt response! Do you plan to release the quantization pipeline in the near future?