[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
Wanda is an unstructured pruning strategy; at present it brings only theoretical storage savings, with no inference speedup. Alternatively, you can try the quantization algorithms with `save_lightllm` and use our Lightllm engine for inference.
config:

```yaml
base:
    seed: &seed 42
model:
    type: Llama
    path: /data/Llama-2-7b-hf
    torch_dtype: auto
calib:
    name: pileval
    download: True
    path: calib data path
    n_samples: 128
    bs: -1
    seq_len: 512
    preproc: general
    seed: *seed
eval:
    eval_pos: [transformed]
    name: [wikitext2]
    download: True
    path: eval data path
    bs: 1
    seq_len: 2048
sparse:
    method: Wanda
    weight:
        sparsity: 0.5
        sparsity_out: True
save:
    save_fp: False
    save_trans: True
    save_path: ./save3
```

Original model:

After compression:
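To illustrate what the `sparsity: 0.5` setting does, here is a minimal NumPy sketch of the Wanda importance score. This is an assumption-laden illustration, not LLMC's actual implementation: Wanda scores each weight by `|W_ij| * ||X_j||_2` (the weight magnitude times the L2 norm of its input feature over the calibration data) and zeroes the lowest-scoring fraction of weights in each output row.

```python
import numpy as np

def wanda_prune(W: np.ndarray, X: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Sketch of Wanda unstructured pruning (not LLMC's code).

    W: weight matrix of a linear layer, shape (out_features, in_features)
    X: calibration activations, shape (n_tokens, in_features)
    Returns a copy of W with `sparsity` fraction of weights zeroed per row.
    """
    feat_norm = np.linalg.norm(X, axis=0)       # ||X_j||_2 per input feature
    score = np.abs(W) * feat_norm[None, :]      # importance |W_ij| * ||X_j||_2
    k = int(W.shape[1] * sparsity)              # number of weights to prune per row
    idx = np.argsort(score, axis=1)[:, :k]      # indices of lowest-score weights
    W_pruned = W.copy()
    np.put_along_axis(W_pruned, idx, 0.0, axis=1)
    return W_pruned
```

Because the pruning is unstructured (zeros scattered anywhere in the matrix), the dense storage layout is unchanged, which is why the savings are theoretical unless a sparse kernel or format is used.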