bytedance / decoupleQ

A quantization algorithm for LLM
Apache License 2.0

Inference with the quantized model #1

Open ChuanhongLi opened 3 months ago

ChuanhongLi commented 3 months ago

Great work! I have two questions:

1) For a LLaMA model quantized with decoupleQ, what is the inference performance like? Are there any corresponding benchmark numbers?
2) How is the quantized model deployed for inference? In the README I saw NVIDIA/TensorRT-LLM#1568; is deployment done directly with TensorRT-LLM, and is there a corresponding inference/deployment script?

Thanks!

gavinchen430 commented 3 months ago

We are writing some examples covering how to produce the quantized model, how to build TensorRT-LLM, and how to use the w2a16 kernel in place of torch's bf16/fp16 kernels for inference. We will open-source them to this repository as soon as possible.

ChuanhongLi commented 3 months ago

We are writing some examples covering how to produce the quantized model, how to build TensorRT-LLM, and how to use the w2a16 kernel in place of torch's bf16/fp16 kernels for inference. We will open-source them to this repository as soon as possible.

Thanks for the reply! A follow-up question: does true_quant.pth contain the quantized weights? Can I simply swap them into the corresponding weights of the original model and run inference with the transformers library to take a quick look at the model output, or does inference have to go through TensorRT-LLM?

ChuanhongLi commented 3 months ago

We quantized Llama-2-7b-hf with --wbits 2. What puzzles us is that the saved true_quant.pth is 6.6 GB; is that normal? We also printed the contents of true_quant.pth: some tensors have dtype=torch.int8 and others torch.float16.

run_llama.sh
python3 llama.py LLaMA/Llama-2-7b-hf/ c4 --true-sequential --act-order --new-eval \
--wbits 2 \
--group-size -1 \
--nsamples 128 \
--max-iter-num 4 \
--iters-before-round 200 \
--inner-iters-for-round 5 \
--blockwise-minimize-epoch 4 \
--round-fn gptq \
--blockwise-minimize-lr 1.0e-5 \
--train-LN \
--save

GuoYi0 commented 3 months ago

@ChuanhongLi In the saved true_quant.pth, the main weights are indeed int8, with values in {-2, -1, 0, 1}; the scale, zero point, etc. are fp16.
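A minimal inspection sketch for this (assuming true_quant.pth is a flat torch state dict of tensors; the key names are whatever the checkpoint actually contains):

import torch

# Load the quantized checkpoint on CPU and report what was saved.
state = torch.load("true_quant.pth", map_location="cpu")
for name, tensor in state.items():
    if tensor.dtype == torch.int8:
        # 2-bit weights stored one value per int8; values should lie in {-2, -1, 0, 1}.
        print(name, tensor.dtype, tensor.unique().tolist())
    else:
        # Scales, zero points, and any untouched tensors stay fp16.
        print(name, tensor.dtype, tuple(tensor.shape))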

ChuanhongLi commented 3 months ago

Does true_quant.pth contain the quantized weights? Can I simply swap them into the corresponding weights of the original model and run inference with the transformers library to take a quick look at the model output, or does inference have to go through TensorRT-LLM?

@GuoYi0 @gavinchen430 Is this feasible? We would like to see how the quantized model performs at inference.

gavinchen430 commented 3 months ago

true_quant.pth stores the int2 data as int8. At inference time we pack the int8 data into int2, so the actual GPU memory usage is reduced fourfold. Saving the model in 8 bits for now is mainly for debugging and alignment convenience and serves no other special purpose, so if the file feels too large, you can perform this packing step at export time instead.

true_quant.pth cannot be run with the transformers library, because the 2-bit weight-only kernels are currently supported mainly inside TensorRT-LLM. The PR https://github.com/bytedance/decoupleQ/pull/2/ shows how to simply replace torch's native GEMM with the trtllm kernel to run inference on the 2-bit model.

If you only want to verify accuracy, you can run fake_quant.pth with transformers. That model is the fp16 model obtained by dequantizing true_quant.pth, so it is numerically equivalent.
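For that accuracy check, a minimal sketch (assuming fake_quant.pth is an fp16 state dict whose keys match the original Llama-2-7b-hf checkpoint; the prompt, paths, and strict=False handling are illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "LLaMA/Llama-2-7b-hf/"  # original HF checkpoint, as in run_llama.sh
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16)

# Overwrite the original weights with the dequantized (fake-quant) fp16 weights.
fake_state = torch.load("fake_quant.pth", map_location="cpu")
model.load_state_dict(fake_state, strict=False)
model.cuda().eval()

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))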

chuangzhidan commented 3 months ago

We are writing some examples covering how to produce the quantized model, how to build TensorRT-LLM, and how to use the w2a16 kernel in place of torch's bf16/fp16 kernels for inference. We will open-source them to this repository as soon as possible.

pip3 install datasets==1.17.0

python llama.py /media/data/xgp/model/Unichat-llama3-Chinese-8B-28K c4 --true-sequential --act-order --new-eval \ --wbits 2 \ --group-size -1 \ --nsamples 128 \ --max-iter-num 4 \ --iters-before-round 200 \ --inner-iters-for-round 5 \ --blockwise-minimize-epoch 4 \ --round-fn gptq \ --blockwise-minimize-lr 1.0e-5 \ --train-LN \ --save

Running this fails with the errors below. What could be the cause? The file path?

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
usage: llama.py [-h] [--seed SEED] [--nsamples NSAMPLES] [--percdamp PERCDAMP] [--nearest] [--wbits {2,3,4,8,16}] [--group-size GROUP_SIZE] [--sym] [--save] [--new-eval] [--act-order] [--true-sequential] [--static-groups] [--quant-method {optq,moq,moq_sequential,}] [--loss-thr LOSS_THR] [--max-iter-num MAX_ITER_NUM] [--inner-iters-for-round INNER_ITERS_FOR_ROUND] [--iters-before-round ITERS_BEFORE_ROUND] [--lr LR] [--round-fn {gptq,train}] [--blockwise-minimize-lr BLOCKWISE_MINIMIZE_LR] [--blockwise-minimize-wd BLOCKWISE_MINIMIZE_WD] [--blockwise-minimize-epoch BLOCKWISE_MINIMIZE_EPOCH] [--train-LN] [--train-bias] model {wikitext2,ptb,c4}
llama.py: error: unrecognized arguments:
run_llama.sh: line 4: --wbits: command not found
run_llama.sh: line 5: --group-size: command not found
run_llama.sh: line 6: --nsamples: command not found
run_llama.sh: line 7: --max-iter-num: command not found
run_llama.sh: line 8: --iters-before-round: command not found
run_llama.sh: line 9: --inner-iters-for-round: command not found
run_llama.sh: line 10: --blockwise-minimize-epoch: command not found
run_llama.sh: line 11: --round-fn: command not found
run_llama.sh: line 12: --blockwise-minimize-lr: command not found
run_llama.sh: line 13: --train-LN: command not found
run_llama.sh: line 14: --save: command not found

GuoYi0 commented 3 months ago


Each line of the command needs to end with a space followed by a backslash \ (line continuation), as in the run_llama.sh shown above.

chuangzhidan commented 3 months ago


It was the path; that is solved now. But after quantizing for most of a day, it hit an error at the very end:

······
time cost for block minimization: 99.12145829200745
quant layer 31 done! time cost 298.48425579071045 (is this measured in minutes?)

The quantization duration is 2.627319086591403 (is this in hours?)
Downloading: 8.48kB [00:00, 9.63MB/s]
Downloading: 6.84kB [00:00, 15.4kB/s]
Downloading and preparing dataset wikitext/wikitext-2-raw-v1 (download: 4.50 MiB, generated: 12.90 MiB, post-processed: Unknown size, total: 17.40 MiB) to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126...
Downloading: 243B [00:00, 282kB/s]
Traceback (most recent call last):
  File "/workspace/decoupleQ/llama.py", line 427, in <module>
    dataloader, testloader = get_loaders(
  File "/workspace/decoupleQ/datautils.py", line 206, in get_loaders
    return get_wikitext2(nsamples, seed, seqlen, model)
  File "/workspace/decoupleQ/datautils.py", line 22, in get_wikitext2
    traindata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
  File "/opt/conda/lib/python3.10/site-packages/datasets/load.py", line 1694, in load_dataset
    builder_instance.download_and_prepare(
  File "/opt/conda/lib/python3.10/site-packages/datasets/builder.py", line 595, in download_and_prepare
    self._download_and_prepare(
  File "/opt/conda/lib/python3.10/site-packages/datasets/builder.py", line 665, in _download_and_prepare
    verify_checksums(
  File "/opt/conda/lib/python3.10/site-packages/datasets/utils/info_utils.py", line 40, in verify_checksums
    raise NonMatchingChecksumError(error_msg + str(bad_urls))
datasets.utils.info_utils.NonMatchingChecksumError: Checksums didn't match for dataset source files: ['https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip']

chuangzhidan commented 3 months ago

If the file feels too large, you can perform this packing step at export time instead.

true_quant.pth is half the size of the original model; after packing at export, would it be only a quarter of true_quant.pth, i.e. one eighth of the original model? How do I do the packing at export time? ^^

gavinchen430 commented 3 months ago

https://github.com/bytedance/decoupleQ/blob/6fe5a2196512eae2634e58cf0c5ff5dd2949e5fc/csrc/w2a16.cu#L155-L177

Take a look at the packing in that function: the model currently stores int2 data in the int8 dtype, and the pack combines 4 int8 values into one int8 via bit operations. If you pack offline, then at deployment time you need to remove lines L155-L177 of w2a16.cu, and you may have to adapt the code a bit further.
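For intuition, a rough Python sketch of such a pack/unpack (this is not the w2a16.cu code; the mapping of {-2, -1, 0, 1} to 2-bit codes and the bit order within each byte are assumptions and must match whatever the kernel actually expects):

import numpy as np

def pack_int2(weights_int8: np.ndarray) -> np.ndarray:
    # Pack 4 signed 2-bit values (stored one per int8) into a single byte.
    assert weights_int8.size % 4 == 0
    # Assumed encoding: shift {-2, -1, 0, 1} to {0, 1, 2, 3} so each value fits in 2 bits.
    codes = (weights_int8.astype(np.int16) + 2).astype(np.uint8).reshape(-1, 4)
    return (codes[:, 0]
            | (codes[:, 1] << 2)
            | (codes[:, 2] << 4)
            | (codes[:, 3] << 6)).astype(np.uint8)

def unpack_int2(packed: np.ndarray) -> np.ndarray:
    # Inverse of pack_int2: recover int8 values in {-2, -1, 0, 1}.
    codes = np.stack([(packed >> s) & 0x3 for s in (0, 2, 4, 6)], axis=1)
    return codes.astype(np.int8).reshape(-1) - 2

w = np.array([-2, -1, 0, 1, 1, 0, -1, -2], dtype=np.int8)
assert np.array_equal(unpack_int2(pack_int2(w)), w)  # round-trip check

Packing the weight tensors this way at export time would shrink them by roughly 4x relative to the int8 checkpoint, while scales and zero points remain fp16.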

GuoYi0 commented 3 months ago


That error occurred while downloading the dataset. You may need a proxy to be able to download it. Better to download the data in advance and then start quantization, so you don't spend several hours quantizing only to find that the dataset download fails at the end.
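A minimal pre-download sketch, assuming the machine (or its proxy) can reach the Hugging Face hub; the evaluation in llama.py can then reuse the local datasets cache instead of downloading at the end of the run:

from datasets import load_dataset

# Warm the local cache (~/.cache/huggingface/datasets by default) before starting
# the multi-hour quantization job, so the final evaluation does not stall on a download.
load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
load_dataset("wikitext", "wikitext-2-raw-v1", split="test")

Note that with the pinned datasets==1.17.0 the same stale S3 URL may still fail; running this up front at least surfaces the download problem before the hours-long quantization rather than after it.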

chuangzhidan commented 3 months ago

https://github.com/bytedance/decoupleQ/blob/6fe5a2196512eae2634e58cf0c5ff5dd2949e5fc/csrc/w2a16.cu#L155-L177


Thanks for the patient and prompt replies. One last thing: for load_dataset('wikitext', 'wikitext-2-raw-v1', split='train'), where should I download this data to? And once I remove the pack step, do I need to run any extra command, or do I leave everything else unchanged and just delete those dozen or so lines of code?

chuangzhidan commented 3 months ago


Could I ask whether the wikitext dataset on Hugging Face (https://huggingface.co/datasets/wikitext) is the one that needs to be downloaded, and to which path?

GuoYi0 commented 3 months ago


How about setting up a proxy? The dataset will then be cached under .cache, and you can keep that .cache directory afterwards.

chuangzhidan commented 3 months ago


The main question is where to download it manually. Searching on HF, I find several datasets with the same name but different contents, and I am not sure which one to download. Also, the dataset URL cannot be opened; for wikitext, for example, https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip just returns:

This XML file does not appear to have any style information associated with it. The document tree is shown below.

AccessDenied Access Denied 2K9HKR96E4D0SSKB /gNTJqI0M9Ku8PfFNgObmn3uequLpErlZFar/YE++q4ClY4Q4vuf9+rWlsmVatx9/bLbZZ/ahf4=

It will not open whether or not I use a VPN. I am not sure how to set up a proxy on the server; I only know how to download manually on my own computer. My current download layout is shown in the screenshot below (the wiki folder is empty because of the AccessDenied error).

[screenshot of the local download directory]