bytedance / decoupleQ

A quantization algorithm for LLM
Apache License 2.0

Inference with the quantized model #1

Open ChuanhongLi opened 3 months ago

ChuanhongLi commented 3 months ago

Great work! I have two questions:

1) For a LLaMA model quantized with decoupleQ, what is the inference performance like? Are there any corresponding benchmark numbers?
2) How is the quantized model deployed for inference? In the README I saw NVIDIA/TensorRT-LLM#1568; is deployment done directly with TensorRT-LLM, and is there a corresponding inference/deployment script?

Thanks!

gavinchen430 commented 3 months ago

We are writing some examples covering how to produce the quantized model, how to build TensorRT-LLM, and how to use the w2a16 kernel in place of torch's bf16/fp16 kernels for inference. We will open-source them to this repository as soon as possible.

ChuanhongLi commented 3 months ago

We are writing some examples covering how to produce the quantized model, how to build TensorRT-LLM, and how to use the w2a16 kernel in place of torch's bf16/fp16 kernels for inference. We will open-source them to this repository as soon as possible.

Thanks for the reply! A follow-up question: does true_quant.pth contain the quantized weights? Can I simply swap them into the corresponding weights of the original model and run inference with the transformers library to take a quick look at the model output, or does inference have to go through TensorRT-LLM?

ChuanhongLi commented 3 months ago

We quantized Llama-2-7b-hf with --wbits 2. What puzzles us is that the saved true_quant.pth is 6.6 GB; is that normal? We also printed the contents of true_quant.pth: some tensors have dtype=torch.int8 and others torch.float16.

run_llama.sh
python3 llama.py LLaMA/Llama-2-7b-hf/ c4 --true-sequential --act-order --new-eval \
--wbits 2 \
--group-size -1 \
--nsamples 128 \
--max-iter-num 4 \
--iters-before-round 200 \
--inner-iters-for-round 5 \
--blockwise-minimize-epoch 4 \
--round-fn gptq \
--blockwise-minimize-lr 1.0e-5 \
--train-LN \
--save

GuoYi0 commented 3 months ago

@ChuanhongLi In the saved true_quant.pth, the main weights are indeed int8, with values in {-2, -1, 0, 1}; the scale, zero point, etc. are fp16.
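A minimal inspection sketch for this (assuming true_quant.pth is a flat torch state dict of tensors; the key names are whatever the checkpoint actually contains):

import torch

# Load the quantized checkpoint on CPU and report what was saved.
state = torch.load("true_quant.pth", map_location="cpu")
for name, tensor in state.items():
    if tensor.dtype == torch.int8:
        # 2-bit weights stored one value per int8; values should lie in {-2, -1, 0, 1}.
        print(name, tensor.dtype, tensor.unique().tolist())
    else:
        # Scales, zero points, and any untouched tensors stay fp16.
        print(name, tensor.dtype, tuple(tensor.shape))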

ChuanhongLi commented 3 months ago

Does true_quant.pth contain the quantized weights? Can I simply swap them into the corresponding weights of the original model and run inference with the transformers library to take a quick look at the model output, or does inference have to go through TensorRT-LLM?

@GuoYi0 @gavinchen430 Is this feasible? We would like to see how the quantized model performs at inference.

gavinchen430 commented 3 months ago

true_quant.pth stores the int2 data as int8. At inference time we pack the int8 data into int2, so the actual GPU memory usage is reduced fourfold. Saving the model in 8 bits for now is mainly for debugging and alignment convenience and serves no other special purpose, so if the file feels too large, you can perform this packing step at export time instead.

true_quant.pth cannot be run with the transformers library, because the 2-bit weight-only kernels are currently supported mainly inside TensorRT-LLM. The PR https://github.com/bytedance/decoupleQ/pull/2/ shows how to simply replace torch's native GEMM with the trtllm kernel to run inference on the 2-bit model.

If you only want to verify accuracy, you can run fake_quant.pth with transformers. That model is the fp16 model obtained by dequantizing true_quant.pth, so it is numerically equivalent.
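For that accuracy check, a minimal sketch (assuming fake_quant.pth is an fp16 state dict whose keys match the original Llama-2-7b-hf checkpoint; the prompt, paths, and strict=False handling are illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "LLaMA/Llama-2-7b-hf/"  # original HF checkpoint, as in run_llama.sh
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16)

# Overwrite the original weights with the dequantized (fake-quant) fp16 weights.
fake_state = torch.load("fake_quant.pth", map_location="cpu")
model.load_state_dict(fake_state, strict=False)
model.cuda().eval()

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))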

chuangzhidan commented 3 months ago

We are writing some examples covering how to produce the quantized model, how to build TensorRT-LLM, and how to use the w2a16 kernel in place of torch's bf16/fp16 kernels for inference. We will open-source them to this repository as soon as possible.

pip3 install datasets==1.17.0

python llama.py /media/data/xgp/model/Unichat-llama3-Chinese-8B-28K c4 --true-sequential --act-order --new-eval \ --wbits 2 \ --group-size -1 \ --nsamples 128 \ --max-iter-num 4 \ --iters-before-round 200 \ --inner-iters-for-round 5 \ --blockwise-minimize-epoch 4 \ --round-fn gptq \ --blockwise-minimize-lr 1.0e-5 \ --train-LN \ --save

Running this fails with the errors below. What could be the cause? The file path?

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
usage: llama.py [-h] [--seed SEED] [--nsamples NSAMPLES] [--percdamp PERCDAMP] [--nearest] [--wbits {2,3,4,8,16}] [--group-size GROUP_SIZE] [--sym] [--save] [--new-eval] [--act-order] [--true-sequential] [--static-groups] [--quant-method {optq,moq,moq_sequential,}] [--loss-thr LOSS_THR] [--max-iter-num MAX_ITER_NUM] [--inner-iters-for-round INNER_ITERS_FOR_ROUND] [--iters-before-round ITERS_BEFORE_ROUND] [--lr LR] [--round-fn {gptq,train}] [--blockwise-minimize-lr BLOCKWISE_MINIMIZE_LR] [--blockwise-minimize-wd BLOCKWISE_MINIMIZE_WD] [--blockwise-minimize-epoch BLOCKWISE_MINIMIZE_EPOCH] [--train-LN] [--train-bias] model {wikitext2,ptb,c4}
llama.py: error: unrecognized arguments:
run_llama.sh: line 4: --wbits: command not found
run_llama.sh: line 5: --group-size: command not found
run_llama.sh: line 6: --nsamples: command not found
run_llama.sh: line 7: --max-iter-num: command not found
run_llama.sh: line 8: --iters-before-round: command not found
run_llama.sh: line 9: --inner-iters-for-round: command not found
run_llama.sh: line 10: --blockwise-minimize-epoch: command not found
run_llama.sh: line 11: --round-fn: command not found
run_llama.sh: line 12: --blockwise-minimize-lr: command not found
run_llama.sh: line 13: --train-LN: command not found
run_llama.sh: line 14: --save: command not found

GuoYi0 commented 3 months ago


Each line of the command needs to end with a space followed by a backslash \ (line continuation), as in the run_llama.sh shown above.

chuangzhidan commented 3 months ago


It was the path; that is solved now. But after quantizing for most of a day, it hit an error at the very end:

······
time cost for block minimization: 99.12145829200745
quant layer 31 done! time cost 298.48425579071045 (is this measured in minutes?)

The quantization duration is 2.627319086591403 (is this in hours?)
Downloading: 8.48kB [00:00, 9.63MB/s]
Downloading: 6.84kB [00:00, 15.4kB/s]
Downloading and preparing dataset wikitext/wikitext-2-raw-v1 (download: 4.50 MiB, generated: 12.90 MiB, post-processed: Unknown size, total: 17.40 MiB) to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126...
Downloading: 243B [00:00, 282kB/s]
Traceback (most recent call last):
  File "/workspace/decoupleQ/llama.py", line 427, in <module>
    dataloader, testloader = get_loaders(
  File "/workspace/decoupleQ/datautils.py", line 206, in get_loaders
    return get_wikitext2(nsamples, seed, seqlen, model)
  File "/workspace/decoupleQ/datautils.py", line 22, in get_wikitext2
    traindata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
  File "/opt/conda/lib/python3.10/site-packages/datasets/load.py", line 1694, in load_dataset
    builder_instance.download_and_prepare(
  File "/opt/conda/lib/python3.10/site-packages/datasets/builder.py", line 595, in download_and_prepare
    self._download_and_prepare(
  File "/opt/conda/lib/python3.10/site-packages/datasets/builder.py", line 665, in _download_and_prepare
    verify_checksums(
  File "/opt/conda/lib/python3.10/site-packages/datasets/utils/info_utils.py", line 40, in verify_checksums
    raise NonMatchingChecksumError(error_msg + str(bad_urls))
datasets.utils.info_utils.NonMatchingChecksumError: Checksums didn't match for dataset source files: ['https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip']

chuangzhidan commented 3 months ago

If the file feels too large, you can perform this packing step at export time instead.

true_quant.pth is half the size of the original model; after packing at export, would it be only a quarter of true_quant.pth, i.e. one eighth of the original model? How do I do the packing at export time? ^^

gavinchen430 commented 3 months ago

https://github.com/bytedance/decoupleQ/blob/6fe5a2196512eae2634e58cf0c5ff5dd2949e5fc/csrc/w2a16.cu#L155-L177

Take a look at the packing in that function: the model currently stores int2 data in the int8 dtype, and the pack combines 4 int8 values into one int8 via bit operations. If you pack offline, then at deployment time you need to remove lines L155-L177 of w2a16.cu, and you may have to adapt the code a bit further.
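For intuition, a rough Python sketch of such a pack/unpack (this is not the w2a16.cu code; the mapping of {-2, -1, 0, 1} to 2-bit codes and the bit order within each byte are assumptions and must match whatever the kernel actually expects):

import numpy as np

def pack_int2(weights_int8: np.ndarray) -> np.ndarray:
    # Pack 4 signed 2-bit values (stored one per int8) into a single byte.
    assert weights_int8.size % 4 == 0
    # Assumed encoding: shift {-2, -1, 0, 1} to {0, 1, 2, 3} so each value fits in 2 bits.
    codes = (weights_int8.astype(np.int16) + 2).astype(np.uint8).reshape(-1, 4)
    return (codes[:, 0]
            | (codes[:, 1] << 2)
            | (codes[:, 2] << 4)
            | (codes[:, 3] << 6)).astype(np.uint8)

def unpack_int2(packed: np.ndarray) -> np.ndarray:
    # Inverse of pack_int2: recover int8 values in {-2, -1, 0, 1}.
    codes = np.stack([(packed >> s) & 0x3 for s in (0, 2, 4, 6)], axis=1)
    return codes.astype(np.int8).reshape(-1) - 2

w = np.array([-2, -1, 0, 1, 1, 0, -1, -2], dtype=np.int8)
assert np.array_equal(unpack_int2(pack_int2(w)), w)  # round-trip check

Packing the weight tensors this way at export time would shrink them by roughly 4x relative to the int8 checkpoint, while scales and zero points remain fp16.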

GuoYi0 commented 3 months ago


That error occurred while downloading the dataset. You may need a proxy to be able to download it. Better to download the data in advance and then start quantization, so you don't spend several hours quantizing only to find that the dataset download fails at the end.
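A minimal pre-download sketch, assuming the machine (or its proxy) can reach the Hugging Face hub; the evaluation in llama.py can then reuse the local datasets cache instead of downloading at the end of the run:

from datasets import load_dataset

# Warm the local cache (~/.cache/huggingface/datasets by default) before starting
# the multi-hour quantization job, so the final evaluation does not stall on a download.
load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
load_dataset("wikitext", "wikitext-2-raw-v1", split="test")

Note that with the pinned datasets==1.17.0 the same stale S3 URL may still fail; running this up front at least surfaces the download problem before the hours-long quantization rather than after it.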

chuangzhidan commented 3 months ago

https://github.com/bytedance/decoupleQ/blob/6fe5a2196512eae2634e58cf0c5ff5dd2949e5fc/csrc/w2a16.cu#L155-L177


Thanks for the patient and prompt replies. One last thing: for load_dataset('wikitext', 'wikitext-2-raw-v1', split='train'), where should I download this data to? And once I remove the pack step, do I need to run any extra command, or do I leave everything else unchanged and just delete those dozen or so lines of code?

chuangzhidan commented 3 months ago


Could I ask whether the wikitext dataset on Hugging Face (https://huggingface.co/datasets/wikitext) is the one that needs to be downloaded, and to which path?

GuoYi0 commented 3 months ago


How about setting up a proxy? The dataset will then be cached under .cache, and you can keep that .cache directory afterwards.

chuangzhidan commented 3 months ago


The main question is where to download it manually. Searching on HF, I find several datasets with the same name but different contents, and I am not sure which one to download. Also, the dataset URL cannot be opened; for wikitext, for example, https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip just returns:

This XML file does not appear to have any style information associated with it. The document tree is shown below.

AccessDenied Access Denied 2K9HKR96E4D0SSKB /gNTJqI0M9Ku8PfFNgObmn3uequLpErlZFar/YE++q4ClY4Q4vuf9+rWlsmVatx9/bLbZZ/ahf4=

It will not open whether or not I use a VPN. I am not sure how to set up a proxy on the server; I only know how to download manually on my own computer. My current download layout is shown in the screenshot below (the wiki folder is empty because of the AccessDenied error).

[screenshot of the local download directory]