qiuxuezhe123 opened this issue 4 months ago (status: Open)
Can you paste the output of running lmdeploy check_env?
Below is the full output from running AWQ quantization in the lmdeploy environment:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|████████████████████████████████████████████████| 4/4 [00:02<00:00, 1.84it/s]
Move model.embed_tokens to GPU.
Move model.layers.0 to CPU.
Move model.layers.1 to CPU.
Move model.layers.2 to CPU.
Move model.layers.3 to CPU.
Move model.layers.4 to CPU.
Move model.layers.5 to CPU.
Move model.layers.6 to CPU.
Move model.layers.7 to CPU.
Move model.layers.8 to CPU.
Move model.layers.9 to CPU.
Move model.layers.10 to CPU.
Move model.layers.11 to CPU.
Move model.layers.12 to CPU.
Move model.layers.13 to CPU.
Move model.layers.14 to CPU.
Move model.layers.15 to CPU.
Move model.layers.16 to CPU.
Move model.layers.17 to CPU.
Move model.layers.18 to CPU.
Move model.layers.19 to CPU.
Move model.layers.20 to CPU.
Move model.layers.21 to CPU.
Move model.layers.22 to CPU.
Move model.layers.23 to CPU.
Move model.layers.24 to CPU.
Move model.layers.25 to CPU.
Move model.layers.26 to CPU.
Move model.layers.27 to CPU.
Move model.norm to GPU.
Move lm_head to CPU.
Loading calibrate dataset ...
Found cached dataset json (/root/.cache/huggingface/datasets/json/c4-3f6237ecfc2df013/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
Found cached dataset json (/root/.cache/huggingface/datasets/json/c4-11668a7e9b799711/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
model.layers.0, samples: 128, max gpu memory: 6.73 GB
model.layers.1, samples: 128, max gpu memory: 8.48 GB
model.layers.2, samples: 128, max gpu memory: 8.48 GB
model.layers.3, samples: 128, max gpu memory: 8.48 GB
model.layers.4, samples: 128, max gpu memory: 8.48 GB
model.layers.5, samples: 128, max gpu memory: 8.48 GB
model.layers.6, samples: 128, max gpu memory: 8.48 GB
model.layers.7, samples: 128, max gpu memory: 8.48 GB
model.layers.8, samples: 128, max gpu memory: 8.48 GB
model.layers.9, samples: 128, max gpu memory: 8.48 GB
model.layers.10, samples: 128, max gpu memory: 8.48 GB
model.layers.11, samples: 128, max gpu memory: 8.48 GB
model.layers.12, samples: 128, max gpu memory: 8.48 GB
model.layers.13, samples: 128, max gpu memory: 8.48 GB
model.layers.14, samples: 128, max gpu memory: 8.48 GB
model.layers.15, samples: 128, max gpu memory: 8.48 GB
model.layers.16, samples: 128, max gpu memory: 8.48 GB
model.layers.17, samples: 128, max gpu memory: 8.48 GB
model.layers.18, samples: 128, max gpu memory: 8.48 GB
model.layers.19, samples: 128, max gpu memory: 8.48 GB
model.layers.20, samples: 128, max gpu memory: 8.48 GB
model.layers.21, samples: 128, max gpu memory: 8.48 GB
model.layers.22, samples: 128, max gpu memory: 8.48 GB
model.layers.23, samples: 128, max gpu memory: 8.48 GB
model.layers.24, samples: 128, max gpu memory: 8.48 GB
model.layers.25, samples: 128, max gpu memory: 8.48 GB
model.layers.26, samples: 128, max gpu memory: 8.48 GB
model.layers.27, samples: 128, max gpu memory: 8.48 GB
model.layers.0 smooth weight done.
model.layers.1 smooth weight done.
model.layers.2 smooth weight done.
model.layers.3 smooth weight done.
model.layers.4 smooth weight done.
model.layers.5 smooth weight done.
model.layers.6 smooth weight done.
model.layers.7 smooth weight done.
model.layers.8 smooth weight done.
model.layers.9 smooth weight done.
model.layers.10 smooth weight done.
model.layers.11 smooth weight done.
model.layers.12 smooth weight done.
model.layers.13 smooth weight done.
model.layers.14 smooth weight done.
model.layers.15 smooth weight done.
model.layers.16 smooth weight done.
model.layers.17 smooth weight done.
model.layers.18 smooth weight done.
model.layers.19 smooth weight done.
model.layers.20 smooth weight done.
model.layers.21 smooth weight done.
model.layers.22 smooth weight done.
model.layers.23 smooth weight done.
model.layers.24 smooth weight done.
model.layers.25 smooth weight done.
model.layers.26 smooth weight done.
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/lmdeploy/main.py", line 5, in
Not that. Please run the command "lmdeploy check_env"; it prints out the environment information. We'd like to see which environment reproduces this problem.
Related to https://github.com/InternLM/lmdeploy/issues/1786, env is listed
Running lmdeploy check_env itself throws an error; the error message is as follows:
Traceback (most recent call last):
File "/opt/conda/bin/lmdeploy", line 8, in
It may be related to the torch version. I also ran into the NaN problem with torch 2.1.0 + cu118, but things are normal with torch 2.1.2 + cu12.
Could you create a CUDA 12 environment and give it a try?
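For reference, the torch / CUDA build a given environment is actually using can be confirmed with a few standard PyTorch calls (nothing lmdeploy-specific, just a quick sanity check):

import torch
print(torch.__version__)               # torch wheel version, e.g. 2.1.2
print(torch.version.cuda)              # CUDA version the wheel was built against, e.g. 11.8 or 12.1
print(torch.cuda.is_available())       # True if a GPU is usable
print(torch.cuda.get_device_name(0))   # name of the first visible GPU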
It is indeed related to the torch version. In the same environment on my side, downgrading from torch 2.1.2 + cu118 to torch 2.1.0 + cu118 also produces the NaN. The torch version inside the released docker image probably needs to be updated.
OK, thanks. I'll try it in a CUDA 12 environment.
I ran into the same error with cuda 12.0 and torch 2.3.1. Is there a better solution available now?
Please try this draft PR for qwen models. https://github.com/InternLM/lmdeploy/pull/1844
@AllentDan tried that draft and failed, any new methods lately?
No. The error is only raised occasionally, depending on the environment and the calibration conditions.
Checklist
Describe the bug
Using lmdeploy lite auto_awq to apply AWQ quantization to an SFT'd qwen2-7b model fails with assert torch.isnan(p).sum() == 0.
Reproduction
lmdeploy lite auto_awq qwen2-sft-checkpoint-1506-merged --calib-dataset 'c4' --calib-samples 128 --calib-seqlen 4096 --work-dir qwen2_7b_qg_2_epoch_awq
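As a purely diagnostic sketch (not part of the lmdeploy workflow), the merged SFT checkpoint can also be scanned for NaN/Inf weights before quantization with plain torch/transformers calls; the path below is just the one from the command above, and loading the full model needs enough CPU RAM:

import torch
from transformers import AutoModelForCausalLM

# Path taken from the reproduction command; adjust to your own merged checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "qwen2-sft-checkpoint-1506-merged", torch_dtype=torch.float16)

bad = []
for name, p in model.named_parameters():
    nan_count = int(torch.isnan(p).sum())
    inf_count = int(torch.isinf(p).sum())
    if nan_count or inf_count:
        bad.append((name, nan_count, inf_count))

print(bad if bad else "no NaN/Inf found in the checkpoint weights")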
Environment
Error traceback