InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0
4.63k stars 426 forks

[Bug] AWQ quantization of a fine-tuned qwen2 model fails #1836

Open qiuxuezhe123 opened 4 months ago

qiuxuezhe123 commented 4 months ago

Describe the bug

Running `lmdeploy lite auto_awq` to AWQ-quantize an SFT-ed qwen2-7b model fails with `assert torch.isnan(p).sum() == 0`.

Reproduction

lmdeploy lite auto_awq qwen2-sft-checkpoint-1506-merged --calib-dataset 'c4' --calib-samples 128 --calib-seqlen 4096 --work-dir qwen2_7b_qg_2_epoch_awq
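Before quantizing, it can be worth verifying that the merged SFT checkpoint itself is free of NaN/Inf weights, since a bad merge would also trip the assertion later. A minimal pure-Python sketch of that diagnostic (`find_bad_params` is an illustrative helper, not part of lmdeploy; with real checkpoints you would iterate torch tensors instead of float lists):

```python
import math

def find_bad_params(state_dict):
    """Return names of parameters containing NaN or Inf values.

    Pure-Python sketch over {name: list-of-floats}; with torch tensors you
    would use torch.isnan(p).any() / torch.isinf(p).any() instead.
    """
    bad = []
    for name, values in state_dict.items():
        if any(math.isnan(v) or math.isinf(v) for v in values):
            bad.append(name)
    return bad

# Synthetic example; `sd` stands in for the merged checkpoint's weights.
sd = {"ok": [1.0, 2.0], "broken": [1.0, float("nan")]}
print(find_bad_params(sd))  # -> ['broken']
```

If this reports any tensors for the merged model, the problem is upstream of lmdeploy's quantization step.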

Environment

lmdeploy==0.4.1

Error traceback

Traceback (most recent call last):
  File "/opt/conda/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/cli/lite.py", line 131, in auto_awq
    auto_awq(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/apis/auto_awq.py", line 68, in auto_awq
    smooth_layers(layers, fc2fcs, norm2fcs, act_scales, w_group_size, device)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/quantization/awq.py", line 242, in smooth_layers
    smooth_ln_fcs(ln, fcs, a_scales[a_name], group_size)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/quantization/awq.py", line 118, in smooth_ln_fcs
    assert torch.isnan(p).sum() == 0
AssertionError
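For context, `smooth_ln_fcs` applies a SmoothQuant-style rebalancing: the norm weight is divided by per-channel scales derived from activation statistics, and the following linear weights are multiplied by the same scales. A simplified pure-Python sketch (not lmdeploy's actual code; `alpha` and `eps` are illustrative) shows why an unclamped zero activation scale would produce the Inf/NaN that trips the assert:

```python
import math

def smooth_ln_fc(ln_w, fc_w, act_scales, alpha=0.5, eps=1e-5):
    """Divide norm weights and multiply linear weights by per-channel scales.

    Clamping act_scales with `eps` is what keeps a dead channel (scale 0.0)
    from turning into a division blow-up, the failure mode in this issue.
    """
    scales = [max(s, eps) ** alpha for s in act_scales]
    new_ln = [w / s for w, s in zip(ln_w, scales)]
    new_fc = [[w * s for w, s in zip(row, scales)] for row in fc_w]
    assert not any(math.isnan(w) for w in new_ln)  # same check as awq.py:118
    return new_ln, new_fc

ln, fc = smooth_ln_fc([1.0, 2.0], [[1.0, 1.0]], [4.0, 0.0])
print(ln[0])  # -> 0.5  (1.0 / 4.0**0.5)
```

The real failure reported here is environment-dependent (see the torch-version discussion below in the thread), but the mechanism is the same: some channel's scale degenerates during calibration.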
lvhan028 commented 4 months ago

Can you paste the output of running lmdeploy check_env?

qiuxuezhe123 commented 4 months ago

Can you paste the output of running lmdeploy check_env?

Below is the complete output of running the AWQ quantization in the lmdeploy environment:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|████████████████████████████████████████████████| 4/4 [00:02<00:00, 1.84it/s]
Move model.embed_tokens to GPU.
Move model.layers.0 to CPU.
Move model.layers.1 to CPU.
Move model.layers.2 to CPU.
Move model.layers.3 to CPU.
Move model.layers.4 to CPU.
Move model.layers.5 to CPU.
Move model.layers.6 to CPU.
Move model.layers.7 to CPU.
Move model.layers.8 to CPU.
Move model.layers.9 to CPU.
Move model.layers.10 to CPU.
Move model.layers.11 to CPU.
Move model.layers.12 to CPU.
Move model.layers.13 to CPU.
Move model.layers.14 to CPU.
Move model.layers.15 to CPU.
Move model.layers.16 to CPU.
Move model.layers.17 to CPU.
Move model.layers.18 to CPU.
Move model.layers.19 to CPU.
Move model.layers.20 to CPU.
Move model.layers.21 to CPU.
Move model.layers.22 to CPU.
Move model.layers.23 to CPU.
Move model.layers.24 to CPU.
Move model.layers.25 to CPU.
Move model.layers.26 to CPU.
Move model.layers.27 to CPU.
Move model.norm to GPU.
Move lm_head to CPU.
Loading calibrate dataset ...
Found cached dataset json (/root/.cache/huggingface/datasets/json/c4-3f6237ecfc2df013/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
Found cached dataset json (/root/.cache/huggingface/datasets/json/c4-11668a7e9b799711/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
model.layers.0, samples: 128, max gpu memory: 6.73 GB
model.layers.1, samples: 128, max gpu memory: 8.48 GB
model.layers.2, samples: 128, max gpu memory: 8.48 GB
model.layers.3, samples: 128, max gpu memory: 8.48 GB
model.layers.4, samples: 128, max gpu memory: 8.48 GB
model.layers.5, samples: 128, max gpu memory: 8.48 GB
model.layers.6, samples: 128, max gpu memory: 8.48 GB
model.layers.7, samples: 128, max gpu memory: 8.48 GB
model.layers.8, samples: 128, max gpu memory: 8.48 GB
model.layers.9, samples: 128, max gpu memory: 8.48 GB
model.layers.10, samples: 128, max gpu memory: 8.48 GB
model.layers.11, samples: 128, max gpu memory: 8.48 GB
model.layers.12, samples: 128, max gpu memory: 8.48 GB
model.layers.13, samples: 128, max gpu memory: 8.48 GB
model.layers.14, samples: 128, max gpu memory: 8.48 GB
model.layers.15, samples: 128, max gpu memory: 8.48 GB
model.layers.16, samples: 128, max gpu memory: 8.48 GB
model.layers.17, samples: 128, max gpu memory: 8.48 GB
model.layers.18, samples: 128, max gpu memory: 8.48 GB
model.layers.19, samples: 128, max gpu memory: 8.48 GB
model.layers.20, samples: 128, max gpu memory: 8.48 GB
model.layers.21, samples: 128, max gpu memory: 8.48 GB
model.layers.22, samples: 128, max gpu memory: 8.48 GB
model.layers.23, samples: 128, max gpu memory: 8.48 GB
model.layers.24, samples: 128, max gpu memory: 8.48 GB
model.layers.25, samples: 128, max gpu memory: 8.48 GB
model.layers.26, samples: 128, max gpu memory: 8.48 GB
model.layers.27, samples: 128, max gpu memory: 8.48 GB
model.layers.0 smooth weight done.
model.layers.1 smooth weight done.
model.layers.2 smooth weight done.
model.layers.3 smooth weight done.
model.layers.4 smooth weight done.
model.layers.5 smooth weight done.
model.layers.6 smooth weight done.
model.layers.7 smooth weight done.
model.layers.8 smooth weight done.
model.layers.9 smooth weight done.
model.layers.10 smooth weight done.
model.layers.11 smooth weight done.
model.layers.12 smooth weight done.
model.layers.13 smooth weight done.
model.layers.14 smooth weight done.
model.layers.15 smooth weight done.
model.layers.16 smooth weight done.
model.layers.17 smooth weight done.
model.layers.18 smooth weight done.
model.layers.19 smooth weight done.
model.layers.20 smooth weight done.
model.layers.21 smooth weight done.
model.layers.22 smooth weight done.
model.layers.23 smooth weight done.
model.layers.24 smooth weight done.
model.layers.25 smooth weight done.
model.layers.26 smooth weight done.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/__main__.py", line 5, in <module>
    run()
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/cli/lite.py", line 137, in auto_awq
    auto_awq(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/apis/auto_awq.py", line 124, in auto_awq
    smooth_layers(layers, fc2fcs, norm2fcs, act_scales, w_group_size,
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/quantization/awq.py", line 259, in smooth_layers
    smooth_ln_fcs(ln, fcs, a_scales[a_name], group_size)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/quantization/awq.py", line 118, in smooth_ln_fcs
    assert torch.isnan(p).sum() == 0
AssertionError

lvhan028 commented 4 months ago

Not that. Please run the command "lmdeploy check_env"; it prints out the environment information. We'd like to see which environment reproduces this issue.

serser commented 4 months ago

Related to https://github.com/InternLM/lmdeploy/issues/1786, env is listed

qiuxuezhe123 commented 4 months ago

Not that. Please run the command "lmdeploy check_env"; it prints out the environment information. We'd like to see which environment reproduces this issue.

Running lmdeploy check_env itself fails. The error message is as follows:

Traceback (most recent call last):
  File "/opt/conda/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/cli/cli.py", line 192, in check_env
    env_info = collect_env()
  File "/opt/conda/lib/python3.8/site-packages/mmengine/utils/dl_utils/collect_env.py", line 156, in collect_env
    import torchvision
  File "/opt/conda/lib/python3.8/site-packages/torchvision/__init__.py", line 6, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils
  File "/opt/conda/lib/python3.8/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
    def meta_nms(dets, scores, iou_threshold):
  File "/opt/conda/lib/python3.8/site-packages/torch/_custom_ops.py", line 253, in inner
    custom_op = _find_custom_op(qualname, also_check_torch_library=True)
  File "/opt/conda/lib/python3.8/site-packages/torch/_custom_op/impl.py", line 1076, in _find_custom_op
    overload = get_op(qualname)
  File "/opt/conda/lib/python3.8/site-packages/torch/_custom_op/impl.py", line 1062, in get_op
    error_not_found()
  File "/opt/conda/lib/python3.8/site-packages/torch/_custom_op/impl.py", line 1052, in error_not_found
    raise ValueError(
ValueError: Could not find the operator torchvision::nms. Please make sure you have already registered the operator and (if registered from C++) loaded it via torch.ops.load_library.
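The `torchvision::nms` failure above is the classic symptom of a torch/torchvision version mismatch: each torchvision release is built against one specific torch release. A tiny illustrative checker, seeded with a few pairings from the official compatibility matrix (verify against the matrix before relying on these entries):

```python
# A few known torch -> torchvision pairings (illustrative subset; consult
# the official torchvision compatibility matrix for the full table).
COMPAT = {
    "2.1.0": "0.16.0",
    "2.1.2": "0.16.2",
    "2.3.1": "0.18.1",
}

def compatible(torch_ver, tv_ver):
    """True if the installed torchvision matches the torch build."""
    base = torch_ver.split("+")[0]  # drop local tags like "+cu118"
    return COMPAT.get(base) == tv_ver.split("+")[0]

print(compatible("2.1.2+cu118", "0.16.2+cu118"))  # -> True
print(compatible("2.1.0+cu118", "0.16.2+cu118"))  # -> False
```

Reinstalling torchvision to match the installed torch (or vice versa) typically restores `lmdeploy check_env`.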

lvhan028 commented 4 months ago

Related to #1786, env is listed

It may be related to the torch version. I also hit the NaN problem with torch 2.1.0 + cu118, but everything works fine with torch 2.1.2 + cu12.

Could you try setting up a CUDA 12 environment?

AllentDan commented 4 months ago

Related to #1786, env is listed

It may be related to the torch version. I also hit the NaN problem with torch 2.1.0 + cu118, but everything works fine with torch 2.1.2 + cu12.

Could you try setting up a CUDA 12 environment?

It is indeed related to the torch version. In the same environment on my side, downgrading from torch 2.1.2 + cu118 to torch 2.1.0 + cu118 produces NaN. The torch version in the released docker image may need to be updated.
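Since 2.1.0 reportedly produces NaN while 2.1.2 works, a quick guard before launching a long calibration run could compare the installed version against the known-good floor (a sketch; the 2.1.2 floor comes from this thread's reports, not from official documentation):

```python
def version_tuple(v):
    """'2.1.0+cu118' -> (2, 1, 0); ignores local build tags like +cu118."""
    return tuple(int(x) for x in v.split("+")[0].split(".")[:3])

def needs_upgrade(installed, minimum="2.1.2"):
    """True if the installed torch version is below the known-good floor."""
    return version_tuple(installed) < version_tuple(minimum)

print(needs_upgrade("2.1.0+cu118"))  # -> True
print(needs_upgrade("2.3.1"))        # -> False
```

In practice you would feed it `torch.__version__` and refuse to start quantization (or just warn) when it returns True.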

qiuxuezhe123 commented 4 months ago

Related to #1786, env is listed

It may be related to the torch version. I also hit the NaN problem with torch 2.1.0 + cu118, but everything works fine with torch 2.1.2 + cu12.

Could you try setting up a CUDA 12 environment?

OK, thanks. I'll try a CUDA 12 environment.

522315428 commented 2 months ago

I hit the same error with cuda 12.0 and torch 2.3.1. Is there a better solution now?

AllentDan commented 2 months ago

Please try this draft PR for qwen models. https://github.com/InternLM/lmdeploy/pull/1844

Volta-lemon commented 2 months ago

@AllentDan I tried that draft PR and it still failed. Any new approaches lately?

AllentDan commented 2 months ago

No. The error is raised only occasionally, depending on the environment and the calibration conditions.