Closed: Serious-H closed this issue 2 months ago.
Same problem here. Have you solved it?
Are you on an AMD platform or an Intel platform? I'm on AMD and hit the same problem as you. Disabling the IOMMU made no difference.
This chip does not currently support copy_d2d. You can try a training-series chip, or consider changing device_map="auto" to a specific card, for example device_map="npu:0".
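As a rough sketch of the second suggestion, the snippet below pins the whole model to one card instead of letting device_map="auto" spread it across cards. The helper names `single_npu_device_map` and `load_on_single_npu` are made up for illustration. It assumes `torch`, `torch_npu` and `transformers` are installed and that the model is loaded through `AutoModelForCausalLM`; the original post does not show its loading code, so `torch_dtype` and `trust_remote_code` are also assumptions.

```python
def single_npu_device_map(index: int = 0) -> str:
    """Build a device_map string such as "npu:0" for a single card."""
    if index < 0:
        raise ValueError("device index must be non-negative")
    return f"npu:{index}"


def load_on_single_npu(model_path: str, index: int = 0):
    """Load a causal LM entirely onto one NPU (hypothetical helper)."""
    # Imported lazily so the helper above stays usable without these packages.
    import torch
    import torch_npu  # noqa: F401  (registers the "npu" device with PyTorch)
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,  # assumption: half precision
        trust_remote_code=True,  # assumption: the checkpoint ships custom model code
        device_map=single_npu_device_map(index),  # instead of device_map="auto"
    )
```

With every layer on one card there are no card-to-card copies during the forward pass, which is the operation the reply above says this chip does not support.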
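This looks like a communication operator timeout. The Ascend 310P cannot run inference directly through transformers on torch npu like this; you need to use MindIE. @Lidarker @Ycpljl For the error above, restricting the run to a single card avoids it: export ASCEND_RT_VISIBLE_DEVICES=0. However, inference then takes tens of minutes, which makes it unusable in practice.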
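The same single-card restriction can be applied from inside a Python script rather than the shell. This is a minimal sketch: `restrict_to_single_npu` is a hypothetical helper, and it assumes the variable is read when the Ascend runtime initialises, so it has to be set before `torch_npu` is imported.

```python
import os


def restrict_to_single_npu(device_id: int = 0) -> str:
    """Expose only one Ascend card to this process via ASCEND_RT_VISIBLE_DEVICES."""
    value = str(device_id)
    os.environ["ASCEND_RT_VISIBLE_DEVICES"] = value
    return value


# Call this at the very top of the script, before torch_npu is imported.
restrict_to_single_npu(0)
# import torch_npu  # the process would now see one card, addressed as "npu:0"
```

This only changes which cards are visible; it does nothing about the slow inference reported above.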
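MindIE has only just added GLM4 support for the Atlas300I Pro, but inference fails with a RuntimeError. The weight quantization code fails as well, and it offers no way to specify which card to use.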
Weight quantization code attached:

```python
# Copyright Huawei Technologies Co., Ltd. 2023-2024. All rights reserved.
from msmodelslim.pytorch.llm_ptq.llm_ptq_tools import Calibrator, QuantConfig
from atb_llm.models.chatglm.config_chatglm import ChatglmConfig
from examples.models.chatglm.v2_6b.quant_utils import (
    get_model_and_tokenizer, get_calib_dataset, read_dataset,
)
from examples.convert.convert_utils import copy_tokenizer_files, modify_config
from examples.convert.model_slim.quantifier import parse_arguments

NPU = "npu"


def main():
    args = parse_arguments()
    fp16_path = args.model_path  # path to the original floating-point model
    model, tokenizer = get_model_and_tokenizer(fp16_path, True)

    quant_config = QuantConfig(
        a_bit=8,
        w_bit=8,
        disable_names=None,
        dev_type=NPU,
        act_method=3,
        pr=1.0,
        w_sym=True,
        mm_tensor=False,
        use_kvcache_quant=args.use_kvcache_quant
    )

    calib_set = read_dataset(args.calib_file)
    dataset_calib = get_calib_dataset(tokenizer, calib_set, NPU)
    calibrator = Calibrator(model, quant_config, calib_data=dataset_calib, disable_level='L25')
    calibrator.run()  # run PTQ calibration
    calibrator.save(args.save_directory, save_type=["safe_tensor"])  # "safe_tensor" writes weights in safetensors format

    copy_tokenizer_files(fp16_path, args.save_directory)
    config = ChatglmConfig.from_pretrained(fp16_path)
    modify_config(fp16_path, args.save_directory, config.torch_dtype, 'w8a8', args)


if __name__ == '__main__':
    main()
```
===================== Command used to run the script:
python quant_glm4_w8a8.py --model_path /root/workspace/GLM-4/basic_demo/THUDM/glm4-9b --save_directory ./glm4-9b_w8a8 --calib_file ./CEval/val/Other/civil_servant.jsonl --device_type npu
This shows an initialization error. Please check the plog logs and fix it according to the cause they report.
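One way to follow that advice is to pull the error-level lines out of the plog files first. The sketch below is only an illustration: `collect_error_lines` is a hypothetical helper, the `[ERROR]` marker and the `plog-*.log` name pattern are assumptions based on the file name quoted in this issue, and the directory in the usage comment should be adjusted to wherever your plog files are written.

```python
from pathlib import Path


def collect_error_lines(log_dir, marker="[ERROR]", limit=20):
    """Return up to `limit` (file name, line) pairs whose line contains `marker`."""
    hits = []
    for log_file in sorted(Path(log_dir).expanduser().glob("plog-*.log")):
        with open(log_file, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                if marker in line:
                    hits.append((log_file.name, line.rstrip("\n")))
                    if len(hits) >= limit:
                        return hits
    return hits


# Example usage (the path is an assumption):
# for name, line in collect_error_lines("~/ascend/log/debug/plog"):
#     print(f"{name}: {line}")
```

The earliest error line is usually the most useful one to quote when asking for help, since later entries tend to be consequences of the first failure.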
1. Symptom (with error log context):
RuntimeError: copy_d2d:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:104 NPU function error: c10_npu::acl::AclrtSynchronizeStreamWithTimeout(copy_stream), error code is 507013
[ERROR] 2024-09-04-02:09:03 (PID:2526494, Device:1, RankID:-1) ERR00100 PTA call acl api failed
[Error]: System Direct Memory Access (DMA) hardware execution error. Rectify the fault based on the error information in the ascend log.
EI9999: Inner Error!
The error from device(1), serial number is 13. there is a sdma error, sdma channel is 0, the channel exist the following problems: The SMMU returns a Terminate error during page table translation.. the value of CQE status is 2. the description of CQE status: When the SQE translates a page table, the SMMU returns a Terminate error.it's config include: setting1=0xc000080880e0000, setting2=0xff009000ff004c, setting3=0, sq base addr=0x800d00801003d000[FUNC:ProcessSdmaErrorInfo][FILE:device_error_proc.cc][LINE:704]
EI9999: 2024-09-04-02:09:03.196.977 Memory async copy failed, device_id=1, stream_id=3, task_id=703, flip_num=0, copy_type=2, memcpy_type=0, copy_data_type=0, length=40960[FUNC:GetError][FILE:stream.cc][LINE:1082]
TraceBack (most recent call last):
rtStreamSynchronizeWithTimeout execute failed, reason=[sdma copy error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
synchronize stream failed, runtime result = 507013[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
DEVICE[1] PID[2526494]:
EXCEPTION STREAM:
Exception info:TGID=2526494, model id=65535, stream id=3, stream phase=3
Message info[0]:RTS_HWTS: hwts sdma error, slot_id=29, stream_id=3
Other info[0]:time=2024-09-04-02:08:53.201.667, function=int_process_hwts_sdma_error, line=2070, error code=0x20b
[W compiler_depend.ts:409] Warning: NPU warning, error code is 507013[Error]:
[Error]: System Direct Memory Access (DMA) hardware execution error. Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
rtDeviceSynchronize execute failed, reason=[sdma copy error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: 2024-09-04-02:09:03.218.111 wait for compute device to finish failed, runtime result = 507013.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeUsedDevices)
[W compiler_depend.ts:392] Warning: NPU warning, error code is 507013[Error]:
[Error]: System Direct Memory Access (DMA) hardware execution error. Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
rtDeviceSynchronize execute failed, reason=[sdma copy error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: 2024-09-04-02:09:03.220.229 wait for compute device to finish failed, runtime result = 507013.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeDevice)
[W compiler_depend.ts:392] Warning: NPU warning, error code is 507013[Error]:
[Error]: System Direct Memory Access (DMA) hardware execution error. Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
rtDeviceSynchronize execute failed, reason=[sdma copy error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: 2024-09-04-02:09:03.222.326 wait for compute device to finish failed, runtime result = 507013.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeDevice)
[W compiler_depend.ts:392] Warning: NPU warning, error code is 507013[Error]:
[Error]: System Direct Memory Access (DMA) hardware execution error. Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
rtDeviceSynchronize execute failed, reason=[sdma copy error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: 2024-09-04-02:09:03.224.399 wait for compute device to finish failed, runtime result = 507013.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeDevice)
[W compiler_depend.ts:392] Warning: NPU warning, error code is 507013[Error]:
[Error]: System Direct Memory Access (DMA) hardware execution error. Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
rtDeviceSynchronize execute failed, reason=[sdma copy error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: 2024-09-04-02:09:03.226.481 wait for compute device to finish failed, runtime result = 507013.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeDevice)
[W compiler_depend.ts:392] Warning: NPU warning, error code is 507013[Error]:
[Error]: System Direct Memory Access (DMA) hardware execution error. Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
rtDeviceSynchronize execute failed, reason=[sdma copy error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: 2024-09-04-02:09:03.228.566 wait for compute device to finish failed, runtime result = 507013.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeDevice)
[W compiler_depend.ts:392] Warning: NPU warning, error code is 507013[Error]:
[Error]: System Direct Memory Access (DMA) hardware execution error. Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
rtDeviceSynchronize execute failed, reason=[sdma copy error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: 2024-09-04-02:09:03.230.811 wait for compute device to finish failed, runtime result = 507013.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeDevice)
[W compiler_depend.ts:392] Warning: NPU warning, error code is 507013[Error]:
[Error]: System Direct Memory Access (DMA) hardware execution error. Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
rtDeviceSynchronize execute failed, reason=[sdma copy error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: 2024-09-04-02:09:03.233.321 wait for compute device to finish failed, runtime result = 507013.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeDevice)
[W compiler_depend.ts:392] Warning: NPU warning, error code is 507013[Error]:
[Error]: System Direct Memory Access (DMA) hardware execution error. Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
rtDeviceSynchronize execute failed, reason=[sdma copy error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: 2024-09-04-02:09:03.235.839 wait for compute device to finish failed, runtime result = 507013.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeDevice)
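The output above repeats the same failure many times, so it can help to collapse it into a short summary before reading further. The snippet below is a hedged sketch: `summarize_npu_errors` is a hypothetical helper, and its regular expressions are written against the wording of the messages quoted in this issue, not against any documented log format.

```python
import re
from collections import Counter

# Patterns follow the wording of the traceback quoted in this issue.
ERROR_CODE = re.compile(r"error code is (\d+)")
DEVICE_ID = re.compile(r"Device:(\d+)")
REASON = re.compile(r"reason=\[([^\]]+)\]")


def summarize_npu_errors(text: str) -> dict:
    """Count error codes and failure reasons, and list the devices mentioned."""
    return {
        "error_codes": Counter(ERROR_CODE.findall(text)),
        "devices": sorted(set(DEVICE_ID.findall(text))),
        "reasons": Counter(REASON.findall(text)),
    }
```

For the output in this issue, such a summary would show a single error code (507013) and a single reason (sdma copy error), which suggests the later warnings repeat the first failed copy rather than report new faults.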
2. Software versions:
-- CANN version (e.g., CANN 3.0.x, 5.x.x): CANN 8.0.RC2
-- Tensorflow/Pytorch/MindSpore version: pytorch 2.1.0, torch_npu 2.1.0.post6
-- Python version (e.g., Python 3.7.5): Python 3.10.14
-- OS version (e.g., Ubuntu 18.04): Ubuntu 20.04.5 LTS (Focal Fossa)
3. Test steps:
4. Log information: ascend/log/debug/plog/plog-2514126_20240904013504490.log
See the attached log: plog-2514126_20240904013504490.log