PaddlePaddle / Paddle

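PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (『飞桨』 PaddlePaddle core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)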
http://www.paddlepaddle.org/
Apache License 2.0

Error when training panoptic-deeplab from PaddleSeg #56464

Closed Newcomer-CL closed 1 year ago

Newcomer-CL commented 1 year ago

Please ask your question

Environment: CPU: Phytium D2000; accelerator card: Kunlunxin R200; OS: Kylin V10; paddle-xpu 2.5; paddleseg 2.7.0

Command: python3 -m paddle.distributed.launch train.py --config configs/panoptic_deeplab/panoptic_deeplab_resnet50_os32_cityscapes_1025x513_bs8_90k_lr00005.yml --do_eval --use_vdl --save_interval 5000 --save_dir output --batch_size 4

Error:

/home/zkjr/.local/lib/python3.9/site-packages/paddleseg/transforms/functional.py:105: RuntimeWarning: invalid value encountered in cast
  im[:, :, 0] = im[:, :, 0] + hue_delta
/home/zkjr/.local/lib/python3.9/site-packages/paddle/nn/layer/norm.py:777: UserWarning: When training, we now always track global mean and variance.
  warnings.warn(
Traceback (most recent call last):
  File "/home/zkjr/fangtian/PaddleSeg-2.7.0/contrib/PanopticDeepLab/train.py", line 176, in <module>
    main(args)
  File "/home/zkjr/fangtian/PaddleSeg-2.7.0/contrib/PanopticDeepLab/train.py", line 154, in main
    train(
  File "/home/zkjr/fangtian/PaddleSeg-2.7.0/contrib/PanopticDeepLab/core/train.py", line 174, in train
    loss_list = loss_computation(
  File "/home/zkjr/fangtian/PaddleSeg-2.7.0/contrib/PanopticDeepLab/core/train.py", line 39, in loss_computation
    semantic_loss = losses['types'][0](logits_list[0], semantic,
  File "/home/zkjr/.local/lib/python3.9/site-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/zkjr/.local/lib/python3.9/site-packages/paddleseg/models/losses/cross_entropy_loss.py", line 88, in forward
    return self._post_process_loss(logit, label, semantic_weights, loss)
  File "/home/zkjr/.local/lib/python3.9/site-packages/paddleseg/models/losses/cross_entropy_loss.py", line 132, in _post_process_loss
    loss, indices = paddle.topk(loss, top_k_pixels)
  File "/home/zkjr/.local/lib/python3.9/site-packages/paddle/tensor/search.py", line 913, in topk
    out, indices = _C_ops.topk(x, k, axis, largest, sorted)
OSError: (External) sorted_topk XDNN Error, XDNN_INVALID_PARAM (at /workspace/Paddle/paddle/phi/kernels/xpu/top_k_kernel.cc:76)

qili93 commented 1 year ago

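Hello, I checked with the Kunlun team. The cause is that "the topk size is too large, and the XDNN API does not support it yet". You can try setting the following environment variable: export XPU_BLACK_LIST=topk. This adds the topk operator to the XPU blacklist so that it falls back to running on the CPU, which should work around the problem.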
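To get a feel for how large the `k` passed to `paddle.topk` can become here, the following is a rough back-of-the-envelope sketch. It assumes that `top_k_pixels` is a fixed fraction of all per-pixel loss values in the batch, which is a guess based on the variable name in the traceback and not confirmed in this thread. The crop size 1025x513 is taken from the config file name and the batch size 4 from the command line; the fraction 0.2 and the helper name `topk_size` are purely hypothetical, so check the loss settings in the config for the real value.

```python
def topk_size(batch_size, height, width, top_k_percent_pixels):
    """Estimate the number of loss elements and the k passed to topk.

    Assumption (not confirmed by the thread): the loss holds one value per
    pixel per sample and k is a fixed fraction of all of those values.
    """
    num_elements = batch_size * height * width
    k = int(top_k_percent_pixels * num_elements)
    return num_elements, k


# Batch size and crop size come from the command and the config name;
# 0.2 is a hypothetical fraction used only for illustration.
num_elements, k = topk_size(batch_size=4, height=513, width=1025,
                            top_k_percent_pixels=0.2)
print(f"loss elements: {num_elements}, k passed to topk: {k}")
```

With these inputs the loss would hold 2,103,300 values, so under this assumption `k` lands in the hundreds of thousands, which fits the "topk size is too large" explanation above.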

Newcomer-CL commented 1 year ago

It still fails with an error.

Newcomer-CL commented 1 year ago

Same error.

qili93 commented 1 year ago

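Could you run it again with "export GLOG_v=10 && export XPU_BLACK_LIST=topk" set, redirect all of the output to a log file, and upload that file? Thanks.

It is also possible that a trailing comma is needed, as in "export XPU_BLACK_LIST=topk,". Please give that a try as well.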
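The two requests above (set `GLOG_v` and `XPU_BLACK_LIST`, then capture everything in one log file) could be scripted roughly as follows. This is only a sketch using the Python standard library: the helper names `build_debug_env` and `run_and_capture` are hypothetical, and whether the blacklist value needs the trailing comma is exactly the open question in this thread, so it is left as a switch.

```python
import os
import subprocess


def build_debug_env(blacklisted_ops, glog_level=10, trailing_comma=False,
                    base_env=None):
    """Return a copy of the environment with GLOG_v and XPU_BLACK_LIST set."""
    env = dict(os.environ if base_env is None else base_env)
    env["GLOG_v"] = str(glog_level)
    value = ",".join(blacklisted_ops)
    if trailing_comma:
        # Variant suggested in the thread: "topk," instead of "topk".
        value += ","
    env["XPU_BLACK_LIST"] = value
    return env


def run_and_capture(cmd, env, log_path):
    """Run cmd with env, writing stdout and stderr to a single log file."""
    with open(log_path, "w", encoding="utf-8") as log:
        completed = subprocess.run(cmd, env=env, stdout=log,
                                   stderr=subprocess.STDOUT, check=False)
    return completed.returncode


# Hypothetical usage with the training command from this issue:
# cmd = ["python3", "-m", "paddle.distributed.launch", "train.py",
#        "--config", "<config from the original command>", "--batch_size", "4"]
# run_and_capture(cmd, build_debug_env(["topk"]), "train_debug.log")
```

Running it once with `trailing_comma=False` and once with `trailing_comma=True` would produce two logs that can be compared to see whether the fallback to CPU is actually taken.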

Newcomer-CL commented 1 year ago

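I modified paddleseg/models/losses/cross_entropy_loss.py in the source so that topk is computed on the CPU and the result is then moved back to the XPU. After that, training ran for 1000 iterations without any error.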
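The exact change is not shown in the thread, so the following is only one possible way to write such a workaround, not the patch that was actually applied. It assumes the loss is a 1-D tensor at the point where `paddle.topk` is called. The top-k indices are selected on the host with NumPy, and the values are then gathered on the original device so that gradients still flow through the device-side tensor. The names `topk_indices_on_host` and `topk_with_cpu_fallback` are hypothetical, and the Paddle part has not been verified on XPU hardware.

```python
import numpy as np


def topk_indices_on_host(values, k):
    """Indices of the k largest entries of a 1-D array, largest first."""
    values = np.asarray(values)
    if values.ndim != 1 or not 0 < k <= values.size:
        raise ValueError("expected a 1-D array and 0 < k <= len(values)")
    # argpartition finds the k largest entries without a full sort.
    candidates = np.argpartition(values, -k)[-k:]
    # Sort only those k candidates in descending order of value.
    order = np.argsort(values[candidates])[::-1]
    return candidates[order].astype(np.int64)


def topk_with_cpu_fallback(loss, k):
    """Sketch of a stand-in for paddle.topk(loss, k) on a 1-D tensor.

    Paddle is imported lazily so the NumPy helper above can be used and
    tested without a Paddle installation.
    """
    import paddle

    # Copy a detached version of the loss to the host and select indices there.
    host_indices = topk_indices_on_host(loss.detach().cpu().numpy(), k)
    # Move only the indices back to the device that holds the loss.
    indices = paddle.to_tensor(host_indices, place=loss.place)
    # Gather on the device so autograd still tracks the selected values.
    values = paddle.gather(loss, indices)
    return values, indices
```

In `_post_process_loss`, the line `loss, indices = paddle.topk(loss, top_k_pixels)` from the traceback would then become `loss, indices = topk_with_cpu_fallback(loss, top_k_pixels)`. The cost is a device-to-host copy of the full loss tensor on every iteration.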
