XPU training fails with "Device memory not enough"

umbraclet16 commented 3 years ago

Describe the bug

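I am training on an XPU in a domestic (Chinese-made) hardware and OS environment. After a few steps, training fails with: ioctl() fail, (705) Device memory not enough. Training still fails with batch_size set to 1. Most of the time, xpu_smi shows memory usage under 4000 / 8064 MB.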

Reproduction

python tools/train.py -c yolov3_mobilenet_v1.yml -o use_gpu=false worker_num=1 batch_size=1 --eval

Did you make any modifications on the code or config? Did you understand what you have modified? Please provide the codes that you modified. I changed worker_num and batch_size.
What dataset did you use? A COCO-format dataset.
Please provide the error messages or relevant log information. Error Message Summary: ExternalError: XPU conv kernel return wrong value[3 xpu api no enough workspace]

Environment

Please provide the version of Paddle and PaddleDetection you use: paddlepaddle=2.1.1 paddledetection=2.1
Please provide the version of any other related tools/products used, such as the version of PaddleServing and etc: None
Please provide the OS information, e.g., Linux: Kylin V10 linux master 4.19.90-17.ky10.aarch64
Please provide the version of Python you used. 3.7.4
Please provide the version of CUDA/cuDNN you used. None

qingqing01 commented 3 years ago

Could you try static-graph training on Kunlun XPU? See this document, which lists several verified configs: https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.2/static/docs/tutorials/train_on_kunlun.md

umbraclet16 commented 3 years ago

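I tested the static graph: yolov3_mobilenet_v1_voc.yml with bs=4 and yolov3_darknet_roadsign_kunlun.yml with bs=2 both train. I then switched to another card and ran the dynamic-graph yolov3_mobilenetv1 with bs changed to 4, and it trains as well. My earlier batch-size change probably never took effect; I should have used -o TrainReader.batch_size.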
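The point about -o TrainReader.batch_size can be illustrated with a small sketch. It assumes that `-o key=value` overrides are merged into a nested config by dotted path, which matches the behaviour described above. It is not PaddleDetection's actual merge code, and `apply_overrides`, `parse_value`, and the sample config values are hypothetical names and numbers used only for illustration.

```python
import copy


def parse_value(text):
    """Convert an override string to bool, int, or float when possible."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def apply_overrides(config, overrides):
    """Return a copy of `config` with 'a.b=value' style overrides applied."""
    merged = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = merged
        for name in parents:
            # Walk (or create) the nested section named by each path component.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return merged


# Hypothetical config: the reader section owns the batch size.
config = {"worker_num": 2, "TrainReader": {"batch_size": 8, "shuffle": True}}

# A top-level batch_size only adds a new key; the reader setting is unchanged.
wrong = apply_overrides(config, ["batch_size=1"])

# The dotted path reaches the nested value the reader actually uses.
right = apply_overrides(config, ["TrainReader.batch_size=1"])

print(wrong["TrainReader"]["batch_size"], right["TrainReader"]["batch_size"])
```

Under this assumption, an override that does not name the reader section leaves the effective batch size untouched, which would explain why the first run still ran out of memory.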

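I then tried Faster R-CNN R50. Both the dynamic graph and the static-graph config verified on Kunlun still run out of memory. bs is already 1, so it cannot be reduced any further.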

Also, does XPU support multi-card training? I ran: python -m paddle.distributed.launch --xpus 0,1 tools/train.py -c configs/xx.yml -o use_gpu=false use_xpu=true

Dynamic graph error: Operator (broadcast) is not registered. Static graph error: workerlog.1: ValueError: Operator "gen_nccl_id" has not been registered.

QingshuChen commented 3 years ago

XPU supports multi-card training. Try pulling the latest Paddle, building it from source, and running again.

qingqing01 commented 3 years ago

How much memory does your XPU have? For the static graph, you can try adding export FLAGS_eager_delete_tensor_gb=0.
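For anyone launching training from a script rather than an interactive shell, the same flag can be passed through the child process environment. The following is a minimal sketch under that assumption; `env_with_eager_delete` is a hypothetical helper, not part of Paddle, and the commented launch line reuses the placeholder config path from this thread.

```python
import os


def env_with_eager_delete(base_env=None, threshold_gb="0"):
    """Return a copy of the environment with FLAGS_eager_delete_tensor_gb set.

    The caller's mapping (or os.environ) is left unmodified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["FLAGS_eager_delete_tensor_gb"] = str(threshold_gb)
    return env


env = env_with_eager_delete()
print(env["FLAGS_eager_delete_tensor_gb"])

# Example launch (not executed here; the config path is a placeholder):
# import subprocess
# subprocess.run(
#     ["python", "tools/train.py", "-c", "configs/xx.yml",
#      "-o", "use_gpu=false", "use_xpu=true"],
#     env=env,
#     check=True,
# )
```

The variable has to be in place before the training process starts, which is why the sketch builds the environment first and hands it to the child process.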

umbraclet16 commented 3 years ago

> How much memory does your XPU have? For the static graph, you can try adding export FLAGS_eager_delete_tensor_gb=0.

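It's a K200, 8 GB I believe. I tried this environment variable with the dynamic graph earlier and it had no effect. With it set, the static graph runs for ten to twenty steps and then still crashes.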

umbraclet16 commented 3 years ago

> XPU supports multi-card training. Try pulling the latest Paddle, building it from source, and running again.

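Is there an official aarch64 + XPU whl package? We work in an environment without internet access, so we are worried that a source build will be missing dependencies or will need network access in some other way. When we built Paddle Lite earlier, the build had to download files partway through.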

qingqing01 commented 3 years ago

https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/09_hardware_support/xpu_docs/paddle_install_cn.html

See the linked document. You can contact the official PaddlePaddle mailing group at Paddle-better@baidu.com to obtain a whl package.

umbraclet16 commented 3 years ago

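One more question: when running Faster R-CNN inference on XPU in the domestic environment, I get this error: InvalidArgumentError: The Variable type must be N6paddle9framework9LoDTensorE, but the type it holds is St6vectorIN6paddle9framework9LoDTensorESaIS2_EE. Running on CPU/GPU works fine, so the exported model itself should be OK. Inference calls Detector.predict() in deploy/python/infer.py.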

QingshuChen commented 3 years ago

Kunlun's operator support is not as complete as on CPU/GPU; this parameter mode is not supported.

paddle-bot-old[bot] commented 2 years ago

Since this issue has not been updated for more than three months, it will be closed. If it is not solved or there is a follow-up issue, please reopen it at any time and we will continue to follow up. It is recommended to pull and try the latest code first.
