PaddlePaddle / Paddle

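PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training, plus cross-platform deployment, for deep learning and machine learning)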
http://www.paddlepaddle.org/
Apache License 2.0
22.14k stars 5.56k forks

[Paper Reproduction Challenge] Single-machine multi-GPU training error #42120

Closed miemie2013 closed 1 year ago

miemie2013 commented 2 years ago

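Issue description

Project: https://aistudio.baidu.com/aistudio/projectdetail/3848537
PaddlePaddle version: 2.3.0rc0
Hardware: a single machine with four V100 GPUs.
After opening the project, run the following three commands from the project page to install the dependencies, install the custom gather operator, and unzip the dataset.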

! cd ~/ppgan; pip install -r requirements.txt
! cd ~/ppgan/custom_ops/gather; python setup.py install
! cd ~/data/data42681/; unzip afhq.zip

Next, start single-machine 4-GPU training with:

CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --gpus 0,1,2,3 tools/main.py -c configs/stylegan_v2ada_512_afhqcat_4_gpu.yaml
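
For readers who prefer to start the job from Python, the launch command above can be sketched as follows. This is a minimal sketch, not part of the original project: `build_launch` is a hypothetical helper, and it assumes the working directory is the project root (`~/ppgan`). It also sets `CUDA_VISIBLE_DEVICES` in the child environment explicitly, because a bare `CUDA_VISIBLE_DEVICES=...` line in a shell only reaches child processes if the variable was already exported.

```python
import os
import sys


def build_launch(gpus, script, config, base_env=None):
    """Return (argv, env) for a single-machine multi-GPU launch.

    Hypothetical helper: it only assembles the command shown in the issue.
    """
    gpu_list = ",".join(str(g) for g in gpus)
    env = dict(os.environ if base_env is None else base_env)
    # Set the variable in the child environment explicitly instead of
    # relying on a shell assignment that may not be exported.
    env["CUDA_VISIBLE_DEVICES"] = gpu_list
    argv = [
        sys.executable, "-m", "paddle.distributed.launch",
        "--gpus", gpu_list,
        script, "-c", config,
    ]
    return argv, env


argv, env = build_launch(
    [0, 1, 2, 3],
    "tools/main.py",
    "configs/stylegan_v2ada_512_afhqcat_4_gpu.yaml",
)
print(" ".join(argv[1:]))
# To actually start training (requires Paddle and the project checkout):
#   import subprocess
#   subprocess.run(argv, env=env, check=True)
```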

The launch fails with the following error:

...
File "/home/aistudio/ppgan/ppgan/models/styleganv2ada_model.py", line 348, in accumulate_gradients
    (loss_Gpl.mean() * float(gain)).backward()  # 咩酱: gain is the training interval of this phase, as mentioned above.
  File "<decorator-gen-133>", line 2, in backward
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/framework.py", line 395, in __impl__
    return func(*args, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/varbase_patch_methods.py", line 290, in backward
    framework._dygraph_tracer())
NotImplementedError: (Unimplemented) Place Place(gpu:0) is not supported. Please check that your paddle compiles with WITH_GPU, WITH_XPU, WITH_IPU, WITH_MLU or WITH_ASCEND_CL option or check that your train process set the correct device id if you use Executor. (at /paddle/paddle/fluid/platform/device_context.cc:139)

The code runs fine on a single machine with a single GPU. I would appreciate help from the maintainers. Thanks!
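
The error message lists two things to check: whether Paddle was compiled with GPU support, and whether the training process selected a valid device. A minimal diagnostic sketch for both is shown below. `device_report` is a hypothetical helper name; the Paddle calls it uses (`paddle.is_compiled_with_cuda`, `paddle.device.get_device`, `paddle.device.cuda.device_count`) are assumed to be available in the installed 2.x build. It only gathers facts and is not a fix for the error.

```python
def device_report():
    """Collect basic facts about the installed Paddle build, if any."""
    report = {"paddle_installed": False}
    try:
        import paddle
    except ImportError:
        # Paddle is not installed in this interpreter; report only that.
        return report
    report["paddle_installed"] = True
    report["version"] = paddle.__version__
    # False here would mean a CPU-only wheel is installed.
    report["compiled_with_cuda"] = paddle.is_compiled_with_cuda()
    # The device the current process would place tensors on, e.g. "gpu:0".
    report["current_device"] = paddle.device.get_device()
    # Number of GPUs visible to this process.
    report["visible_gpus"] = paddle.device.cuda.device_count()
    return report


print(device_report())
```

Running this once in a single-GPU session and once inside each launched worker makes it easier to see whether the two environments differ.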

paddle-bot-old[bot] commented 2 years ago

Hi! We've received your issue; please be patient while waiting for a response. We will arrange for technicians to answer your questions as soon as possible. Please double-check that you have provided a clear problem description, reproduction code, environment and version details, and the error message. You can also check the API documentation, the FAQ, past GitHub issues, and the AI community for answers. Have a nice day!

LDOUBLEV commented 2 years ago

Try adding paddle.set_device('gpu') before the main function runs.

miemie2013 commented 2 years ago

Try adding paddle.set_device('gpu') before the main function runs.

That is already called in the setup() method of ppgan/utils/setup.py.
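
To confirm which device each worker was assigned, the environment set by the launcher can be printed at startup. The sketch below is an illustration only: `launcher_env` is a hypothetical helper, and it assumes the launcher exports `PADDLE_TRAINER_ID`, `PADDLE_TRAINERS_NUM`, and `FLAGS_selected_gpus` to each worker process. Verify those names against the installed Paddle version.

```python
import os


def launcher_env(env=None):
    """Read the per-process values assumed to be exported by the launcher."""
    env = os.environ if env is None else env
    return {
        # Index of this worker among all workers.
        "rank": int(env.get("PADDLE_TRAINER_ID", "0")),
        # Total number of workers in the job.
        "world_size": int(env.get("PADDLE_TRAINERS_NUM", "1")),
        # GPU id assigned to this worker (kept as a string).
        "selected_gpu": env.get("FLAGS_selected_gpus", "0"),
    }


info = launcher_env()
print(f"rank {info['rank']} of {info['world_size']} -> gpu:{info['selected_gpu']}")
```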

miemie2013 commented 2 years ago

PaddlePaddle 2.2.2 can run single-machine 4-GPU training normally. In the project above, open a terminal and run:

python -m pip install paddlepaddle-gpu==2.2.2.post101 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html

This installs PaddlePaddle 2.2.2.

Two code changes are needed. In the __init__() method of the StyleGANv2ADAModel class in ppgan/models/styleganv2_model.py, uncomment this line:

# self.augment_pipe = None

Also comment out this import at the top of ppgan/models/generators/generator_styleganv2ada.py:

from custom_gather import gather_op

In other words, train without StyleGANv2ADA_AugmentPipe().
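
As an alternative to commenting the import in and out by hand, the custom operator could be loaded conditionally. This is a sketch of that idea, not what the project does: `load_gather_op` and `use_augment_pipe` are hypothetical names, and the module and attribute names are taken from the import line above.

```python
import importlib


def load_gather_op(module_name="custom_gather", attr="gather_op"):
    """Return the custom op if its module is installed, otherwise None."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The custom operator was not built or installed in this environment.
        return None
    return getattr(module, attr, None)


gather_op = load_gather_op()
# Only enable the augmentation pipeline when the custom op is available.
use_augment_pipe = gather_op is not None
print("augment pipe enabled:", use_augment_pipe)
```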

Start training with the same command:

CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --gpus 0,1,2,3 tools/main.py -c configs/stylegan_v2ada_512_afhqcat_4_gpu.yaml

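Training then runs normally. However, PaddlePaddle 2.2.2 cannot be used for the actual training: it does not support second-order gradients for custom external operators, and paddle.grad() has a serious bug. See https://github.com/PaddlePaddle/Paddle/issues/40800 and https://github.com/PaddlePaddle/Paddle/issues/39759 for details. PaddlePaddle 2.3.0rc0 has fixed both problems.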
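
Since the comment above ties the required features to the 2.3.0 release line, a training script could guard against older installs with a small version check. The sketch below uses only the standard library. `release_tuple` and `has_reported_fixes` are hypothetical helpers, and the threshold reflects what this thread reports rather than official release notes.

```python
import re

# First release line reported in this thread to carry both fixes.
MIN_RELEASE = (2, 3, 0)


def release_tuple(version):
    """Extract (major, minor, patch), ignoring suffixes such as 'rc0'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def has_reported_fixes(version):
    """True if the version is on or after the 2.3.0 release line."""
    return release_tuple(version) >= MIN_RELEASE


for candidate in ("2.2.2", "2.3.0rc0"):
    print(candidate, has_reported_fixes(candidate))
```

In a real script the string would come from `paddle.__version__`. Development builds may report a placeholder version, so treat a failed check as a warning, not as proof.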

LielinJiang commented 2 years ago

Our preliminary conclusion is that this is a framework issue. We will fix it as soon as possible.

miemie2013 commented 2 years ago

Our preliminary conclusion is that this is a framework issue. We will fix it as soon as possible.

OK.

LielinJiang commented 2 years ago

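Our preliminary conclusion is that this is a framework issue. We will fix it as soon as possible.

OK.

One more question: have you verified on 2.3.0rc that https://github.com/PaddlePaddle/Paddle/issues/40800 and https://github.com/PaddlePaddle/Paddle/issues/39759 are resolved?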

miemie2013 commented 2 years ago

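Our preliminary conclusion is that this is a framework issue. We will fix it as soon as possible.

OK.

One more question: have you verified on 2.3.0rc that #40800 and #39759 are resolved?

paddle.grad() works correctly for stylegan2ada, but with the code example in https://github.com/PaddlePaddle/Paddle/issues/40741 its results still do not match PyTorch.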

miemie2013 commented 2 years ago

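Our preliminary conclusion is that this is a framework issue. We will fix it as soon as possible.

OK.

One more question: have you verified on 2.3.0rc that #40800 and #39759 are resolved?

paddle.grad() works correctly for stylegan2ada, but with the code example in #40741 its results still do not match PyTorch.

That example is simplified normalization code. It is not the code actually used in stylegan2ada; it is only there to illustrate the problem.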

LielinJiang commented 2 years ago

Solved by #42332

paddle-bot[bot] commented 1 year ago

Since you haven't replied for more than a year, we have closed this issue/PR. If the problem is not solved or there is a follow-up, please reopen it at any time and we will continue to follow up.