Paddle single-machine multi-GPU distributed run reports either no reader under contrib or an NCCL error

linmuchuiyang commented 2 years ago

Describe the Bug

I pulled the 22.05 PaddlePaddle image from the NGC website: docker run --gpus all -it --rm nvcr.io/nvidia/paddlepaddle:22.05-py3

I then ran the Paddle example: https://github.com/PaddlePaddle/models/blob/release/1.8/dygraph/mnist/train.py

The run failed with the error shown below. Command executed: python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog mnist_distribution_v1.py

WARNING 2022-06-15 14:58:25,811 launch.py:422] Not found distinct arguments and compiled with cuda or xpu. Default use collective mode
launch train in GPU mode!
INFO 2022-06-15 14:58:25,812 launch_utils.py:525] Local start 4 processes. First process distributed environment info (Only For Debug): 
    +=======================================================================================+
    |                        Distributed Envs                      Value                    |
    +---------------------------------------------------------------------------------------+
    |                       PADDLE_TRAINER_ID                        0                      |
    |                 PADDLE_CURRENT_ENDPOINT                 127.0.0.1:59873               |
    |                     PADDLE_TRAINERS_NUM                        4                      |
    |                PADDLE_TRAINER_ENDPOINTS  ... 0.1:55859,127.0.0.1:49121,127.0.0.1:38457|
    |                     PADDLE_RANK_IN_NODE                        0                      |
    |                 PADDLE_LOCAL_DEVICE_IDS                        0                      |
    |                 PADDLE_WORLD_DEVICE_IDS                     0,1,2,3                   |
    |                     FLAGS_selected_gpus                        0                      |
    |             FLAGS_selected_accelerators                        0                      |
    +=======================================================================================+

INFO 2022-06-15 14:58:25,812 launch_utils.py:530] details abouts PADDLE_TRAINER_ENDPOINTS can be found in ./mylog/endpoints.log, and detail running logs maybe found in ./mylog/workerlog.0
launch proc_id:5212 idx:0
launch proc_id:5231 idx:1
launch proc_id:5250 idx:2
launch proc_id:5270 idx:3
I0615 14:58:27.576098  5212 nccl_context.cc:74] init nccl context nranks: 4 local rank: 0 gpu id: 0 ring id: 0
W0615 14:58:29.451130  5212 device_context.cc:451] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.7, Runtime API Version: 11.7
W0615 14:58:29.460047  5212 device_context.cc:469] device: 0, cuDNN Version: 8.4.
loading mnist dataset from ./work/mnist.json.gz ...
Traceback (most recent call last):
  File "mnist_distribution_v1.py", line 107, in <module>
    train_multi_gpu()
  File "mnist_distribution_v1.py", line 84, in train_multi_gpu
    train_loader = fluid.contrib.reader.distributed_batch_reader(train_loader)
AttributeError: module 'paddle.fluid.contrib' has no attribute 'reader'
INFO 2022-06-15 14:58:46,935 launch_utils.py:320] terminate process group gid:5231
INFO 2022-06-15 14:58:46,935 launch_utils.py:320] terminate process group gid:5250
INFO 2022-06-15 14:58:46,936 launch_utils.py:320] terminate process group gid:5270
INFO 2022-06-15 14:58:50,940 launch_utils.py:341] terminate all the procs
ERROR 2022-06-15 14:58:50,940 launch_utils.py:602] ABORT!!! Out of all 4 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2022-06-15 14:58:54,945 launch_utils.py:341] terminate all the procs
INFO 2022-06-15 14:58:54,945 launch.py:311] Local processes completed.
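
For context on the traceback above: the failing call wraps a batch reader so that each trainer process receives a different share of the batches. The sketch below is a framework-independent illustration of that idea, not Paddle's implementation. `shard_batches` and `shard_from_env` are hypothetical helper names, and the code assumes the launcher sets `PADDLE_TRAINER_ID` and `PADDLE_TRAINERS_NUM` as shown in the environment table (it falls back to a single trainer otherwise). In Paddle 2.x, `paddle.io.DistributedBatchSampler` appears to be the documented route for sharding data across trainers; check the documentation for the installed version.

```python
import os
from typing import Callable, Iterable, Iterator, List


def shard_batches(
    batch_reader: Callable[[], Iterable[List[int]]],
    trainer_id: int,
    trainers_num: int,
) -> Callable[[], Iterator[List[int]]]:
    """Wrap a batch reader so this trainer only yields its own batches.

    Batches are dealt round-robin: within each round of `trainers_num`
    batches, trainer k keeps the k-th one. A trailing incomplete round is
    dropped so that every trainer yields the same number of batches.
    """
    if not 0 <= trainer_id < trainers_num:
        raise ValueError("trainer_id must be in [0, trainers_num)")

    def sharded() -> Iterator[List[int]]:
        current_round = []
        for batch in batch_reader():
            current_round.append(batch)
            if len(current_round) == trainers_num:
                yield current_round[trainer_id]
                current_round = []

    return sharded


def shard_from_env(batch_reader):
    """Read rank information from the launcher's environment variables.

    Assumption: PADDLE_TRAINER_ID / PADDLE_TRAINERS_NUM are set by the
    launcher, as in the "Distributed Envs" table printed at startup.
    """
    trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
    trainers_num = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))
    return shard_batches(batch_reader, trainer_id, trainers_num)
```

Dropping the incomplete final round is a design choice in this sketch: if trainers processed different numbers of batches, a collective operation on the last step could wait on a peer that has already finished.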

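I suspected the cause was an API difference: the example targets version 1.8, while the Docker environment ships version 2.2.2. I therefore used the Paddle v1-to-v2 conversion tool to convert the v1 script to v2. I then ran the parallel job with the same command, and this time it failed with the following error: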

INFO 2022-06-15 14:48:53,563 launch_utils.py:530] details abouts PADDLE_TRAINER_ENDPOINTS can be found in ./mylog/endpoints.log, and detail running logs maybe found in ./mylog/workerlog.0
launch proc_id:4501 idx:0
launch proc_id:4520 idx:1
launch proc_id:4539 idx:2
launch proc_id:4559 idx:3
I0615 14:48:55.291877  4501 nccl_context.cc:74] init nccl context nranks: 4 local rank: 0 gpu id: 0 ring id: 0
Traceback (most recent call last):
  File "mnist_distribution.py", line 109, in <module>
    train_multi_gpu()
  File "mnist_distribution.py", line 76, in train_multi_gpu
    strategy = paddle.fluid.dygraph.parallel.prepare_context()
  File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/parallel.py", line 68, in prepare_context
    parallel_helper._init_parallel_ctx()
  File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dygraph/parallel_helper.py", line 42, in _init_parallel_ctx
    __parallel_ctx__clz__.init()
OSError: (External) NCCL error(5), invalid usage.  Detail: Resource temporarily unavailable
Please try one of the following solutions:
1. export NCCL_SHM_DISABLE=1;
2. export NCCL_P2P_LEVEL=SYS;
3. Increase shared memory by setting the -shm-size option when starting docker container, e.g., setting  -shm-size=2g.

  [Hint: 'ncclInvalidUsage'. The call to NCCL is incorrect. This is usually reflecting a programming error.] (at /opt/paddle/paddle/paddle/fluid/platform/collective_helper.cc:99)

INFO 2022-06-15 14:49:03,684 launch_utils.py:341] terminate all the procs
ERROR 2022-06-15 14:49:03,684 launch_utils.py:602] ABORT!!! Out of all 4 trainers, the trainer process with rank=[0, 1, 2, 3] was aborted. Please check its log.
INFO 2022-06-15 14:49:07,689 launch_utils.py:341] terminate all the procs
INFO 2022-06-15 14:49:07,689 launch.py:311] Local processes completed.
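
The three suggestions in the error message can be verified from inside the container before relaunching. The following is a minimal sketch, assuming a Linux container with shared memory mounted at `/dev/shm`; `nccl_env_report` and `shm_total_gib` are hypothetical helpers, not part of Paddle or NCCL. It also reports `NCCL_DEBUG`, an NCCL environment variable that makes NCCL print more detail about initialization when set to `INFO`.

```python
import os
import shutil
from typing import Dict, Mapping, Optional

# Variables named in the error message, plus NCCL_DEBUG for verbose logs.
NCCL_VARS = ("NCCL_SHM_DISABLE", "NCCL_P2P_LEVEL", "NCCL_DEBUG")


def nccl_env_report(env: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Return the current value (or None if unset) of each NCCL variable."""
    return {name: env.get(name) for name in NCCL_VARS}


def shm_total_gib(path: str = "/dev/shm") -> Optional[float]:
    """Return the size of the shared-memory mount in GiB, or None if absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total / 2**30


if __name__ == "__main__":
    for name, value in nccl_env_report(os.environ).items():
        print(f"{name} = {value if value is not None else '<unset>'}")
    size = shm_total_gib()
    print("/dev/shm:", "not found" if size is None else f"{size:.2f} GiB")
```

Running this in the same shell that starts `paddle.distributed.launch` shows whether the exported variables are actually present in the environment the workers inherit. Note that the Docker option is spelled `--shm-size` (two dashes), and Docker's default shared-memory size is 64 MB when the option is not passed.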

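Following the hints above, I set the two environment variables and increased the Docker shm-size, but the same error still occurs. I also checked the machine environment with run_check and found that P2P access cannot be enabled between the GPUs, even though fluid passed the multi-GPU test.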

>>> fluid.install_check.run_check()
Running Verify Fluid Program ... 
W0615 14:36:55.247263  3184 device_context.cc:451] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.7, Runtime API Version: 11.7
W0615 14:36:55.254676  3184 device_context.cc:469] device: 0, cuDNN Version: 8.4.
Your Paddle Fluid works well on SINGLE GPU or CPU.
W0615 14:36:59.100487  3184 parallel_executor.cc:617] Cannot enable P2P access from 0 to 5
W0615 14:36:59.100513  3184 parallel_executor.cc:617] Cannot enable P2P access from 0 to 6
W0615 14:36:59.100518  3184 parallel_executor.cc:617] Cannot enable P2P access from 0 to 7
W0615 14:36:59.797154  3184 parallel_executor.cc:617] Cannot enable P2P access from 1 to 4
W0615 14:37:00.129410  3184 parallel_executor.cc:617] Cannot enable P2P access from 1 to 6
W0615 14:37:00.129437  3184 parallel_executor.cc:617] Cannot enable P2P access from 1 to 7
W0615 14:37:01.045341  3184 parallel_executor.cc:617] Cannot enable P2P access from 2 to 4
W0615 14:37:01.045370  3184 parallel_executor.cc:617] Cannot enable P2P access from 2 to 5
W0615 14:37:01.379971  3184 parallel_executor.cc:617] Cannot enable P2P access from 2 to 7
W0615 14:37:02.027123  3184 parallel_executor.cc:617] Cannot enable P2P access from 3 to 4
W0615 14:37:02.027153  3184 parallel_executor.cc:617] Cannot enable P2P access from 3 to 5
W0615 14:37:02.027158  3184 parallel_executor.cc:617] Cannot enable P2P access from 3 to 6
W0615 14:37:03.341859  3184 parallel_executor.cc:617] Cannot enable P2P access from 4 to 1
W0615 14:37:03.341889  3184 parallel_executor.cc:617] Cannot enable P2P access from 4 to 2
W0615 14:37:03.341893  3184 parallel_executor.cc:617] Cannot enable P2P access from 4 to 3
W0615 14:37:03.343364  3184 parallel_executor.cc:617] Cannot enable P2P access from 5 to 0
W0615 14:37:04.248374  3184 parallel_executor.cc:617] Cannot enable P2P access from 5 to 2
W0615 14:37:04.248404  3184 parallel_executor.cc:617] Cannot enable P2P access from 5 to 3
W0615 14:37:04.250039  3184 parallel_executor.cc:617] Cannot enable P2P access from 6 to 0
W0615 14:37:04.250051  3184 parallel_executor.cc:617] Cannot enable P2P access from 6 to 1
W0615 14:37:04.873052  3184 parallel_executor.cc:617] Cannot enable P2P access from 6 to 3
W0615 14:37:04.874171  3184 parallel_executor.cc:617] Cannot enable P2P access from 7 to 0
W0615 14:37:04.874182  3184 parallel_executor.cc:617] Cannot enable P2P access from 7 to 1
W0615 14:37:04.874188  3184 parallel_executor.cc:617] Cannot enable P2P access from 7 to 2
W0615 14:37:17.662714  3184 fuse_all_reduce_op_pass.cc:76] Find all_reduce operators: 2. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 1.
Your Paddle Fluid works well on MUTIPLE GPU or CPU.
Your Paddle Fluid is installed successfully! Let's start deep Learning with Paddle Fluid now
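
The warnings in the `run_check` output name specific device pairs rather than all pairs: every warning involves one device from 0 to 3 and one from 4 to 7, so no warning is logged for any pair inside 0,1,2,3, the set passed to `--selected_gpus` in the failing launch. A small parser makes this easy to check for any device set. This is a sketch under one assumption that the log itself does not confirm: that the absence of a warning means P2P access was enabled for that pair. `blocked_pairs` and `blocked_within` are hypothetical helper names.

```python
import re
from itertools import combinations
from typing import Iterable, List, Set, Tuple

P2P_WARNING = re.compile(r"Cannot enable P2P access from (\d+) to (\d+)")


def blocked_pairs(log_lines: Iterable[str]) -> Set[Tuple[int, int]]:
    """Collect (src, dst) device pairs for which a P2P warning was logged."""
    pairs = set()
    for line in log_lines:
        match = P2P_WARNING.search(line)
        if match:
            pairs.add((int(match.group(1)), int(match.group(2))))
    return pairs


def blocked_within(
    devices: Iterable[int], blocked: Set[Tuple[int, int]]
) -> List[Tuple[int, int]]:
    """Return the unordered pairs in `devices` that have a warning."""
    return [
        (a, b)
        for a, b in combinations(sorted(devices), 2)
        if (a, b) in blocked or (b, a) in blocked
    ]


# A few warning lines copied from the run_check output in this issue.
SAMPLE = [
    "W0615 14:36:59.100487  3184 parallel_executor.cc:617] Cannot enable P2P access from 0 to 5",
    "W0615 14:36:59.797154  3184 parallel_executor.cc:617] Cannot enable P2P access from 1 to 4",
    "W0615 14:37:01.379971  3184 parallel_executor.cc:617] Cannot enable P2P access from 2 to 7",
    "W0615 14:37:02.027158  3184 parallel_executor.cc:617] Cannot enable P2P access from 3 to 6",
]

if __name__ == "__main__":
    blocked = blocked_pairs(SAMPLE)
    print("blocked inside 0-3:", blocked_within([0, 1, 2, 3], blocked))
    print("blocked inside 0-7:", blocked_within(range(8), blocked))
```

Feeding the full `run_check` output to `blocked_pairs` instead of `SAMPLE` gives the complete picture for the machine.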

Additional Supplementary Information

The machine has 8 V100 GPUs with 16 GB of memory each.

paddle-bot-old[bot] commented 2 years ago

Hi! We've received your issue and will arrange for technical staff to respond as soon as possible; please be patient. Please double-check that you have provided a clear problem description, reproduction code, environment and version details, and error messages. You can also look for answers in the official API documentation, the FAQ, past GitHub issues, and the AI community. Have a nice day!

ZhangHandi commented 2 years ago

Hi, what do you mean by the v1-to-v2 converter?

linmuchuiyang commented 2 years ago

> Hi, what do you mean by the v1-to-v2 converter?

https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/model_convert/migration_cn.html

PaddlePaddle / Paddle

Paddle single-machine multi-GPU distributed run reports either no reader under contrib or an NCCL error #43565
