PaddlePaddle / Paddle

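PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of PaddlePaddle 『飞桨』: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)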
http://www.paddlepaddle.org/
Apache License 2.0

PaddlePaddle 2.0rc single-machine multi-GPU training: HAPI example fails #28563

Closed skywalk163 closed 3 years ago

skywalk163 commented 3 years ago

PaddlePaddle version: 2.0rc
Hardware: two Tesla T4 GPUs
OS: CentOS 7

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla T4            Off  | 00000000:21:01.0 Off |                    0 |
    | N/A   39C    P0    26W /  70W |   3185MiB / 15109MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla T4            Off  | 00000000:21:02.0 Off |                    0 |
    | N/A   40C    P0    26W /  70W |    974MiB / 15109MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    (base) [skywalk@ecs-ai ~]$ python
    Python 3.7.6 (default, Jan 8 2020, 19:59:22)
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import paddle
    >>> paddle.fluid.install_check.run_check()
    Running Verify Fluid Program ...
    W1112 10:12:43.584758 12534 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.2
    W1112 10:12:43.588452 12534 device_context.cc:346] device: 0, cuDNN Version: 7.6.
    Your Paddle Fluid works well on SINGLE GPU or CPU.
    W1112 10:12:48.169303 12534 fuse_all_reduce_op_pass.cc:75] Find all_reduce operators: 2. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 1.
    Your Paddle Fluid works well on MUTIPLE GPU or CPU.
    Your Paddle Fluid is installed successfully! Let's start deep Learning with Paddle Fluid now

Reference documentation: https://www.paddlepaddle.org.cn/documentation/docs/zh/2.0-rc/guides/01_paddle2.0_introduction/upgrade_guide_cn.html#spawn

Following the code in the spawn section of the reference documentation, I found that it fails when the high-level API is used:

    (base) [skywalk@ecs-ai chapter08_computational-performance]$ python hapispawn.py
    W1112 09:54:58.107329 11213 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.2
    W1112 09:54:58.110316 11213 device_context.cc:346] device: 0, cuDNN Version: 7.6.
    W1112 09:55:04.950773 11244 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.2
    W1112 09:55:04.953727 11244 device_context.cc:346] device: 0, cuDNN Version: 7.6.
    W1112 09:55:04.956311 11243 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.2
    W1112 09:55:04.959161 11243 device_context.cc:346] device: 0, cuDNN Version: 7.6.
    Epoch 1/1
    /home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
      return (isinstance(seq, collections.Sequence) and
    /home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
      return (isinstance(seq, collections.Sequence) and
    Traceback (most recent call last):
      File "hapispawn.py", line 25, in <module>
        dist.spawn(train)
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 407, in spawn
        while not context.join():
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 210, in join
        self._throw_exception(error_index)
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 228, in _throw_exception
        raise Exception(msg)
    Exception:

    ----------------------------------------------
    Process 1 terminated with the following error:
    ----------------------------------------------

    Traceback (most recent call last):
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 159, in _func_wrapper
        result = func(*args)
      File "/mnt/work/work/Dive-into-DL-PyTorch/code/chapter08_computational-performance/hapispawn.py", line 19, in train
        model.fit(train_dataset, epochs=1, batch_size=64, log_freq=400)
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/hapi/model.py", line 1469, in fit
        logs = self._run_one_epoch(train_loader, cbks, 'train')
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/hapi/model.py", line 1839, in _run_one_epoch
        data[len(self._inputs):])
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/hapi/model.py", line 937, in train_batch
        loss = self._adapter.train_batch(inputs, labels)
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/hapi/model.py", line 659, in train_batch
        self.model._optimizer.minimize(final_loss)
      File "", line 2, in minimize
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/fluid/dygraph/base.py", line 253, in __impl__
        return func(*args, **kwargs)
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/optimizer/optimizer.py", line 874, in minimize
        no_grad_set=no_grad_set)
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/optimizer/optimizer.py", line 676, in backward
        apply_collective_grads(parameter_list)
      File "", line 2, in apply_collective_grads
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/fluid/dygraph/base.py", line 253, in __impl__
        return func(*args, **kwargs)
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel.py", line 330, in apply_collective_grads
        coalesced_grad._allreduce(strategy)
    paddle.fluid.core_avx.EnforceNotMet:

    --------------------------------------
    C++ Traceback (most recent call last):
    --------------------------------------
    0   paddle::imperative::AllReduce(paddle::framework::Variable const&, paddle::framework::Variable*, paddle::imperative::ParallelStrategy const&)
    1   paddle::imperative::AllReduce(paddle::framework::Variable const&, paddle::framework::Variable*, paddle::imperative::ParallelStrategy const&, CUstream_st*)
    2   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
    3   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
    4   paddle::platform::GetCurrentTraceBackString[abi:cxx11]()

    ----------------------
    Error Message Summary:
    ----------------------
    ExternalError: Nccl error, invalid argument (at /paddle/paddle/fluid/imperative/all_reduce.cc:49)

    (base) [skywalk@ecs-ai chapter08_computational-performance]$ client_loop: send disconnect: Broken pipe

Contents of hapispawn.py:

    import paddle
    import paddle.distributed as dist

    train_dataset = paddle.vision.datasets.MNIST(mode='train')
    test_dataset = paddle.vision.datasets.MNIST(mode='test')
    lenet = paddle.vision.models.LeNet()

    # Mnist inherits from paddle.nn.Layer and is the Net; model adds the training functionality
    model = paddle.Model(lenet)

    # Set the optimizer, loss, and metric needed to train the model
    model.prepare(
        paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters()),
        paddle.nn.CrossEntropyLoss(),
        paddle.metric.Accuracy(topk=(1, 2))
    )

    def train():
        # Start training
        model.fit(train_dataset, epochs=1, batch_size=64, log_freq=400)

        # Start evaluation
        # model.evaluate(test_dataset, log_freq=20, batch_size=64)

    if __name__ == '__main__':
        dist.spawn(train)

paddle-bot-old[bot] commented 3 years ago

Hi! We have received your issue, and a technical staff member will respond within one day; please be patient. Please check again that you have provided a clear problem description, reproduction code, environment and version details, and the error message. You can also look for answers in the official API documentation, the FAQ, past GitHub issues, and the AI community. Have a nice day!

JZ-LIANG commented 3 years ago

Before running, set export NCCL_DEBUG=INFO so that the NCCL logs are printed.

luotao1 commented 3 years ago

Judging from the C++ stack trace, this looks very similar to #28484. You can refer to the workaround in that PR:

    export NCCL_SHM_DISABLE=1
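The two environment variables suggested in the comments above can also be set from Python instead of the shell. The following is a minimal sketch, not part of Paddle: the helper name `nccl_debug_env` and its parameters are hypothetical, and it only builds an environment dictionary that a launcher could pass to a child process. `NCCL_DEBUG` and `NCCL_SHM_DISABLE` are the NCCL variables named in this thread.

```python
import os


def nccl_debug_env(base=None, disable_shm=False):
    """Return a copy of an environment with NCCL debugging switched on.

    `base` defaults to the current process environment; it is copied,
    never modified in place. (Hypothetical helper for illustration.)
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"          # ask NCCL to print its INFO log lines
    if disable_shm:
        env["NCCL_SHM_DISABLE"] = "1"   # workaround suggested in this thread
    return env


# Possible usage (assumes hapispawn.py is in the current directory):
#   import subprocess, sys
#   subprocess.run([sys.executable, "hapispawn.py"],
#                  env=nccl_debug_env(disable_shm=True))
```

Passing the dictionary through `env=` keeps the change local to one run, so it does not leak into other commands started from the same shell.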

skywalk163 commented 3 years ago

I had a look, but there does not seem to be any NCCL-related output.

    (base) [skywalk@ecs-ai chapter08_computational-performance]$ export  NCCL_DEBUG=INFO
    (base) [skywalk@ecs-ai chapter08_computational-performance]$ echo $NCCL_DEBUG
    INFO
    (base) [skywalk@ecs-ai chapter08_computational-performance]$ python hapispawn.py 
    W1112 13:31:43.235133 28034 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.2
    W1112 13:31:43.238111 28034 device_context.cc:346] device: 0, cuDNN Version: 7.6.
    W1112 13:31:50.106040 28062 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.2
    W1112 13:31:50.109007 28062 device_context.cc:346] device: 0, cuDNN Version: 7.6.
    W1112 13:31:50.190696 28061 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.2
    W1112 13:31:50.193691 28061 device_context.cc:346] device: 0, cuDNN Version: 7.6.
    /home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
      return (isinstance(seq, collections.Sequence) and
    Epoch 1/1
    /home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
      return (isinstance(seq, collections.Sequence) and
    Traceback (most recent call last):
      File "hapispawn.py", line 25, in <module>
        dist.spawn(train)
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 407, in spawn
        while not context.join():
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 210, in join
        self._throw_exception(error_index)
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 228, in _throw_exception
        raise Exception(msg)
    Exception: 

    ----------------------------------------------
    Process 1 terminated with the following error:
    ----------------------------------------------

    Traceback (most recent call last):
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 159, in _func_wrapper
        result = func(*args)
      File "/mnt/work/work/Dive-into-DL-PyTorch/code/chapter08_computational-performance/hapispawn.py", line 19, in train
        model.fit(train_dataset, epochs=1, batch_size=64, log_freq=400)
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/hapi/model.py", line 1469, in fit
        logs = self._run_one_epoch(train_loader, cbks, 'train')
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/hapi/model.py", line 1839, in _run_one_epoch
        data[len(self._inputs):])
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/hapi/model.py", line 937, in train_batch
        loss = self._adapter.train_batch(inputs, labels)
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/hapi/model.py", line 659, in train_batch
        self.model._optimizer.minimize(final_loss)
      File "<decorator-gen-175>", line 2, in minimize
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/fluid/dygraph/base.py", line 253, in __impl__
        return func(*args, **kwargs)
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/optimizer/optimizer.py", line 874, in minimize
        no_grad_set=no_grad_set)
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/optimizer/optimizer.py", line 676, in backward
        apply_collective_grads(parameter_list)
      File "<decorator-gen-41>", line 2, in apply_collective_grads
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/fluid/dygraph/base.py", line 253, in __impl__
        return func(*args, **kwargs)
      File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel.py", line 330, in apply_collective_grads
        coalesced_grad._allreduce(strategy)
    paddle.fluid.core_avx.EnforceNotMet: 

    --------------------------------------
    C++ Traceback (most recent call last):
    --------------------------------------
    0   paddle::imperative::AllReduce(paddle::framework::Variable const&, paddle::framework::Variable*, paddle::imperative::ParallelStrategy const&)
    1   paddle::imperative::AllReduce(paddle::framework::Variable const&, paddle::framework::Variable*, paddle::imperative::ParallelStrategy const&, CUstream_st*)
    2   paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
    3   std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
    4   paddle::platform::GetCurrentTraceBackString[abi:cxx11]()

    ----------------------
    Error Message Summary:
    ----------------------
    ExternalError:  Nccl error, invalid argument  (at /paddle/paddle/fluid/imperative/all_reduce.cc:49)

For the record, the test passes with the ordinary API:

    (base) [skywalk@ecs-ai chapter08_computational-performance]$ python spawn.py
    server not ready, wait 3 sec to retry...
    not ready endpoints:['127.0.0.1:59802']
    server not ready, wait 3 sec to retry...
    not ready endpoints:['127.0.0.1:59802']
    server not ready, wait 3 sec to retry...
    not ready endpoints:['127.0.0.1:59802']
    server not ready, wait 3 sec to retry...
    not ready endpoints:['127.0.0.1:59802']
    ecs-ai:18739:18739 [0] NCCL INFO Bootstrap : Using [0]eth0:192.168.1.160<0>
    ecs-ai:18739:18739 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
    ecs-ai:18739:18739 [0] NCCL INFO NET/IB : No device found.
    ecs-ai:18739:18739 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.1.160<0>
    I1112 19:08:09.684443 18739 nccl_context.cc:160] init nccl context nranks: 2 local rank: 0 gpu id: 0
    NCCL version 2.5.6+cuda9.0
    I1112 19:08:09.684478 18740 nccl_context.cc:160] init nccl context nranks: 2 local rank: 1 gpu id: 1
    ecs-ai:18740:18740 [1] NCCL INFO Bootstrap : Using [0]eth0:192.168.1.160<0>
    ecs-ai:18740:18740 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
    ecs-ai:18740:18740 [1] NCCL INFO NET/IB : No device found.
    ecs-ai:18740:18740 [1] NCCL INFO NET/Socket : Using [0]eth0:192.168.1.160<0>
    ecs-ai:18740:18740 [1] NCCL INFO Setting affinity for GPU 1 to ffff
    ecs-ai:18739:18739 [0] NCCL INFO Setting affinity for GPU 0 to ffff
    ecs-ai:18739:18739 [0] NCCL INFO Channel 00/02 : 0 1
    ecs-ai:18740:18740 [1] NCCL INFO Threads per block : 512/640/256
    ecs-ai:18739:18739 [0] NCCL INFO Channel 01/02 : 0 1
    ecs-ai:18740:18740 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64
    ecs-ai:18739:18739 [0] NCCL INFO Threads per block : 512/640/256
    ecs-ai:18740:18740 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
    ecs-ai:18739:18739 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64
    ecs-ai:18739:18739 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
    ecs-ai:18740:18740 [1] NCCL INFO Ring 00 : 1[21020] -> 0[21010] via P2P/IPC
    ecs-ai:18739:18739 [0] NCCL INFO Ring 00 : 0[21010] -> 1[21020] via P2P/IPC
    ecs-ai:18740:18740 [1] NCCL INFO Ring 01 : 1[21020] -> 0[21010] via P2P/IPC
    ecs-ai:18739:18739 [0] NCCL INFO Ring 01 : 0[21010] -> 1[21020] via P2P/IPC
    ecs-ai:18740:18740 [1] NCCL INFO comm 0x56545a3527f0 rank 1 nranks 2 cudaDev 1 busId 21020 - Init COMPLETE
    W1112 19:08:09.958758 18740 device_context.cc:338] Please NOTE: device: 1, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.2
    ecs-ai:18739:18739 [0] NCCL INFO comm 0x555af85c2d20 rank 0 nranks 2 cudaDev 0 busId 21010 - Init COMPLETE
    W1112 19:08:09.959158 18739 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.2
    W1112 19:08:09.961978 18740 device_context.cc:346] device: 1, cuDNN Version: 7.6.
    W1112 19:08:09.962162 18739 device_context.cc:346] device: 0, cuDNN Version: 7.6.
    ecs-ai:18739:18739 [0] NCCL INFO Launch mode Parallel
    epoch: 0, batch_id: 0, loss is: [49.067875], acc is: [0.140625]
    epoch: 0, batch_id: 0, loss is: [51.128468], acc is: [0.09375]
    epoch: 0, batch_id: 400, loss is: [0.21057084], acc is: [0.96875]
    epoch: 0, batch_id: 400, loss is: [0.31912962], acc is: [0.96875]
    epoch: 0, batch_id: 800, loss is: [0.05543002], acc is: [1.]
    epoch: 0, batch_id: 800, loss is: [0.27797294], acc is: [0.96875]
    (base) [skywalk@ecs-ai chapter08_computational-performance]$ cat spawn.py

    import paddle  # this is the version with 3 changes
    import paddle.distributed as dist  # change 1: import the library

    train_dataset = paddle.vision.datasets.MNIST(mode='train')
    test_dataset = paddle.vision.datasets.MNIST(mode='test')

    # Load the training set with batch_size set to 64
    train_loader = paddle.io.DataLoader(train_dataset, places=paddle.CPUPlace(), batch_size=64, shuffle=True)

    def train():
        # Change 2: initialize the parallel environment
        dist.init_parallel_env()

        # Change 3: wrap the network with paddle.DataParallel
        net = paddle.DataParallel(paddle.vision.models.LeNet())  # the guide does not give the full module path of LeNet here
        epochs = 1
        adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=net.parameters())
        # Use Adam as the optimizer
        for epoch in range(epochs):
            for batch_id, data in enumerate(train_loader()):
                x_data = data[0]
                y_data = data[1]
                predicts = net(x_data)
                acc = paddle.metric.accuracy(predicts, y_data, k=2)
                avg_acc = paddle.mean(acc)
                loss = paddle.nn.functional.cross_entropy(predicts, y_data, reduction='mean')
                loss.backward()  # the guide mistakenly writes avg_loss here
                if batch_id % 400 == 0:
                    print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(epoch, batch_id, loss.numpy(), avg_acc.numpy()))  # the guide mistakenly writes avg_loss here
                adam.step()
                adam.clear_grad()

    # Launch multi-process training of train; uses all visible GPUs by default
    if __name__ == '__main__':
        dist.spawn(train)

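The two runs above differ in one visible way: the spawn.py log contains NCCL INFO lines ending in "Init COMPLETE" for both ranks, while the hapispawn.py log contains none before the Nccl error. As a rough aid for comparing such logs, here is a small sketch that counts those lines in captured output. The function name `summarize_nccl_log` and the returned keys are hypothetical, and the patterns are taken only from the log lines shown in this thread, so other NCCL versions may print different text.

```python
import re


def summarize_nccl_log(log_text):
    """Summarize NCCL-related lines in a captured training log (illustrative)."""
    lines = log_text.splitlines()
    nccl_info = [ln for ln in lines if "NCCL INFO" in ln]

    # Ranks whose communicator reported "Init COMPLETE", e.g.
    # "... NCCL INFO comm 0x... rank 1 nranks 2 cudaDev 1 ... - Init COMPLETE"
    ranks = []
    for ln in nccl_info:
        if "Init COMPLETE" in ln:
            match = re.search(r"\brank (\d+) nranks", ln)
            if match:
                ranks.append(int(match.group(1)))

    # Lines such as "ExternalError: Nccl error, invalid argument (at ...)"
    errors = [ln.strip() for ln in lines if "Nccl error" in ln]

    return {
        "nccl_info_lines": len(nccl_info),
        "initialized_ranks": sorted(ranks),
        "errors": errors,
    }
```

A log with zero NCCL INFO lines but a non-empty `errors` list, as in the hapispawn.py run, suggests the failure happened without any NCCL communicator setup being logged; the thread itself does not establish why.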
skywalk163 commented 3 years ago

It still fails after setting export NCCL_SHM_DISABLE=1.

skywalk163 commented 3 years ago

I no longer have access to the dual-T4 environment, so I am closing this for now.
