Closed skywalk163 closed 3 years ago
Hi! We've received your issue; please be patient while waiting for a response. The average response time is expected to be within one day. Please make sure that you have provided a clear problem description, reproduction code, environment and version details, and error messages. You may also check the API docs, FAQ, GitHub Issues, and the AI community for an answer. Have a nice day!
Before running, export NCCL_DEBUG=INFO so that the NCCL log is printed.
Judging from the C++ stack trace, this looks very similar to #28484; you can refer to the workaround in that PR:
export NCCL_SHM_DISABLE=1
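The two suggestions above (NCCL_DEBUG=INFO and NCCL_SHM_DISABLE=1) can also be applied from Python when launching the script, which avoids depending on the state of the current shell. This is a minimal sketch, not part of Paddle: the helper names `nccl_debug_env` and `run_script` are hypothetical, and only the two environment variable names come from this thread.

```python
import os
import subprocess
import sys


def nccl_debug_env(base_env=None, disable_shm=False):
    """Return a copy of the environment with the NCCL settings from this thread.

    Hypothetical helper: the caller's environment is never modified in place.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"  # ask NCCL to print its INFO-level log
    if disable_shm:
        env["NCCL_SHM_DISABLE"] = "1"  # the workaround suggested above
    return env


def run_script(script_path, disable_shm=False):
    """Run a training script in a child process with the NCCL variables set."""
    return subprocess.run(
        [sys.executable, script_path],
        env=nccl_debug_env(disable_shm=disable_shm),
        check=False,  # let the caller inspect the return code
    )


# Example call (hapispawn.py is the script name used in this thread):
# result = run_script("hapispawn.py", disable_shm=True)
# print("exit code:", result.returncode)
```

Because the variables are passed to the child process explicitly, the worker processes started by that script should inherit them as well, assuming nothing in between clears the environment.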
I checked, but I don't seem to see any NCCL-related output.
(base) [skywalk@ecs-ai chapter08_computational-performance]$ export NCCL_DEBUG=INFO
(base) [skywalk@ecs-ai chapter08_computational-performance]$ echo $NCCL_DEBUG
INFO
(base) [skywalk@ecs-ai chapter08_computational-performance]$ python hapispawn.py
W1112 13:31:43.235133 28034 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.2
W1112 13:31:43.238111 28034 device_context.cc:346] device: 0, cuDNN Version: 7.6.
W1112 13:31:50.106040 28062 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.2
W1112 13:31:50.109007 28062 device_context.cc:346] device: 0, cuDNN Version: 7.6.
W1112 13:31:50.190696 28061 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.2
W1112 13:31:50.193691 28061 device_context.cc:346] device: 0, cuDNN Version: 7.6.
/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
return (isinstance(seq, collections.Sequence) and
Epoch 1/1
/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
return (isinstance(seq, collections.Sequence) and
Traceback (most recent call last):
File "hapispawn.py", line 25, in <module>
dist.spawn(train)
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 407, in spawn
while not context.join():
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 210, in join
self._throw_exception(error_index)
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 228, in _throw_exception
raise Exception(msg)
Exception:
----------------------------------------------
Process 1 terminated with the following error:
----------------------------------------------
Traceback (most recent call last):
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 159, in _func_wrapper
result = func(*args)
File "/mnt/work/work/Dive-into-DL-PyTorch/code/chapter08_computational-performance/hapispawn.py", line 19, in train
model.fit(train_dataset, epochs=1, batch_size=64, log_freq=400)
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/hapi/model.py", line 1469, in fit
logs = self._run_one_epoch(train_loader, cbks, 'train')
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/hapi/model.py", line 1839, in _run_one_epoch
data[len(self._inputs):])
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/hapi/model.py", line 937, in train_batch
loss = self._adapter.train_batch(inputs, labels)
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/hapi/model.py", line 659, in train_batch
self.model._optimizer.minimize(final_loss)
File "<decorator-gen-175>", line 2, in minimize
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/fluid/dygraph/base.py", line 253, in __impl__
return func(*args, **kwargs)
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/optimizer/optimizer.py", line 874, in minimize
no_grad_set=no_grad_set)
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/optimizer/optimizer.py", line 676, in backward
apply_collective_grads(parameter_list)
File "<decorator-gen-41>", line 2, in apply_collective_grads
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/fluid/dygraph/base.py", line 253, in __impl__
return func(*args, **kwargs)
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel.py", line 330, in apply_collective_grads
coalesced_grad._allreduce(strategy)
paddle.fluid.core_avx.EnforceNotMet:
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 paddle::imperative::AllReduce(paddle::framework::Variable const&, paddle::framework::Variable*, paddle::imperative::ParallelStrategy const&)
1 paddle::imperative::AllReduce(paddle::framework::Variable const&, paddle::framework::Variable*, paddle::imperative::ParallelStrategy const&, CUstream_st*)
2 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
3 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
4 paddle::platform::GetCurrentTraceBackString[abi:cxx11]()
----------------------
Error Message Summary:
----------------------
ExternalError: Nccl error, invalid argument (at /paddle/paddle/fluid/imperative/all_reduce.cc:49)
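One detail in the log above that may be worth checking: all three `device_context.cc:338` warnings report `device: 0`, whereas in the spawn.py run posted elsewhere in this thread the two workers report `device: 0` and `device: 1`. Whether this is related to the NCCL error is not established here. The sketch below only tallies which GPU id each process reported, so that two runs can be compared; the function name `devices_by_pid` and the regular expression are illustrative assumptions about the log format shown in this thread.

```python
import re
from collections import defaultdict

# Matches log lines such as:
# W1112 13:31:50.106040 28062 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, ...
DEVICE_LINE = re.compile(
    r"^[WIEF]\d{4} \S+\s+(\d+) device_context\.cc:\d+\] Please NOTE: device: (\d+),"
)


def devices_by_pid(log_text):
    """Map each process id to the set of GPU ids it reported in the log."""
    seen = defaultdict(set)
    for line in log_text.splitlines():
        match = DEVICE_LINE.match(line.strip())
        if match:
            pid, device = match.groups()
            seen[int(pid)].add(int(device))
    return dict(seen)


# Lines copied from the failing hapispawn.py run above.
failing_run = """\
W1112 13:31:43.235133 28034 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.2
W1112 13:31:50.106040 28062 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.2
W1112 13:31:50.190696 28061 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.2
"""
print(devices_by_pid(failing_run))
```

If every worker process maps to the same GPU id, that is a hint to look at how devices are assigned to the spawned processes; it is an observation aid, not a diagnosis.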
To clarify: the test passes when using the regular (non-high-level) API.
(base) [skywalk@ecs-ai chapter08_computational-performance]$ python spawn.py
server not ready, wait 3 sec to retry...
not ready endpoints:['127.0.0.1:59802']
server not ready, wait 3 sec to retry...
not ready endpoints:['127.0.0.1:59802']
server not ready, wait 3 sec to retry...
not ready endpoints:['127.0.0.1:59802']
server not ready, wait 3 sec to retry...
not ready endpoints:['127.0.0.1:59802']
ecs-ai:18739:18739 [0] NCCL INFO Bootstrap : Using [0]eth0:192.168.1.160<0>
ecs-ai:18739:18739 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
ecs-ai:18739:18739 [0] NCCL INFO NET/IB : No device found.
ecs-ai:18739:18739 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.1.160<0>
I1112 19:08:09.684443 18739 nccl_context.cc:160] init nccl context nranks: 2 local rank: 0 gpu id: 0
NCCL version 2.5.6+cuda9.0
I1112 19:08:09.684478 18740 nccl_context.cc:160] init nccl context nranks: 2 local rank: 1 gpu id: 1
ecs-ai:18740:18740 [1] NCCL INFO Bootstrap : Using [0]eth0:192.168.1.160<0>
ecs-ai:18740:18740 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
ecs-ai:18740:18740 [1] NCCL INFO NET/IB : No device found.
ecs-ai:18740:18740 [1] NCCL INFO NET/Socket : Using [0]eth0:192.168.1.160<0>
ecs-ai:18740:18740 [1] NCCL INFO Setting affinity for GPU 1 to ffff
ecs-ai:18739:18739 [0] NCCL INFO Setting affinity for GPU 0 to ffff
ecs-ai:18739:18739 [0] NCCL INFO Channel 00/02 : 0 1
ecs-ai:18740:18740 [1] NCCL INFO Threads per block : 512/640/256
ecs-ai:18739:18739 [0] NCCL INFO Channel 01/02 : 0 1
ecs-ai:18740:18740 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64
ecs-ai:18739:18739 [0] NCCL INFO Threads per block : 512/640/256
ecs-ai:18740:18740 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
ecs-ai:18739:18739 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64
ecs-ai:18739:18739 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
ecs-ai:18740:18740 [1] NCCL INFO Ring 00 : 1[21020] -> 0[21010] via P2P/IPC
ecs-ai:18739:18739 [0] NCCL INFO Ring 00 : 0[21010] -> 1[21020] via P2P/IPC
ecs-ai:18740:18740 [1] NCCL INFO Ring 01 : 1[21020] -> 0[21010] via P2P/IPC
ecs-ai:18739:18739 [0] NCCL INFO Ring 01 : 0[21010] -> 1[21020] via P2P/IPC
ecs-ai:18740:18740 [1] NCCL INFO comm 0x56545a3527f0 rank 1 nranks 2 cudaDev 1 busId 21020 - Init COMPLETE
W1112 19:08:09.958758 18740 device_context.cc:338] Please NOTE: device: 1, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.2
ecs-ai:18739:18739 [0] NCCL INFO comm 0x555af85c2d20 rank 0 nranks 2 cudaDev 0 busId 21010 - Init COMPLETE
W1112 19:08:09.959158 18739 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.2
W1112 19:08:09.961978 18740 device_context.cc:346] device: 1, cuDNN Version: 7.6.
W1112 19:08:09.962162 18739 device_context.cc:346] device: 0, cuDNN Version: 7.6.
ecs-ai:18739:18739 [0] NCCL INFO Launch mode Parallel
epoch: 0, batch_id: 0, loss is: [49.067875], acc is: [0.140625]
epoch: 0, batch_id: 0, loss is: [51.128468], acc is: [0.09375]
epoch: 0, batch_id: 400, loss is: [0.21057084], acc is: [0.96875]
epoch: 0, batch_id: 400, loss is: [0.31912962], acc is: [0.96875]
epoch: 0, batch_id: 800, loss is: [0.05543002], acc is: [1.]
epoch: 0, batch_id: 800, loss is: [0.27797294], acc is: [0.96875]
(base) [skywalk@ecs-ai chapter08_computational-performance]$ cat spawn.py
import paddle  # this version contains 3 changes
import paddle.distributed as dist  # change 1: import the library

train_dataset = paddle.vision.datasets.MNIST(mode='train')
test_dataset = paddle.vision.datasets.MNIST(mode='test')
# load the training set with batch_size set to 64
train_loader = paddle.io.DataLoader(train_dataset, places=paddle.CPUPlace(), batch_size=64, shuffle=True)

def train():
    # change 2: initialize the parallel environment
    dist.init_parallel_env()
    # change 3: wrap the model with paddle.DataParallel
    net = paddle.DataParallel(paddle.vision.models.LeNet())  # the guide does not give the full module path for LeNet here
    epochs = 1
    # use Adam as the optimizer
    adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=net.parameters())
    for epoch in range(epochs):
        for batch_id, data in enumerate(train_loader()):
            x_data = data[0]
            y_data = data[1]
            predicts = net(x_data)
            acc = paddle.metric.accuracy(predicts, y_data, k=2)
            avg_acc = paddle.mean(acc)
            loss = paddle.nn.functional.cross_entropy(predicts, y_data, reduction='mean')
            loss.backward()  # the guide mistakenly writes avg_loss here
            if batch_id % 400 == 0:
                print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(epoch, batch_id, loss.numpy(), avg_acc.numpy()))  # the guide mistakenly writes avg_loss here
            adam.step()
            adam.clear_grad()

# launch multi-process training of train(); all visible GPUs are used by default
if __name__ == '__main__':
    dist.spawn(train)
It still fails after export NCCL_SHM_DISABLE=1.
I no longer have a dual-T4 environment available, so I'll close this for now.
PaddlePaddle version: 2.0rc
Hardware: dual T4
OS: CentOS 7
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:21:01.0 Off |                    0 |
| N/A   39C    P0    26W /  70W |   3185MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:21:02.0 Off |                    0 |
| N/A   40C    P0    26W /  70W |    974MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
(base) [skywalk@ecs-ai ~]$ python
Python 3.7.6 (default, Jan 8 2020, 19:59:22)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
Reference doc: https://www.paddlepaddle.org.cn/documentation/docs/zh/2.0-rc/guides/01_paddle2.0_introduction/upgrade_guide_cn.html#spawn
Following the code in the spawn section of the reference doc, I found that an error is raised when using the high-level API:
(base) [skywalk@ecs-ai chapter08_computational-performance]$ python hapispawn.py
W1112 09:54:58.107329 11213 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.2
W1112 09:54:58.110316 11213 device_context.cc:346] device: 0, cuDNN Version: 7.6.
W1112 09:55:04.950773 11244 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.2
W1112 09:55:04.953727 11244 device_context.cc:346] device: 0, cuDNN Version: 7.6.
W1112 09:55:04.956311 11243 device_context.cc:338] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.2
W1112 09:55:04.959161 11243 device_context.cc:346] device: 0, cuDNN Version: 7.6.
Epoch 1/1
/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
return (isinstance(seq, collections.Sequence) and
/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
return (isinstance(seq, collections.Sequence) and
Traceback (most recent call last):
File "hapispawn.py", line 25, in <module>
dist.spawn(train)
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 407, in spawn
while not context.join():
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 210, in join
self._throw_exception(error_index)
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 228, in _throw_exception
raise Exception(msg)
Exception:
----------------------------------------------
Process 1 terminated with the following error:
----------------------------------------------
Traceback (most recent call last):
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 159, in _func_wrapper
result = func(*args)
File "/mnt/work/work/Dive-into-DL-PyTorch/code/chapter08_computational-performance/hapispawn.py", line 19, in train
model.fit(train_dataset, epochs=1, batch_size=64, log_freq=400)
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/hapi/model.py", line 1469, in fit
logs = self._run_one_epoch(train_loader, cbks, 'train')
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/hapi/model.py", line 1839, in _run_one_epoch
data[len(self._inputs):])
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/hapi/model.py", line 937, in train_batch
loss = self._adapter.train_batch(inputs, labels)
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/hapi/model.py", line 659, in train_batch
self.model._optimizer.minimize(final_loss)
File "<decorator-gen-175>", line 2, in minimize
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/fluid/dygraph/base.py", line 253, in __impl__
return func(*args, **kwargs)
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/optimizer/optimizer.py", line 874, in minimize
no_grad_set=no_grad_set)
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/optimizer/optimizer.py", line 676, in backward
apply_collective_grads(parameter_list)
File "<decorator-gen-41>", line 2, in apply_collective_grads
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/fluid/dygraph/base.py", line 253, in __impl__
return func(*args, **kwargs)
File "/home/skywalk/anaconda3/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel.py", line 330, in apply_collective_grads
coalesced_grad._allreduce(strategy)
paddle.fluid.core_avx.EnforceNotMet:
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 paddle::imperative::AllReduce(paddle::framework::Variable const&, paddle::framework::Variable*, paddle::imperative::ParallelStrategy const&)
1 paddle::imperative::AllReduce(paddle::framework::Variable const&, paddle::framework::Variable*, paddle::imperative::ParallelStrategy const&, CUstream_st*)
2 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
3 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
4 paddle::platform::GetCurrentTraceBackString[abi:cxx11]()
----------------------
Error Message Summary:
----------------------
ExternalError: Nccl error, invalid argument (at /paddle/paddle/fluid/imperative/all_reduce.cc:49)
(base) [skywalk@ecs-ai chapter08_computational-performance]$ client_loop: send disconnect: Broken pipe
Code of hapispawn.py: