apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Crash when trying to add two ndarrays on different GPUs #11275

Open ThomasDelteil opened 6 years ago

ThomasDelteil commented 6 years ago

Description

When adding two NDArrays that live on different GPU contexts, I get either a CUDA "invalid resource handle" crash or an illegal memory access error (see the error message below):

Environment info (Required)

----------Python Info----------
Version      : 3.6.3
Compiler     : GCC 7.2.0
Build        : ('default', 'Oct 13 2017 12:02:49')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 9.0.3
Directory    : /home/ubuntu/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.3.0
Directory    : /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet
Commit Hash   : 74479b89eaba8241573079aa5e32f0ba0f8dd00e
----------System Info----------
Platform     : Linux-4.4.0-1052-aws-x86_64-with-debian-stretch-sid
system       : Linux
node         : ip-172-31-23-125
release      : 4.4.0-1052-aws
version      : #61-Ubuntu SMP Mon Feb 12 23:05:58 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2699.984
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4600.10
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single retpoline kaiser fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0020 sec, LOAD: 0.4260 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1239 sec, LOAD: 0.6256 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1370 sec, LOAD: 0.5648 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0414 sec, LOAD: 0.5376 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0035 sec, LOAD: 0.2710 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0190 sec, LOAD: 0.1067 sec.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   43C    P0    51W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   42C    P0    48W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   41C    P0    52W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   43C    P0    53W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Build info (Required if built from source)

pip install mxnet-cu91mkl --pre

The crash also happens with 1.2.0:

pip install mxnet-cu91

Error Message:

terminate called after throwing an instance of 'dmlc::Error'
  what():  [06:45:53] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:182: Check failed: e == cudaSuccess CUDA: invalid resource handle

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x31d18a) [0x7eff1015e18a]
[bt] (1) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x31d7a1) [0x7eff1015e7a1]
[bt] (2) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2769244) [0x7eff125aa244]
[bt] (3) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2769b55) [0x7eff125aab55]
[bt] (4) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2780367) [0x7eff125c1367]
[bt] (5) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2780606) [0x7eff125c1606]
[bt] (6) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x277a5f4) [0x7eff125bb5f4]
[bt] (7) /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7eff8d332c5c]
[bt] (8) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7eff8e57c6ba]
[bt] (9) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7eff8e2b241d]

Steps to reproduce


from mxnet import nd
import mxnet as mx
nd.add(nd.ones(1, ctx=mx.gpu(0)), nd.ones(1, ctx=mx.gpu(3)))
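The usual workaround is to copy one operand onto the other's context before adding, e.g. with NDArray.as_in_context. Since a GPU build of MXNet cannot be assumed here, the sketch below is a toy stand-in for NDArray (the class, contexts, and error message are illustrative, not MXNet internals) showing the explicit-copy pattern and the clear context-mismatch error one would expect instead of a crash:

```python
# Toy model of the explicit-copy workaround; `Arr` is a hypothetical
# stand-in for mxnet.nd.NDArray, with contexts as plain strings.
class Arr:
    def __init__(self, value, ctx):
        self.value = value
        self.ctx = ctx

    def as_in_context(self, ctx):
        # Explicit device-to-device copy, mirroring NDArray.as_in_context
        return Arr(self.value, ctx)

    def __add__(self, other):
        # Refuse mixed-context operands with a clear error rather than crashing
        if self.ctx != other.ctx:
            raise ValueError(
                f"operands live on different contexts: {self.ctx} vs {other.ctx}; "
                "copy one with as_in_context() first")
        return Arr(self.value + other.value, self.ctx)

a = Arr(1.0, "gpu(0)")
b = Arr(1.0, "gpu(3)")
c = a + b.as_in_context("gpu(0)")  # copy b onto gpu(0), then add
```

In real MXNet the equivalent would be `nd.ones(1, ctx=mx.gpu(3)).as_in_context(mx.gpu(0))` before the add.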
ThomasDelteil commented 6 years ago

Shouldn't it return the same error as when trying to add a CPU array and a GPU array? Note that sometimes I get an Illegal Memory Access error instead of an invalid resource handle.

Edit: I realized that after training my multi-GPU model, where I stored the loss for each GPU in separate values on separate GPUs, I am able to add these losses together without copying them across devices. Can someone explain to me why this is possible? I thought you could not add across GPUs?

train_loss

[
 [ 49.52454758]
 <NDArray 1 @gpu(0)>, 
 [ 49.66656113]
 <NDArray 1 @gpu(1)>]

train_loss[0] + train_loss[1]

[ 99.1911087]
<NDArray 1 @gpu(0)>

train_loss[1] + train_loss[0]

[ 99.1911087]
<NDArray 1 @gpu(1)>
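The printouts above suggest a placement rule: when the cross-GPU add does succeed, the result is allocated on the left operand's context (gpu(0) for `train_loss[0] + train_loss[1]`, gpu(1) for the reverse). A minimal sketch of that observed rule, assuming nothing about MXNet's internals:

```python
# Hypothetical helper modeling only the observed result-placement rule:
# the output of a successful cross-device add lands on the left
# operand's context, as the train_loss printouts show.
def result_ctx(left_ctx: str, right_ctx: str) -> str:
    return left_ctx  # result follows the left operand

print(result_ctx("gpu(0)", "gpu(1)"))  # gpu(0)
print(result_ctx("gpu(1)", "gpu(0)"))  # gpu(1)
```

This describes where the result ends up, not how the engine performs the cross-device read, which is exactly the open question in this comment.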
kalyc commented 6 years ago

Thanks for submitting this issue @ThomasDelteil. Could you add the labels "Memory" and "Bug" to this?

ThomasDelteil commented 6 years ago

@kalyc I am not a committer and do not have labelling rights. @nswamy, could you add the labels please?