denizyuret / Knet.jl

Koç University deep learning framework.
https://denizyuret.github.io/Knet.jl/latest

Multiple GPU Error (Core Dumped) #301

Open ceteke opened 6 years ago

ceteke commented 6 years ago

Hi,

I have been trying to train different models on multiple GPUs. When I start the programs at the same time, everything works fine. However, if an already running program has allocated memory on a GPU, I get a core-dump error when I try to start another program on a different GPU (I'm using Knet.gpu(id) to select the GPU).

I tried running TensorFlow and PyTorch and got the same error, but the error goes away if I set the CUDA_VISIBLE_DEVICES environment variable. That workaround does not help with Knet: even if I set the variable to 3, Knet.gpu() returns 0, and trying to change the device with Knet.gpu(id) fails.
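For TensorFlow/PyTorch, driver-level masking with one process per GPU seems to be the working pattern; a minimal sketch of that setup (train.jl is a hypothetical placeholder for the training script):

```shell
# Sketch of the per-process workaround; train.jl is a placeholder script name.
# Each process is shown exactly one physical GPU, which it sees as device 0,
# so no in-process device switching (Knet.gpu(id)) is needed.
CUDA_VISIBLE_DEVICES=0 julia train.jl &   # this process uses physical GPU 0
CUDA_VISIBLE_DEVICES=1 julia train.jl &   # this process uses physical GPU 1
wait                                      # wait for both background jobs
```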

Here is the error if CUDA_VISIBLE_DEVICES is not set:

cem@Glados:~$ unset CUDA_VISIBLE_DEVICES
cem@Glados:~$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using Knet
*** Error in `julia': double free or corruption (!prev): 0x000000000418f140 ***
======= Backtrace: =========
...

signal (6): Aborted
while loading no file, in expression starting on line 0
raise at /build/glibc-Cl5G7W/glibc-2.23/signal/../sysdeps/unix/sysv/linux/raise.c:54
abort at /build/glibc-Cl5G7W/glibc-2.23/stdlib/abort.c:89
__libc_message at /build/glibc-Cl5G7W/glibc-2.23/libio/../sysdeps/posix/libc_fatal.c:175
malloc_printerr at /build/glibc-Cl5G7W/glibc-2.23/malloc/malloc.c:5006 [inlined]
_int_free at /build/glibc-Cl5G7W/glibc-2.23/malloc/malloc.c:3867
__libc_free at /build/glibc-Cl5G7W/glibc-2.23/malloc/malloc.c:2968
unknown function (ip: 0x7f3a964f0d7b)
unknown function (ip: 0x7f3a964f0dc2)
unknown function (ip: 0x7f3a964f1063)
unknown function (ip: 0x7f3a963e392f)
unknown function (ip: 0x7f3a963bdabb)
cuInit at /usr/lib/x86_64-linux-gnu/libcuda.so.1 (unknown line)
unknown function (ip: 0x7f3a9b2fb8a9)
unknown function (ip: 0x7f3a9b2fb900)
unknown function (ip: 0x7f3ad8312a98)
unknown function (ip: 0x7f3a9b333868)
unknown function (ip: 0x7f3a9b2f7b69)
unknown function (ip: 0x7f3a9b2fcd8a)
cudaRuntimeGetVersion at /usr/local/cuda-9.0/lib64/libcudart.so (unknown line)
gpu at /home/cem/.julia/v0.6/Knet/src/gpu.jl:85
__init__ at /home/cem/.julia/v0.6/Knet/src/Knet.jl:56
unknown function (ip: 0x7f3a9f9543af)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1424 [inlined]
jl_module_run_initializer at /buildworker/worker/package_linux64/build/src/toplevel.c:87
jl_init_restored_modules at /buildworker/worker/package_linux64/build/src/dump.c:2443 [inlined]
_jl_restore_incremental at /buildworker/worker/package_linux64/build/src/dump.c:3318
jl_restore_incremental at /buildworker/worker/package_linux64/build/src/dump.c:3338
_include_from_serialized at ./loading.jl:157
_require_from_serialized at ./loading.jl:200
unknown function (ip: 0x7f3ad22c8ce0)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
_require_search_from_serialized at ./loading.jl:236
unknown function (ip: 0x7f3ad22c94d0)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
_require at ./loading.jl:441
require at ./loading.jl:405
unknown function (ip: 0x7f3ad22cad4b)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1424 [inlined]
eval_import_path_ at /buildworker/worker/package_linux64/build/src/toplevel.c:403
eval_import_path at /buildworker/worker/package_linux64/build/src/toplevel.c:430 [inlined]
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:495
jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/builtins.c:496
eval at ./boot.jl:235
unknown function (ip: 0x7f3ad21a0d2f)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
eval_user_input at ./REPL.jl:66
unknown function (ip: 0x7f3ad222218f)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
macro expansion at ./REPL.jl:97 [inlined]
#1 at ./event.jl:73
unknown function (ip: 0x7f3a9f9522af)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1424 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:267
unknown function (ip: 0xffffffffffffffff)
Allocations: 2184233 (Pool: 2182906; Big: 1327); GC: 2
Aborted (core dumped)

And this is what happens when CUDA_VISIBLE_DEVICES is set:

cem@Glados:~$ export CUDA_VISIBLE_DEVICES=3
cem@Glados:~$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using Knet

julia> gpu()
0

julia> gpu(3)
ERROR: cudart.cudaSetDevice error 10
Stacktrace:
 [1] gpu(::Int64) at /home/cem/.julia/v0.6/Knet/src/gpu.jl:18

GPU status

cem@Glados:~$ nvidia-smi
Sun Apr 22 13:53:32 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 23%   35C    P8    17W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 23%   32C    P8    16W / 250W |    165MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 32%   60C    P2    84W / 250W |  11175MiB / 11178MiB |     86%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 23%   26C    P8    16W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1     39641      C   julia                                        155MiB |
|    2     39641      C   julia                                      11165MiB |
+-----------------------------------------------------------------------------+
ceteke commented 6 years ago

I have allocated some memory on the GPU to see if the environment variable works:

cem@Glados:~$ export CUDA_VISIBLE_DEVICES=3
cem@Glados:~$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using Knet; gpu()
0

julia> dummy = KnetArray{Float32}(randn(100,100,100,100))

New GPU Status

cem@Glados:~$ nvidia-smi
Sun Apr 22 14:04:29 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 23%   36C    P8    17W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 23%   37C    P8    16W / 250W |    165MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 44%   77C    P2   217W / 250W |  11175MiB / 11178MiB |     38%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 23%   31C    P2    56W / 250W |    555MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1     39641      C   julia                                        155MiB |
|    2     39641      C   julia                                      11165MiB |
|    3      2692      C   julia                                        537MiB |
+-----------------------------------------------------------------------------+

You can see that the KnetArray is allocated on GPU 3 even though Knet.gpu() returns 0.

However, the behaviour is nondeterministic: Knet.gpu() sometimes returns -1 when the environment variable is set, and then memory allocation fails.

cem@Glados:~$ export CUDA_VISIBLE_DEVICES=1
cem@Glados:~$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using Knet; gpu()
-1

julia> KnetArray{Float32}(randn(100,100,100))
ERROR: KnetPtr: bad device id -1.
Stacktrace:
 [1] Knet.KnetPtr(::Int64) at /home/cem/.julia/v0.6/Knet/src/kptr.jl:73
 [2] Type at /home/cem/.julia/v0.6/Knet/src/karray.jl:104 [inlined]
 [3] Type at /home/cem/.julia/v0.6/Knet/src/karray.jl:114 [inlined]
 [4] convert(::Type{Knet.KnetArray{Float32,3}}, ::Array{Float64,3}) at /home/cem/.julia/v0.6/Knet/src/karray.jl:153
 [5] Knet.KnetArray{Float32,N} where N(::Array{Float64,3}) at ./sysimg.jl:77
cangumeli commented 6 years ago

Can you check Knet.nvmlfound?

ceteke commented 6 years ago

It core-dumps if I don't set CUDA_VISIBLE_DEVICES.

...
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1424 [inlined]
jl_module_run_initializer at /buildworker/worker/package_linux64/build/src/toplevel.c:87
jl_init_restored_modules at /buildworker/worker/package_linux64/build/src/dump.c:2443 [inlined]
_jl_restore_incremental at /buildworker/worker/package_linux64/build/src/dump.c:3318
jl_restore_incremental at /buildworker/worker/package_linux64/build/src/dump.c:3338
_include_from_serialized at ./loading.jl:157
_require_from_serialized at ./loading.jl:200
unknown function (ip: 0x7f40c7344ce0)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
_require_search_from_serialized at ./loading.jl:236
unknown function (ip: 0x7f40c73454d0)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
_require at ./loading.jl:441
require at ./loading.jl:405
unknown function (ip: 0x7f40c7346d4b)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1424 [inlined]
eval_import_path_ at /buildworker/worker/package_linux64/build/src/toplevel.c:403
eval_import_path at /buildworker/worker/package_linux64/build/src/toplevel.c:430 [inlined]
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:495
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:551
jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/builtins.c:496
eval at ./boot.jl:235
unknown function (ip: 0x7f40c721cd2f)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
eval_user_input at ./REPL.jl:66
unknown function (ip: 0x7f40c729e18f)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
macro expansion at ./REPL.jl:97 [inlined]
#1 at ./event.jl:73
unknown function (ip: 0x7f40949ce2af)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1424 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:267
unknown function (ip: 0xffffffffffffffff)
Allocations: 2191844 (Pool: 2190515; Big: 1329); GC: 2
Aborted (core dumped)

However, if I set CUDA_VISIBLE_DEVICES:

cem@Glados:~$ export CUDA_VISIBLE_DEVICES=0
cem@Glados:~$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using Knet; Knet.nvmlfound
true
ngphuoc commented 6 years ago

Can you try export CUDA_DEVICE_ORDER=PCI_BUS_ID?
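For context on why this can matter: by default the CUDA runtime enumerates devices fastest-first, so its device IDs need not match the PCI-bus ordering that nvidia-smi prints; setting CUDA_DEVICE_ORDER=PCI_BUS_ID forces the two numberings to agree. A sketch of the combined setup:

```shell
# Sketch: make CUDA device IDs match nvidia-smi's ordering before masking.
export CUDA_DEVICE_ORDER=PCI_BUS_ID   # enumerate by PCI bus ID, as nvidia-smi does
export CUDA_VISIBLE_DEVICES=3         # now "3" is the same card nvidia-smi calls GPU 3
julia                                 # the process should see one device, as gpu 0
```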