Open ceteke opened 6 years ago
I allocated some memory on the GPU to check whether the environment variable works:
cem@Glados:~$ export CUDA_VISIBLE_DEVICES=3
cem@Glados:~$ julia
(Julia 0.6.2 (2017-12-13) startup banner, official x86_64-pc-linux-gnu release)
julia> using Knet; gpu()
0
julia> dummy = KnetArray{Float32}(randn(100,100,100,100))
New GPU Status
cem@Glados:~$ nvidia-smi
Sun Apr 22 14:04:29 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30 Driver Version: 390.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:02:00.0 Off | N/A |
| 23% 36C P8 17W / 250W | 10MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:03:00.0 Off | N/A |
| 23% 37C P8 16W / 250W | 165MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:82:00.0 Off | N/A |
| 44% 77C P2 217W / 250W | 11175MiB / 11178MiB | 38% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:83:00.0 Off | N/A |
| 23% 31C P2 56W / 250W | 555MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 1 39641 C julia 155MiB |
| 2 39641 C julia 11165MiB |
| 3 2692 C julia 537MiB |
+-----------------------------------------------------------------------------+
You can see that the KnetArray is allocated on GPU 3 even though Knet.gpu() returns 0.
However, there is also some random behaviour: when I set the environment variable, Knet.gpu() sometimes returns -1 and memory allocation fails.
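For what it's worth, gpu() returning 0 here matches standard CUDA behaviour rather than a Knet-specific bug: once CUDA_VISIBLE_DEVICES is set to a single ID, the process sees exactly one device, and CUDA always renumbers visible devices starting from 0, while nvidia-smi keeps reporting physical indices. A minimal sketch (the julia command in the comment is illustrative only, not run here):

```shell
# With only GPU 3 visible, CUDA renumbers it as device 0 inside the process,
# so e.g. `julia -e 'using Knet; println(gpu())'` would print 0 even though
# nvidia-smi shows the allocation on physical GPU 3.
export CUDA_VISIBLE_DEVICES=3
echo "visible devices: $CUDA_VISIBLE_DEVICES"
```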
cem@Glados:~$ export CUDA_VISIBLE_DEVICES=1
cem@Glados:~$ julia
(Julia 0.6.2 startup banner)
julia> using Knet; gpu()
-1
julia> KnetArray{Float32}(randn(100,100,100))
ERROR: KnetPtr: bad device id -1.
Stacktrace:
[1] Knet.KnetPtr(::Int64) at /home/cem/.julia/v0.6/Knet/src/kptr.jl:73
[2] Type at /home/cem/.julia/v0.6/Knet/src/karray.jl:104 [inlined]
[3] Type at /home/cem/.julia/v0.6/Knet/src/karray.jl:114 [inlined]
[4] convert(::Type{Knet.KnetArray{Float32,3}}, ::Array{Float64,3}) at /home/cem/.julia/v0.6/Knet/src/karray.jl:153
[5] Knet.KnetArray{Float32,N} where N(::Array{Float64,3}) at ./sysimg.jl:77
Can you check Knet.nvmlfound?
I get a core dump if I don't set CUDA_VISIBLE_DEVICES.
...
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1424 [inlined]
jl_module_run_initializer at /buildworker/worker/package_linux64/build/src/toplevel.c:87
jl_init_restored_modules at /buildworker/worker/package_linux64/build/src/dump.c:2443 [inlined]
_jl_restore_incremental at /buildworker/worker/package_linux64/build/src/dump.c:3318
jl_restore_incremental at /buildworker/worker/package_linux64/build/src/dump.c:3338
_include_from_serialized at ./loading.jl:157
_require_from_serialized at ./loading.jl:200
unknown function (ip: 0x7f40c7344ce0)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
_require_search_from_serialized at ./loading.jl:236
unknown function (ip: 0x7f40c73454d0)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
_require at ./loading.jl:441
require at ./loading.jl:405
unknown function (ip: 0x7f40c7346d4b)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1424 [inlined]
eval_import_path_ at /buildworker/worker/package_linux64/build/src/toplevel.c:403
eval_import_path at /buildworker/worker/package_linux64/build/src/toplevel.c:430 [inlined]
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:495
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:551
jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/builtins.c:496
eval at ./boot.jl:235
unknown function (ip: 0x7f40c721cd2f)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
eval_user_input at ./REPL.jl:66
unknown function (ip: 0x7f40c729e18f)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
macro expansion at ./REPL.jl:97 [inlined]
#1 at ./event.jl:73
unknown function (ip: 0x7f40949ce2af)
jl_call_fptr_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /buildworker/worker/package_linux64/build/src/julia_internal.h:358 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:1926
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1424 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:267
unknown function (ip: 0xffffffffffffffff)
Allocations: 2191844 (Pool: 2190515; Big: 1329); GC: 2
Aborted (core dumped)
However, if I do set CUDA_VISIBLE_DEVICES:
cem@Glados:~$ export CUDA_VISIBLE_DEVICES=0
cem@Glados:~$ julia
(Julia 0.6.2 startup banner)
julia> using Knet; Knet.nvmlfound
true
Can you try export CUDA_DEVICE_ORDER=PCI_BUS_ID?
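For context (a well-known CUDA quirk, not specific to Knet): by default the CUDA runtime enumerates devices fastest-first, while nvidia-smi numbers them in PCI bus order, so the same index can refer to different physical cards in the two tools. Setting CUDA_DEVICE_ORDER makes the two numberings agree. A sketch, assuming a POSIX shell:

```shell
# Force CUDA to number devices in PCI bus order, matching nvidia-smi.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=1
# `nvidia-smi -L` would then list GPUs with the same indices CUDA uses.
echo "$CUDA_DEVICE_ORDER $CUDA_VISIBLE_DEVICES"
```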
Hi,
I have been trying to train different models on multiple GPUs. When I start the programs at the same time, everything works fine. However, if an already running program has allocated memory on one GPU, I get a core dump when I try to start another program on a different GPU (I use Knet.gpu(id) to select the GPU).
I tried running TensorFlow and PyTorch and got the same error, but they work if I set the CUDA_VISIBLE_DEVICES environment variable. That is not the case with Knet: even if I set the variable to 3, Knet.gpu() returns 0, and changing the GPU with Knet.gpu(id) fails.
Here is the error when CUDA_VISIBLE_DEVICES is not set:
And this is what happens when CUDA_VISIBLE_DEVICES is set:
GPU status