FluxML / Torch.jl

Sensible extensions for exposing torch in Julia.

Error upon installing #42

ViralBShah opened this issue 3 years ago (status: Open)

ViralBShah commented 3 years ago

Using Julia 1.5.3 on a computer with a GPU:

julia> using Torch
[ Info: Precompiling Torch [6a2ea274-3061-11ea-0d63-ff850051a295]
ERROR: LoadError: InitError: could not load library "/home/viralbshah/.julia/artifacts/d6ce2ca09ab00964151aaeae71179deb8f9800d1/lib/libdoeye_caml.so"
libcublas.so.10: cannot open shared object file: No such file or directory
Stacktrace:
 [1] dlopen(::String, ::UInt32; throw_error::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Libdl/src/Libdl.jl:109
 [2] dlopen at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Libdl/src/Libdl.jl:109 [inlined] (repeats 2 times)
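
The missing libcublas.so.10 is a CUDA 10.x runtime library that the prebuilt artifact expects to find on the system. A quick way to list every unresolved dependency, assuming a Linux machine with ldd available (the artifact path is the one from the error above), is a sketch like:

artifact_lib = joinpath(homedir(), ".julia", "artifacts",
                        "d6ce2ca09ab00964151aaeae71179deb8f9800d1",
                        "lib", "libdoeye_caml.so")
run(`ldd $artifact_lib`)   # dependencies printed as "... => not found" are the ones the loader cannot resolve
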
ViralBShah commented 3 years ago

Perhaps the same as #32?

k8lion commented 3 years ago

I am having the same issue both locally and on a cluster. Both have Julia 1.5.3 and CUDA 11.0.

julia> using Torch
[ Info: Precompiling Torch [6a2ea274-3061-11ea-0d63-ff850051a295]
ERROR: LoadError: InitError: could not load library "/tmpdir/maile/.julia/artifacts/d6ce2ca09ab00964151aaeae71179deb8f9800d1/lib/libdoeye_caml.so"
libcufft.so.10: cannot open shared object file: No such file or directory
Stacktrace:
 [1] dlopen(::String, ::UInt32; throw_error::Bool) at /usr/local/julia/julia-1.5.3/share/julia/stdlib/v1.5/Libdl/src/Libdl.jl:109
 [2] dlopen at /usr/local/julia/julia-1.5.3/share/julia/stdlib/v1.5/Libdl/src/Libdl.jl:109 [inlined] (repeats 2 times)

Trying on CUDA 10.1 yields a similar error:

julia> using Torch
[ Info: Precompiling Torch [6a2ea274-3061-11ea-0d63-ff850051a295]
ERROR: LoadError: InitError: could not load library "/tmpdir/maile/.julia/artifacts/d6ce2ca09ab00964151aaeae71179deb8f9800d1/lib/libdoeye_caml.so"
/lib64/libm.so.6: version `GLIBC_2.23' not found (required by /tmpdir/maile/.julia/artifacts/d6ce2ca09ab00964151aaeae71179deb8f9800d1/lib/libtorch.so)
Stacktrace:
 [1] dlopen(::String, ::UInt32; throw_error::Bool) at /usr/local/julia/julia-1.5.3/share/julia/stdlib/v1.5/Libdl/src/Libdl.jl:109
 [2] dlopen at /usr/local/julia/julia-1.5.3/share/julia/stdlib/v1.5/Libdl/src/Libdl.jl:109 [inlined] (repeats 2 times)
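
This second failure is a different problem: the bundled libtorch.so was built against a newer glibc than the cluster provides. A small sanity check, assuming a Linux system where ldd ships with glibc:

run(`ldd --version`)   # the first line reports the system glibc version; the error above requires at least GLIBC_2.23
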
DhairyaLGandhi commented 3 years ago

What does versioninfo() show?

k8lion commented 3 years ago

Locally:

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1* (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-7440HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-10.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = atom  -a
  JULIA_NUM_THREADS = 4

On the cluster:

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake-avx512)
Environment:
  JULIA_DEPOT_PATH = /tmpdir/maile/.julia
k8lion commented 3 years ago

My first errors were produced with the latest tagged release. On master locally, I get

libcublas.so.10: cannot open shared object file: No such file or directory

On master on the cluster, the errors are the same.

DhairyaLGandhi commented 3 years ago

That looks like an issue with the local CUDA setup. We should really just set up lazy artifacts to make these errors go away entirely.
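
For reference, a minimal sketch of the lazy-artifact pattern in Julia's Artifacts system; the entry name "Torch" below is a placeholder for illustration and assumes a corresponding Artifacts.toml entry marked lazy = true, not the actual Torch_jll configuration:

using Artifacts, LazyArtifacts
torch_root = artifact"Torch"   # with a lazy entry, the binary is downloaded on first use rather than at install time
lib_path = joinpath(torch_root, "lib", "libdoeye_caml.so")

With a setup like this, users whose CUDA installation cannot load the library would at least not pay the download cost up front, and the failure could be surfaced with a friendlier message at first use.
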

PerezHz commented 3 years ago

Hit the same issue (I think) on a GPU machine with Julia 1.6.1 and a fresh environment:

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, broadwell)

(@v1.6) pkg> st
      Status `~/.julia/environments/v1.6/Project.toml`
  [052768ef] CUDA v3.3.3
  [587475ba] Flux v0.12.4
  [7073ff75] IJulia v1.23.2
  [6a2ea274] Torch v0.1.2

julia> using CUDA; CUDA.versioninfo()
CUDA toolkit 11.3.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.73.1

Libraries: 
- CUBLAS: 11.5.1
- CURAND: 10.2.4
- CUFFT: 10.4.2
- CUSOLVER: 11.1.2
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+460.73.1
- CUDNN: 8.20.0 (for CUDA 11.3.0)
- CUTENSOR: 1.3.0 (for CUDA 11.2.0)

Toolchain:
- Julia: 1.6.1
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: Tesla T4 (sm_75, 14.414 GiB / 14.756 GiB available)

julia> using Torch
[ Info: Precompiling Torch [6a2ea274-3061-11ea-0d63-ff850051a295]
ERROR: LoadError: InitError: could not load library "/home/jupyter/.julia/artifacts/d6ce2ca09ab00964151aaeae71179deb8f9800d1/lib/libdoeye_caml.so"
libcublas.so.10: cannot open shared object file: No such file or directory
Stacktrace:
  [1] dlopen(s::String, flags::UInt32; throw_error::Bool)
    @ Base.Libc.Libdl ./libdl.jl:114
  [2] dlopen (repeats 2 times)
    @ ./libdl.jl:114 [inlined]
  [3] __init__()
    @ Torch_jll ~/.julia/packages/Torch_jll/sFQc0/src/wrappers/x86_64-linux-gnu-cxx11.jl:57
  [4] _include_from_serialized(path::String, depmods::Vector{Any})
    @ Base ./loading.jl:674
  [5] _require_search_from_serialized(pkg::Base.PkgId, sourcepath::String)
    @ Base ./loading.jl:760
  [6] _require(pkg::Base.PkgId)
    @ Base ./loading.jl:998
  [7] require(uuidkey::Base.PkgId)
    @ Base ./loading.jl:914
  [8] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:901
  [9] include
    @ ./Base.jl:386 [inlined]
 [10] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing)
    @ Base ./loading.jl:1213
 [11] top-level scope
    @ none:1
 [12] eval
    @ ./boot.jl:360 [inlined]
 [13] eval(x::Expr)
    @ Base.MainInclude ./client.jl:446
 [14] top-level scope
    @ none:1
during initialization of module Torch_jll
in expression starting at /home/jupyter/.julia/packages/Torch/fIKJf/src/Torch.jl:1
ERROR: Failed to precompile Torch [6a2ea274-3061-11ea-0d63-ff850051a295] to /home/jupyter/.julia/compiled/v1.6/Torch/jl_Yw2dNx.
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::Base.TTY, internal_stdout::Base.TTY)
   @ Base ./loading.jl:1360
 [3] compilecache(pkg::Base.PkgId, path::String)
   @ Base ./loading.jl:1306
 [4] _require(pkg::Base.PkgId)
   @ Base ./loading.jl:1021
 [5] require(uuidkey::Base.PkgId)
   @ Base ./loading.jl:914
 [6] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:901
 [7] top-level scope
   @ ~/.julia/packages/CUDA/02Kjq/src/initialization.jl:52

Is there any recommended workaround?

LeeLizuoLiu commented 2 years ago

For this issue, one workaround you could try is to symlink an existing copy of the missing CUDA library into the artifact directory, e.g.:

ln -s ~/path/to/libcublas.so.10 /home/jupyter/.julia/artifacts/d6ce2ca09ab00964151aaeae71179deb8f9800d1/lib/libcublas.so.10
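
Expanding on that, a sketch of the same workaround done from within Julia; the source path below is only an example of where a CUDA 10.x installation might keep libcublas.so.10 and should be adjusted to your system:

src = "/usr/local/cuda-10.1/lib64/libcublas.so.10"   # example location, adjust to your CUDA install
dst = joinpath(homedir(), ".julia", "artifacts",
               "d6ce2ca09ab00964151aaeae71179deb8f9800d1", "lib",
               "libcublas.so.10")
isfile(src) || error("libcublas.so.10 not found at $src; point src at your CUDA 10.x libraries")
ispath(dst) || symlink(src, dst)   # place the link inside the artifact's lib directory so the loader can find it

A common alternative with the same effect is to put the directory that contains libcublas.so.10 on LD_LIBRARY_PATH before starting Julia, so the dynamic loader can resolve it without modifying the artifact directory.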