JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

virtualgl/vglrun + 1.9rc2 => fatal error could not load library libopenblas64_.so #49246

Open behinger opened 1 year ago

behinger commented 1 year ago

When I try to run Julia with vglrun (via VirtualGL on a headless server, with virtual displays provided by NoMachine), I get a crash in Julia 1.9.0-rc1 and 1.9.0-rc2, but not in Julia 1.8.3. This is on an Ubuntu 22 installation. Without VirtualGL, everything works as intended.

It throws: could not load library "libopenblas64_.so". I don't know how to diagnose this further.

 vglrun ./julia
fatal: error thrown and no exception handler available.
InitError(mod=:OpenBLAS_jll, error=ErrorException("could not load library "libopenblas64_.so"
libopenblas64_.so: cannot open shared object file: No such file or directory"))
ijl_errorf at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/rtutils.c:77
ijl_load_dynamic_library at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/dlload.c:369
#dlopen#3 at ./libdl.jl:117
dlopen at ./libdl.jl:116 [inlined]
dlopen at ./libdl.jl:116 [inlined]
__init__ at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/OpenBLAS_jll/src/OpenBLAS_jll.jl:53
jfptr___init___57050.clone_1 at /home/ehinger/Downloads/julia-1.9.0-rc2/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
jl_module_run_initializer at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/toplevel.c:75
_finish_julia_init at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/init.c:850
julia_init at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/init.c:799
jl_repl_entrypoint at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/jlapi.c:711
main at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/cli/loader_exe.c:59
unknown function (ip: 0x7f73e2401d8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401098)
  1. The output of versioninfo()
    
    Julia Version 1.9.0-rc2
    Commit 72aec423c2a (2023-04-01 10:41 UTC)
    Platform Info:
    OS: Linux (x86_64-linux-gnu)
    CPU: 128 × AMD EPYC 7452 32-Core Processor
    WORD_SIZE: 64
    LIBM: libopenlibm
    LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
    Threads: 1 on 128 virtual cores
    Environment:
    JULIA_DEPOT_PATH = ~/.julia
    LD_PRELOAD = 
  2. How you installed Julia

    wget + tar
  3. A minimal working example (MWE), also known as a minimum reproducible example

    ~/julia-1.9.0-rc2/bin ❯ /vglrun ./julia

vtjnash commented 1 year ago

Try running with the LD_DEBUG environment variable set and see if that helps.
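A minimal sketch of one way to do that, assuming a glibc-based Linux system, where the dynamic loader reads LD_DEBUG and writes its trace to stderr. The helper name tried_paths is made up for illustration; it only filters a saved trace down to the candidate paths the loader tried for one library.

```shell
# Hypothetical helper: read an LD_DEBUG trace on stdin and print every
# candidate path the loader tried for the library named in $1.
tried_paths() {
    grep -F 'trying file=' | grep -F -- "$1" | sed 's/.*trying file=//'
}

# Usage, assuming vglrun and ./julia as in the report above (not run here):
#   LD_DEBUG=libs,files vglrun ./julia 2> ld_debug.log
#   tried_paths libopenblas64_.so < ld_debug.log
```

An empty result would suggest the loader never searched for the library by name, for example because it was requested by full path.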

behinger commented 1 year ago

    1145047:
    1145047:     file=libopenblas64_.so [0];  dynamically loaded by /lib/libvglfaker.so [0]
    1145047:     find library=libopenblas64_.so [0]; searching
    1145047:      search cache=/etc/ld.so.cache
    1145047:      search path=/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/lib:/usr/lib              (system search path)
    1145047:       trying file=/lib/x86_64-linux-gnu/libopenblas64_.so
    1145047:       trying file=/usr/lib/x86_64-linux-gnu/libopenblas64_.so
    1145047:       trying file=/lib/libopenblas64_.so
    1145047:       trying file=/usr/lib/libopenblas64_.so
    1145047:
    fatal: error thrown and no exception handler available.
    InitError(mod=:OpenBLAS_jll, error=ErrorException("could not load library "libopenblas64_.so"
    libopenblas64_.so: cannot open shared object file: No such file or directory"))

The above is what I get immediately before the crash in Julia 1.9.0-rc2.

Whereas this is what happens in Julia 1.8.3:

    1148742:     file=/opt/julia-1.8.3/bin/../lib/julia/libopenblas64_.so [0];  dynamically loaded by /lib/libvglfaker.so [0]
    1148742:     file=/opt/julia-1.8.3/bin/../lib/julia/libopenblas64_.so [0];  generating link map
    1148742:       dynamic: 0x00007f33c795da80  base: 0x00007f33c5b85000   size: 0x0000000001e7a2a8
    1148742:         entry: 0x00007f33c5cb5000  phdr: 0x00007f33c5b85040  phnum:                 11

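The two traces differ in the name that reaches dlopen: a bare library name in 1.9.0-rc2 and a full path in 1.8.3. Per dlopen(3), a name containing a slash is opened as a path, while a bare name goes through the loader's search order, which in the failing trace covered only the system directories. A small sketch for pulling that distinction out of a trace follows; the function name dlopen_requests is hypothetical, and the pattern assumes the "file=... dynamically loaded by" line format shown in the traces.

```shell
# Hypothetical helper: read an LD_DEBUG trace on stdin and print each name
# passed to dlopen, tagged as a path (contains a slash, opened directly)
# or a bare name (resolved through the loader's search order).
dlopen_requests() {
    sed -n 's/.*file=\([^ ]*\) .*dynamically loaded by.*/\1/p' |
    while IFS= read -r name; do
        case $name in
            */*) printf 'path %s\n' "$name" ;;
            *)   printf 'bare %s\n' "$name" ;;
        esac
    done
}

# Usage, assuming a trace saved to ld_debug.log (not run here):
#   dlopen_requests < ld_debug.log
```
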
vtjnash commented 1 year ago

It looks like libvglfaker.so may be dynamically replacing dlopen with a broken version. I am not sure we can do much about that. You might be able to get something mostly working by setting LD_LOAD_PATH.

behinger commented 1 year ago

OK, I added the lib/julia folder to LD_LIBRARY_PATH (not LD_LOAD_PATH, which was probably a mix-up with JULIA_LOAD_PATH?). That fixed it, and I can now start Julia 1.9 :)

But I still wonder why this is necessary in Julia 1.9 but not in Julia 1.8.3.
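For reference, a sketch of that workaround as a shell function. The function name julia_ld_library_path and the install location in the usage line are illustrative assumptions; the lib/julia directory under the Julia installation root matches the bin/../lib/julia path in the 1.8.3 trace.

```shell
# Hypothetical helper: print an LD_LIBRARY_PATH value that puts Julia's
# bundled library directory first. $1 is the Julia installation root.
julia_ld_library_path() {
    # Append the existing value only when it is set and non-empty, so the
    # result never contains an empty entry.
    printf '%s\n' "$1/lib/julia${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# Usage, with an assumed install location (not run here):
#   LD_LIBRARY_PATH=$(julia_ld_library_path "$HOME/julia-1.9.0-rc2") vglrun ./julia
```

Setting the variable only for the one command keeps the change out of the rest of the shell session. The `${VAR:+...}` guard is there because glibc's loader treats an empty entry in LD_LIBRARY_PATH as the current directory.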

abastola0 commented 4 months ago

I experienced the same issue with PyTorch when using vglrun. Everything worked fine if I didn't use vglrun. I was running in a conda environment and saw a similar error while debugging. Initially I got something like this:

    return torch.linalg.cholesky_ex(value).info.eq(0)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Error in dlopen: libtorch_cuda_linalg.so: cannot open shared object file: No such file or directory

I fixed it like this:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/my/virtual/env/lib/python3.11/site-packages/torch/lib/

I couldn't find any other way around this.