JuliaPy / PyCall.jl

Package to call Python functions from the Julia language
MIT License

loading pytorch causes invalid pointer crash on free() #973

Open sneiman opened 2 years ago

sneiman commented 2 years ago

I'm working on calling PyTorch models from Julia using PyCall, and I get a consistent crash when Julia exits: free(): invalid pointer.

Easily reproduced:

julia> using PyCall
julia> @pyimport torch

Ctrl-d to leave Julia ...

julia>
free(): invalid pointer

signal (6): Aborted
in expression starting at none:0
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7faf5a64b29d)
unknown function (ip: 0x7faf5a65332b)
unknown function (ip: 0x7faf5a654b5b)
_ZN4llvm2cl3optINS_15FunctionSummary23ForceSummaryHotnessTypeELb1ENS0_6parserIS3_EEED2Ev at /usr/local/julia-1.7.2/bin/../lib/julia/libLLVM-12jl.so (unknown line)
__cxa_finalize at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
__do_global_dtors_aux at /usr/local/julia-1.7.2/bin/../lib/julia/libLLVM-12jl.so (unknown line)
_fini at /usr/local/julia-1.7.2/bin/../lib/julia/libLLVM-12jl.so (unknown line)
unknown function (ip: 0x7faf5a6048d6)
exit at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
main at julia (unknown line)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x400808)
Allocations: 2632920 (Pool: 2632043; Big: 877); GC: 3

Environment: Ubuntu 20.04, Python 3.8.10, Julia 1.7.2, and PyTorch 1.10.2+cu113.

Any help appreciated ...

philippwitte commented 2 years ago

Any updates on this? I'm having the same problem and this is a big show stopper. Any help would be appreciated!

Edit: It appears that this is a PyTorch issue, not a PyCall one. Once I downgrade to PyTorch 1.9.0, everything works fine.

sneiman commented 2 years ago

I never found a solution.

eahenle commented 2 years ago

It would be nice to get a workaround on this... Downgrading to an obsolete version of PyTorch isn't an option for me, and this is breaking my CI.

And this is not a PyTorch issue. Doing import torch in Python doesn't cause this error, but doing @pyimport torch in Julia does. The problem must come about as a result of how PyCall interacts with PyTorch.

AnnaZav commented 1 year ago

I have the same problem.

free(): invalid pointer

signal (6): Aborted
in expression starting at none:0
pthread_kill at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
raise at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7f27d0d346f5)
unknown function (ip: 0x7f27d0d4bd7b)
unknown function (ip: 0x7f27d0d4dac3)
free at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_ZN4llvm2cl3optINS_15FunctionSummary23ForceSummaryHotnessTypeELb1ENS0_6parserIS3_EEED2Ev at /home/anna/julia-1.7.3/bin/../lib/julia/libLLVM-12jl.so (unknown line)
__cxa_finalize at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
__do_global_dtors_aux at /home/anna/julia-1.7.3/bin/../lib/julia/libLLVM-12jl.so (unknown line)
_fini at /home/anna/julia-1.7.3/bin/../lib/julia/libLLVM-12jl.so (unknown line)
unknown function (ip: 0x7f27d0cf0494)
exit at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
main at /buildworker/worker/package_linux64/build/cli/loader_exe.c:45
unknown function (ip: 0x7f27d0cd4d8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at /home/anna/julia-1.7.3/bin/julia (unknown line)
Allocations: 86792033 (Pool: 86759892; Big: 32141); GC: 50

Environment: Linux Mint 21, Julia 1.7.3, and Conda.jl with Python 3.9.13 and torch 1.12.0. I found what looks like the same problem in https://github.com/JuliaLang/julia/issues/44242

GunnarFarneback commented 1 year ago

That Julia issue is very unlikely to be related. It is crashing with the same message but the backtrace is not at all similar.

This problem seems more likely to be some kind of LLVM clash between Julia and PyTorch. You can get a somewhat more readable backtrace with gdb:

$ gdb --quiet julia
Reading symbols from julia...
(gdb) set args -e 'using PyCall; pyimport("torch")'
(gdb) run
Starting program: /home/gunnar/.local/bin/julia -e 'using PyCall; pyimport("torch")'
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[...cutting out some thread related messages...]
free(): invalid pointer

Thread 1 "julia" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7da1859 in __GI_abort () at abort.c:79
#2  0x00007ffff7e0c26e in __libc_message (action=action@entry=do_abort, 
    fmt=fmt@entry=0x7ffff7f36298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007ffff7e142fc in malloc_printerr (
    str=str@entry=0x7ffff7f344c1 "free(): invalid pointer") at malloc.c:5347
#4  0x00007ffff7e15b2c in _int_free (av=<optimized out>, p=<optimized out>, have_lock=0)
    at malloc.c:4173
#5  0x00007ffff4761289 in llvm::cl::opt<llvm::FunctionSummary::ForceSummaryHotnessType, true, llvm::cl::parser<llvm::FunctionSummary::ForceSummaryHotnessType> >::~opt() ()
   from /home/gunnar/julia1.7/bin/../lib/julia/libLLVM-12jl.so
#6  0x00007ffff7dc5fde in __cxa_finalize (d=0x7ffff6c066e0) at cxa_finalize.c:83
#7  0x00007ffff30e3b26 in __do_global_dtors_aux ()
   from /home/gunnar/julia1.7/bin/../lib/julia/libLLVM-12jl.so
#8  0x00007ffff7ffd060 in ?? () from /lib64/ld-linux-x86-64.so.2
#9  0x0000000000000000 in ?? ()

As a wild guess, the clash was introduced in PyTorch 1.10 with https://pytorch.org/blog/pytorch-1.10-released/#beta-cpu-fusion

GunnarFarneback commented 1 year ago

The problem seems to be that site-packages/torch/lib/libtorch_cpu.so contains a statically linked copy of LLVM, which clashes with Julia's LLVM library. This can be worked around, at least on Linux, by doing

using PyCall
pyimport("sys").setdlopenflags(10)
@pyimport torch

or more thoroughly with

using PyCall
@pyimport os
pyimport("sys").setdlopenflags(os.RTLD_NOW | os.RTLD_DEEPBIND)
@pyimport torch

(Note that Julia's Libdl.RTLD_DEEPBIND and the related Libdl constants use different values than Python's os module, so they can't be used here.)
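To make the relationship between the two workarounds concrete, here is a small sketch in plain Python (not run through PyCall) that prints the values of the flags involved. The helper name `deepbind_flags` is made up for illustration. On x86-64 Linux with glibc I would expect RTLD_NOW to be 2 and RTLD_DEEPBIND to be 8, which is where the literal 10 in the shorter variant comes from, but the values are platform-defined, so treat that as an assumption to verify on your own system.

```python
import os
import sys


def deepbind_flags():
    """Return RTLD_NOW | RTLD_DEEPBIND, or None if the platform lacks RTLD_DEEPBIND."""
    if not hasattr(os, "RTLD_DEEPBIND"):
        return None  # e.g. macOS; the workaround above is Linux-specific
    return os.RTLD_NOW | os.RTLD_DEEPBIND


flags = deepbind_flags()
if flags is not None:
    print("RTLD_NOW      =", os.RTLD_NOW)
    print("RTLD_DEEPBIND =", os.RTLD_DEEPBIND)
    print("combined      =", flags)

# sys.getdlopenflags() exists only on Unix-like platforms.
if hasattr(sys, "getdlopenflags"):
    print("current dlopen flags =", sys.getdlopenflags())
```

Using the named constants rather than the literal keeps the workaround readable and avoids hard-coding a value that could differ on another platform.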

terasakisatoshi commented 1 year ago

Let's use Julia 1.8.0 !!! It seems our issue is solved.

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.0 (2022-08-17)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

(@v1.8) pkg> st PyCall
Status `~/.julia/environments/v1.8/Project.toml`
  [438e738f] PyCall v1.93.1

julia> using PyCall

julia> torch = pyimport("torch")
PyObject <module 'torch' from '/opt/conda/lib/python3.7/site-packages/torch/__init__.py'>

julia> torch.__version__
"1.11.0"

julia> versioninfo() # tested on GCP VM instance with NVIDIA Tesla T4 enabled
Julia Version 1.8.0
Commit 5544a0fab76 (2022-08-17 13:38 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Xeon(R) CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
  Threads: 1 on 4 virtual cores

julia> exit() # no problem !!!

eahenle commented 1 year ago

Quoting terasakisatoshi: "Let's use Julia 1.8.0 !!! It seems our issue is solved. ... julia> exit() # no problem !!!"

Ah, nice. Not surprised, given the LLVM-related changes in 1.8. I guess this issue can be closed?

GunnarFarneback commented 1 year ago

It's great that Julia 1.8 avoids the immediate problem (the crash on exit). But unless someone has evidence that this isn't just a lucky consequence of bumping LLVM from version 12 to 13, I wouldn't be surprised if the crash reappears when PyTorch makes a corresponding LLVM bump, or if the PyTorch CPU JIT functionality turns out not to work when loaded from Julia.
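For anyone who wants to check which of the libraries discussed in this thread share a process, a rough diagnostic sketch follows. It assumes Linux, where /proc/self/maps lists the files mapped into the current process. The helper name `shared_objects` and the search pattern are hypothetical. A statically linked LLVM inside libtorch_cpu.so will not show up as a separate file, so this can only confirm that libtorch_cpu.so and Julia's libLLVM are loaded side by side. It does not prove a symbol clash.

```python
import os
import re


def shared_objects(maps_text, pattern):
    """Return sorted unique paths of mapped files whose path matches pattern."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, optional pathname
        fields = line.split(None, 5)
        if len(fields) == 6 and re.search(pattern, fields[5]):
            paths.add(fields[5].strip())
    return sorted(paths)


if __name__ == "__main__" and os.path.exists("/proc/self/maps"):
    with open("/proc/self/maps") as f:
        for path in shared_objects(f.read(), r"LLVM|torch"):
            print(path)
```

To be informative this has to run inside the Julia process, for example from Python code executed through PyCall after importing torch. In a standalone Python interpreter, Julia's libLLVM is not loaded at all.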