LaurentMazare / ocaml-torch

OCaml bindings for PyTorch
Apache License 2.0

Programs that use ocaml-torch with GPU acceleration segfault right before terminating #43

Open jonathan-laurent opened 4 years ago

jonathan-laurent commented 4 years ago

Programs I write using ocaml-torch that use GPU acceleration segfault right before terminating:

Segmentation fault (core dumped)

This is not a huge deal as it happens when the program is about to terminate anyway but I was wondering if you had observed the same phenomenon.

In particular, I replicated the problem on your mnist/conv and char_rnn examples.

LaurentMazare commented 4 years ago

That's strange, I don't get any such error. Does it also happen when running examples/basics/basics.exe? Could you try running it within gdb if you have it installed?

jonathan-laurent commented 4 years ago

The bug does not happen with basics/basics.exe.

I ran the mnist/conv.exe example within GDB and got the following backtrace:

(gdb) run
Starting program: /home/jonathan/neurarith/_build/default/deps/ocaml-torch/examples/mnist/conv.exe 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fff9bcdd700 (LWP 8205)]
[New Thread 0x7fff9a5e6700 (LWP 8206)]
[New Thread 0x7fff99de5700 (LWP 8207)]
[New Thread 0x7fff995e4700 (LWP 8208)]
[New Thread 0x7fff98de3700 (LWP 8209)]
[New Thread 0x7fff93fff700 (LWP 8210)]
[New Thread 0x7fff91a15700 (LWP 8211)]
[New Thread 0x7fff91214700 (LWP 8212)]
[New Thread 0x7fff90a13700 (LWP 8213)]
[New Thread 0x7fff5dfff700 (LWP 8214)]
50 0.268205 94.07%
...
4950 0.000706 99.09%
5000 0.005436 98.99%
[Thread 0x7fff90a13700 (LWP 8213) exited]
[Thread 0x7fff5dfff700 (LWP 8214) exited]

Thread 1 "conv.exe" received signal SIGSEGV, Segmentation fault.
0x00007fffe817c23e in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2

(gdb) backtrace
#0  0x00007fffe817c23e in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#1  0x00007fffe818170b in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#2  0x00007fffe81ae2d0 in cudaStreamDestroy () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#3  0x00007fffa7fdb51d in cudnnDestroy () from /home/jonathan/Software/libtorch/lib/libtorch_cuda.so
#4  0x00007fffa72dda05 in at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cudnnContext*, &at::native::(anonymous namespace)::createCuDNNHandle, &at::native::(anonymous namespace)::destroyCuDNNHandle>::~DeviceThreadHandlePool() () from /home/jonathan/Software/libtorch/lib/libtorch_cuda.so
#5  0x00007fffe5a22615 in __cxa_finalize (d=0x7fffe1e763c0) at cxa_finalize.c:83
#6  0x00007fffa4b3ea83 in __do_global_dtors_aux () from /home/jonathan/Software/libtorch/lib/libtorch_cuda.so
#7  0x00007fffffffdad0 in ?? ()
#8  0x00007ffff7de5b73 in _dl_fini () at dl-fini.c:138
Backtrace stopped: frame did not save the PC

I suspect this is not very useful as GDB is missing some debug symbols. Would you be able to recommend some build options to get a more useful backtrace?

zbroyar commented 4 years ago

I had a similar issue and cured it by calling Caml.Gc.full_major() after each epoch.
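For readers unfamiliar with the pattern, here is a minimal sketch of forcing a full major collection at the end of every epoch. It deliberately avoids ocaml-torch so that it runs on its own: train_one_epoch and run are hypothetical stand-ins, not functions from the library, and the sketch calls the standard library's Gc.full_major directly (Caml.Gc.full_major, as written above, reaches the same function through the Caml alias for the standard library).

```ocaml
(* Hypothetical stand-in for one epoch of training. A real program would run
   its ocaml-torch forward and backward passes here instead. *)
let train_one_epoch epoch =
  (* Allocate short-lived data so the collector has something to reclaim. *)
  let scratch = Array.init 1_000 (fun i -> float_of_int (i * epoch)) in
  Array.fold_left ( +. ) 0. scratch

(* Run [epochs] epochs, forcing a full major collection after each one. *)
let run ~epochs =
  let total = ref 0. in
  for epoch = 1 to epochs do
    total := !total +. train_one_epoch epoch;
    (* Collect everything that became unreachable during this epoch, so that
       finalisers attached to dead values run before the next epoch starts. *)
    Gc.full_major ()
  done;
  !total

let () = Printf.printf "accumulated value: %.1f\n" (run ~epochs:3)
```

The collection is placed at the end of the loop body so that values created during one epoch are reclaimed before the next one allocates more. Whether this helps in a given program presumably depends on how much memory is held by values that are already unreachable.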

jonathan-laurent commented 4 years ago

I have also observed that not calling the GC often enough during training can result in segfaults, but I suspect the problem is different here. For example, mnist/conv.exe already calls the GC after each epoch but still displays the problem on my machine.

Kwonsoo commented 3 years ago

Hi all,

I also ran into a segmentation fault and spent some time over the past two days trying to resolve it.

I was running some optimization procedures of my own (not the basic examples from this repository), which I cannot share here. After several epochs, the process terminated with the Segmentation fault (core dumped) error message. In /var/log/syslog, I found relevant lines like the following:

Dec 17 12:08:55 h02 kernel: [1880432.552416] main.exe[29569]: segfault at 7fd86f5db908 ip 00007fd86153e7ee sp 00007fff73c98f00 error 4 in libtorch.so[7fd86037b000+e8d1000]

or

Dec 17 11:34:34 h02 kernel: [1878371.127518] traps: main.exe[18968] general protection ip:14bc7b71d19c sp:7ffd41c02ed8 error:0 in libc10.so[14bc7b701000+43000]
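As a side note on reading such kernel lines: the ip field is the faulting instruction address, and the bracket after the library name gives the mapping's base address and size, so subtracting the base from ip gives the offset of the fault inside the library. The sketch below does that arithmetic for the libtorch.so line above. The function name offset_in_library is made up for illustration, and the offset is only meaningful for the exact libtorch.so build that produced the log.

```ocaml
(* Offset of a faulting instruction inside a shared library, computed from the
   fields of a kernel "segfault at ..." log line. Both arguments are
   hexadecimal strings without the "0x" prefix, as they appear in the log. *)
let offset_in_library ~ip ~base =
  Int64.sub (Int64.of_string ("0x" ^ ip)) (Int64.of_string ("0x" ^ base))

let () =
  (* Values copied from the libtorch.so log line quoted above. *)
  let off = offset_in_library ~ip:"00007fd86153e7ee" ~base:"7fd86037b000" in
  Printf.printf "offset in libtorch.so: 0x%Lx\n" off
```

If the library was built with debug symbols, an offset like this can usually be passed to a tool such as addr2line to find the corresponding function. Without symbols it at least tells whether two crashes happened at the same place.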

Before running dune exec ..., I tried the following:

(1) ran ulimit -s unlimited
(2) ran sudo apt update; sudo apt upgrade
(3) ran opam update; opam upgrade

Trials (1) and (2) did not help, but trial (3) resolved the issue. I think upgrading the libtorch version helped. For completeness, here is what opam upgrade did on my computer:

The following actions will be performed:
  ↗ upgrade   num                     1.3 to 1.4
  ↗ upgrade   dune                    2.4.0 to 2.7.1
  ↗ upgrade   conf-openblas           0.2.0 to 0.2.1
  ↗ upgrade   conf-pkg-config         1.1 to 1.3
  ↗ upgrade   libtorch                1.4.0 to 1.7.0+linux-x86_64
  ↗ upgrade   topkg                   1.0.1 to 1.0.3
  ↗ upgrade   batteries               3.0.0 to 3.2.0
  ∗ install   trie                    1.0.0                       [required by mew]
  ∗ install   octavius                1.2.2                       [required by ppx_js_style]
  ∗ install   jane-street-headers     v0.14.0                     [required by time_now]
  ↗ upgrade   sexplib0                v0.13.0 to v0.14.0
  ↗ upgrade   ocaml-migrate-parsetree 1.6.0 to 2.1.0
  ↗ upgrade   ocaml-compiler-libs     v0.12.1 to v0.12.3
  ↗ upgrade   integers                0.3.0 to 0.4.0
  ↗ upgrade   dune-private-libs       2.4.0 to 2.7.1
  ↻ recompile stdlib-shims            0.1.0                       [uses dune]
  ↻ recompile result                  1.5                         [uses dune]
  ↻ recompile re                      1.9.0                       [uses dune]
  ↻ recompile ppx_derivers            1.2.1                       [uses dune]
  ↻ recompile npy                     0.0.9                       [uses dune]
  ↻ recompile mmap                    1.1.0                       [uses dune]
  ↻ recompile csv                     2.4                         [uses dune]
  ↻ recompile cppo                    1.6.6                       [uses dune]
  ↻ recompile camomile                1.0.2                       [uses dune]
  ∗ install   conf-libffi             2.0.0                       [required by ctypes-foreign]
  ↗ upgrade   astring                 0.8.3 to 0.8.5
  ↻ recompile b0                      0.0.1                       [uses topkg]
  ↻ recompile owl-base                0.9.0                       [uses dune]
  ∗ install   mew                     0.1.0                       [required by mew_vi]
  ∗ install   csexp                   1.3.2                       [required by dune-configurator]
  ↻ recompile tyxml                   4.4.0                       [uses dune]
  ↗ upgrade   ppxlib                  0.12.0 to 0.17.0
  ↗ upgrade   ocplib-endian           1.0 to 1.1
  ↻ recompile charInfo_width          1.1.0                       [uses dune]
  ↻ recompile ctypes-foreign          0.4.0                       [upstream changes]
  ↗ upgrade   fpath                   0.7.2 to 0.7.3
  ∗ install   mew_vi                  0.5.0                       [required by lambda-term]
  ↗ upgrade   dune-configurator       2.4.0 to 2.7.1
  ↗ upgrade   zed                     2.0.6 to 3.1.0
  ↻ recompile ctypes                  0.17.1                      [uses integers, conf-pkg-config, ctypes-foreign]
  ↗ upgrade   odoc                    1.5.0 to 1.5.2
  ↗ upgrade   lwt                     5.2.0 to 5.3.0
  ↗ upgrade   base                    v0.13.1 to v0.14.0
  ↗ upgrade   eigen                   0.2.0 to 0.3.0
  ↻ recompile odig                    0.0.5                       [uses odoc, topkg]
  ↻ recompile lwt_react               1.1.3                       [uses dune, lwt]
  ↻ recompile lwt_log                 1.1.1                       [uses dune, lwt]
  ∗ install   ppx_js_style            v0.14.0                     [required by ppx_base]
  ∗ install   ppx_enumerate           v0.14.0                     [required by ppx_base]
  ↗ upgrade   variantslib             v0.13.0 to v0.14.0
  ↗ upgrade   stdio                   v0.13.0 to v0.14.0
  ↗ upgrade   ppx_sexp_conv           v0.13.0 to v0.14.1
  ↗ upgrade   ppx_here                v0.13.0 to v0.14.0
  ↗ upgrade   ppx_compare             v0.13.0 to v0.14.0
  ↗ upgrade   ppx_cold                v0.13.0 to v0.14.0
  ↗ upgrade   parsexp                 v0.13.0 to v0.14.0
  ↗ upgrade   fieldslib               v0.13.0 to v0.14.0
  ↗ upgrade   lambda-term             2.0.3 to 3.1.0
  ↗ upgrade   ppx_variants_conv       v0.13.0 to v0.14.1
  ∗ install   ppx_optcomp             v0.14.0                     [required by time_now]
  ↻ recompile owl                     0.9.0*                      [uses eigen, dune, base, etc.]
  ↗ upgrade   ppx_custom_printf       v0.13.0 to v0.14.0
  ∗ install   ppx_hash                v0.14.0                     [required by ppx_base]
  ↗ upgrade   ppx_assert              v0.13.0 to v0.14.0
  ↗ upgrade   sexplib                 v0.13.0 to v0.14.0
  ↗ upgrade   ppx_fields_conv         v0.13.0 to v0.14.1
  ↗ upgrade   utop                    2.4.3 to 2.6.0
  ∗ install   ppx_base                v0.14.0                     [required by time_now]
  ∗ install   jst-config              v0.14.0                     [required by time_now]
  ∗ install   time_now                v0.14.0                     [required by ppx_inline_test]
  ↗ upgrade   ppx_inline_test         v0.13.0 to v0.14.1
  ↗ upgrade   ppx_expect              v0.13.0 to v0.14.0
  ↗ upgrade   torch                   0.8 to 0.11

I hope this saves people some debugging time in the future.

Thanks, Gwonsoo