jonathan-laurent opened this issue 4 years ago
That's strange, I don't get any such error. Does it also happen when running examples/basics/basics.exe? Could you try running it within gdb if you have it installed?
The bug does not happen with basics/basics.exe. I ran the mnist/conv.exe example within GDB and got the following backtrace:
(gdb) run
Starting program: /home/jonathan/neurarith/_build/default/deps/ocaml-torch/examples/mnist/conv.exe
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fff9bcdd700 (LWP 8205)]
[New Thread 0x7fff9a5e6700 (LWP 8206)]
[New Thread 0x7fff99de5700 (LWP 8207)]
[New Thread 0x7fff995e4700 (LWP 8208)]
[New Thread 0x7fff98de3700 (LWP 8209)]
[New Thread 0x7fff93fff700 (LWP 8210)]
[New Thread 0x7fff91a15700 (LWP 8211)]
[New Thread 0x7fff91214700 (LWP 8212)]
[New Thread 0x7fff90a13700 (LWP 8213)]
[New Thread 0x7fff5dfff700 (LWP 8214)]
50 0.268205 94.07%
...
4950 0.000706 99.09%
5000 0.005436 98.99%
[Thread 0x7fff90a13700 (LWP 8213) exited]
[Thread 0x7fff5dfff700 (LWP 8214) exited]
Thread 1 "conv.exe" received signal SIGSEGV, Segmentation fault.
0x00007fffe817c23e in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
(gdb) backtrace
#0 0x00007fffe817c23e in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#1 0x00007fffe818170b in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#2 0x00007fffe81ae2d0 in cudaStreamDestroy () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#3 0x00007fffa7fdb51d in cudnnDestroy () from /home/jonathan/Software/libtorch/lib/libtorch_cuda.so
#4 0x00007fffa72dda05 in at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cudnnContext*, &at::native::(anonymous namespace)::createCuDNNHandle, &at::native::(anonymous namespace)::destroyCuDNNHandle>::~DeviceThreadHandlePool() () from /home/jonathan/Software/libtorch/lib/libtorch_cuda.so
#5 0x00007fffe5a22615 in __cxa_finalize (d=0x7fffe1e763c0) at cxa_finalize.c:83
#6 0x00007fffa4b3ea83 in __do_global_dtors_aux () from /home/jonathan/Software/libtorch/lib/libtorch_cuda.so
#7 0x00007fffffffdad0 in ?? ()
#8 0x00007ffff7de5b73 in _dl_fini () at dl-fini.c:138
Backtrace stopped: frame did not save the PC
I suspect this is not very useful as GDB is missing some debug symbols. Would you be able to recommend some build options to get a more useful backtrace?
I had a similar issue and fixed it by calling Caml.Gc.full_major () after each epoch.
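For reference, a minimal sketch of that workaround. train_one_epoch is a hypothetical placeholder, not a function from ocaml-torch, and Caml.Gc.full_major is simply the stdlib call reached through Base's Caml alias, so a project that does not open Base can call Gc.full_major directly:

```ocaml
(* Force a full major collection after each epoch so that the finalizers
   attached to dead tensors run promptly and release their memory,
   instead of accumulating between collections. *)
let train_one_epoch () =
  (* hypothetical placeholder for the real epoch: forward pass, loss,
     backward pass, optimizer step *)
  ()

let () =
  for _epoch = 1 to 10 do
    train_one_epoch ();
    Gc.full_major ()
  done
```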
I have also observed that not calling the GC often enough during training can result in segfaults, but I suspect the problem is different here: the mnist/conv.exe example already calls the GC after each epoch and still exhibits the problem on my machine.
Hi all,
I also ran into a segmentation fault and spent much of the past two days trying to resolve it. I was running some optimization code of my own rather than the basic examples here (which I cannot share), and after several epochs the process terminated with a Segmentation fault (core dumped) error. In the /var/log/syslog file, I found relevant lines like the following:
Dec 17 12:08:55 h02 kernel: [1880432.552416] main.exe[29569]: segfault at 7fd86f5db908 ip 00007fd86153e7ee sp 00007fff73c98f00 error 4 in libtorch.so[7fd86037b000+e8d1000]
or
Dec 17 11:34:34 h02 kernel: [1878371.127518] traps: main.exe[18968] general protection ip:14bc7b71d19c sp:7ffd41c02ed8 error:0 in libc10.so[14bc7b701000+43000]
I made the following attempts before running dune exec ...:
(1) ran ulimit -s unlimited
(2) ran sudo apt update; sudo apt upgrade
(3) ran opam update; opam upgrade
Attempts (1) and (2) did not help, but (3) resolved the issue. I think upgrading the libtorch version is what helped. For completeness, here is the detailed output of what opam upgrade did on my machine:
The following actions will be performed:
↗ upgrade num 1.3 to 1.4
↗ upgrade dune 2.4.0 to 2.7.1
↗ upgrade conf-openblas 0.2.0 to 0.2.1
↗ upgrade conf-pkg-config 1.1 to 1.3
↗ upgrade libtorch 1.4.0 to 1.7.0+linux-x86_64
↗ upgrade topkg 1.0.1 to 1.0.3
↗ upgrade batteries 3.0.0 to 3.2.0
∗ install trie 1.0.0 [required by mew]
∗ install octavius 1.2.2 [required by ppx_js_style]
∗ install jane-street-headers v0.14.0 [required by time_now]
↗ upgrade sexplib0 v0.13.0 to v0.14.0
↗ upgrade ocaml-migrate-parsetree 1.6.0 to 2.1.0
↗ upgrade ocaml-compiler-libs v0.12.1 to v0.12.3
↗ upgrade integers 0.3.0 to 0.4.0
↗ upgrade dune-private-libs 2.4.0 to 2.7.1
↻ recompile stdlib-shims 0.1.0 [uses dune]
↻ recompile result 1.5 [uses dune]
↻ recompile re 1.9.0 [uses dune]
↻ recompile ppx_derivers 1.2.1 [uses dune]
↻ recompile npy 0.0.9 [uses dune]
↻ recompile mmap 1.1.0 [uses dune]
↻ recompile csv 2.4 [uses dune]
↻ recompile cppo 1.6.6 [uses dune]
↻ recompile camomile 1.0.2 [uses dune]
∗ install conf-libffi 2.0.0 [required by ctypes-foreign]
↗ upgrade astring 0.8.3 to 0.8.5
↻ recompile b0 0.0.1 [uses topkg]
↻ recompile owl-base 0.9.0 [uses dune]
∗ install mew 0.1.0 [required by mew_vi]
∗ install csexp 1.3.2 [required by dune-configurator]
↻ recompile tyxml 4.4.0 [uses dune]
↗ upgrade ppxlib 0.12.0 to 0.17.0
↗ upgrade ocplib-endian 1.0 to 1.1
↻ recompile charInfo_width 1.1.0 [uses dune]
↻ recompile ctypes-foreign 0.4.0 [upstream changes]
↗ upgrade fpath 0.7.2 to 0.7.3
∗ install mew_vi 0.5.0 [required by lambda-term]
↗ upgrade dune-configurator 2.4.0 to 2.7.1
↗ upgrade zed 2.0.6 to 3.1.0
↻ recompile ctypes 0.17.1 [uses integers, conf-pkg-config, ctypes-foreign]
↗ upgrade odoc 1.5.0 to 1.5.2
↗ upgrade lwt 5.2.0 to 5.3.0
↗ upgrade base v0.13.1 to v0.14.0
↗ upgrade eigen 0.2.0 to 0.3.0
↻ recompile odig 0.0.5 [uses odoc, topkg]
↻ recompile lwt_react 1.1.3 [uses dune, lwt]
↻ recompile lwt_log 1.1.1 [uses dune, lwt]
∗ install ppx_js_style v0.14.0 [required by ppx_base]
∗ install ppx_enumerate v0.14.0 [required by ppx_base]
↗ upgrade variantslib v0.13.0 to v0.14.0
↗ upgrade stdio v0.13.0 to v0.14.0
↗ upgrade ppx_sexp_conv v0.13.0 to v0.14.1
↗ upgrade ppx_here v0.13.0 to v0.14.0
↗ upgrade ppx_compare v0.13.0 to v0.14.0
↗ upgrade ppx_cold v0.13.0 to v0.14.0
↗ upgrade parsexp v0.13.0 to v0.14.0
↗ upgrade fieldslib v0.13.0 to v0.14.0
↗ upgrade lambda-term 2.0.3 to 3.1.0
↗ upgrade ppx_variants_conv v0.13.0 to v0.14.1
∗ install ppx_optcomp v0.14.0 [required by time_now]
↻ recompile owl 0.9.0* [uses eigen, dune, base, etc.]
↗ upgrade ppx_custom_printf v0.13.0 to v0.14.0
∗ install ppx_hash v0.14.0 [required by ppx_base]
↗ upgrade ppx_assert v0.13.0 to v0.14.0
↗ upgrade sexplib v0.13.0 to v0.14.0
↗ upgrade ppx_fields_conv v0.13.0 to v0.14.1
↗ upgrade utop 2.4.3 to 2.6.0
∗ install ppx_base v0.14.0 [required by time_now]
∗ install jst-config v0.14.0 [required by time_now]
∗ install time_now v0.14.0 [required by ppx_inline_test]
↗ upgrade ppx_inline_test v0.13.0 to v0.14.1
↗ upgrade ppx_expect v0.13.0 to v0.14.0
↗ upgrade torch 0.8 to 0.11
I hope this saves people some debugging time in the future.
Thanks, Gwonsoo
Programs I write using ocaml-torch with GPU acceleration segfault right before terminating. This is not a huge deal, as it happens when the program is about to terminate anyway, but I was wondering whether you had observed the same phenomenon. In particular, I replicated the problem on your mnist/conv and char_rnn examples.
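For context, here is a hypothetical minimal sketch of the kind of program affected, not taken from the original report. The tensor shapes and operations are arbitrary, and it assumes the Tensor and Device API as used in the repository's examples (Device.cuda_if_available, Tensor.randn with a ~device label, etc.):

```ocaml
(* Hypothetical minimal example: allocate CUDA tensors, print a result,
   and return. The crash, when it occurs, happens after this code has
   finished, while libtorch's CUDA state is torn down by static
   destructors at process exit (cf. the cudnnDestroy / __cxa_finalize
   frames in the backtrace above). *)
open Torch

let () =
  let device = Device.cuda_if_available () in
  let x = Tensor.randn [ 1024; 1024 ] ~device in
  let y = Tensor.matmul x x in
  Tensor.print (Tensor.sum y)
```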