LaurentMazare / ocaml-torch

OCaml bindings for PyTorch
Apache License 2.0

most exceptions from libtorch not caught (macos) #78

Closed · nilsbecker closed 1 year ago

nilsbecker commented 1 year ago

i was surprised to see that when trying nonsensical gradient operations, the program almost always exits with an uncaught exception. (not used to this at all in ocaml ;) )

an example:

libc++abi: terminating with uncaught exception of type c10::Error: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

is this intentional / unavoidable?

LaurentMazare commented 1 year ago

The intent is for the libtorch errors to be converted to ocaml exceptions so that the library users can catch them on the ocaml side. All libtorch errors should automatically be handled this way, but there may well be some bugs in the code that does this. Do you know if there is an ocaml exception being thrown here, e.g. can you try to catch it in a try ... with block? If not, would you have a small repro so that I can dig more into the issue?
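For reference, catching it should look something like this minimal sketch (untested as pasted; the tensor setup is made up, and the libtorch error should surface as a regular ocaml exception):

open Torch

let () =
  let t = Tensor.(rand [ 4 ] |> set_requires_grad ~r:true) in
  let g = Tensor.dot t t in
  Tensor.backward g;
  (* a second backward without keeping the graph should raise on the ocaml side *)
  try Tensor.backward g
  with exn -> Printf.printf "caught: %s\n" (Printexc.to_string exn)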

nilsbecker commented 1 year ago

hmm, it did not look like an ocaml exception was raised. the full output looked like this:

utop # let _ = T.backward g;;
libc++abi: terminating with uncaught exception of type c10::Error: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
Exception raised from unpack at /tmp/pytorch-20221227-13011-z5fwlt/torch/csrc/autograd/saved_variable.cpp:135 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 81 (0x10727a981 in libc10.dylib)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 197 (0x107279155 in libc10.dylib)
frame #2: torch::autograd::SavedVariable::unpack(std::__1::shared_ptr<torch::autograd::Node>) const + 1699 (0x119413f13 in libtorch_cpu.dylib)
frame #3: torch::autograd::generated::DotBackward0::apply(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> >&&) + 83 (0x118598663 in libtorch_cpu.dylib)
frame #4: torch::autograd::Node::operator()(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> >&&) + 99 (0x1193e5fe3 in libtorch_cpu.dylib)
frame #5: torch::autograd::Engine::evaluate_function(std::__1::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::__1::shared_ptr<torch::autograd::ReadyQueue> const&) + 1977 (0x1193e0009 in libtorch_cpu.dylib)
frame #6: torch::autograd::Engine::thread_main(std::__1::shared_ptr<torch::autograd::GraphTask> const&) + 948 (0x1193dcae4 in libtorch_cpu.dylib)
frame #7: torch::autograd::Engine::execute_with_graph_task(std::__1::shared_ptr<torch::autograd::GraphTask> const&, std::__1::shared_ptr<torch::autograd::Node>, torch::autograd::InputBuffer&&) + 423 (0x1193dc117 in libtorch_cpu.dylib)
frame #8: torch::autograd::Engine::execute(std::__1::vector<torch::autograd::Edge, std::__1::allocator<torch::autograd::Edge> > const&, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, bool, bool, bool, std::__1::vector<torch::autograd::Edge, std::__1::allocator<torch::autograd::Edge> > const&) + 2022 (0x1193db616 in libtorch_cpu.dylib)
frame #9: torch::autograd::run_backward(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, bool, bool, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, bool, bool) + 2129 (0x1193cbec1 in libtorch_cpu.dylib)
frame #10: torch::autograd::backward(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, c10::optional<bool>, bool, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&) + 104 (0x1193cc828 in libtorch_cpu.dylib)
frame #11: torch::autograd::VariableHooks::_backward(at::Tensor const&, c10::ArrayRef<at::Tensor>, c10::optional<at::Tensor> const&, c10::optional<bool>, bool) const + 431 (0x1194172ef in libtorch_cpu.dylib)
frame #12: at::Tensor::_backward(c10::ArrayRef<at::Tensor>, c10::optional<at::Tensor> const&, c10::optional<bool>, bool) const + 75 (0x11605e58b in libtorch_cpu.dylib)
frame #13: at::Tensor::backward(at::Tensor const&, c10::optional<bool>, bool, c10::optional<c10::ArrayRef<at::Tensor> >) const + 376 (0x10778fc38 in dlltorch_core_stubs.so)
frame #14: at_backward + 112 (0x10778fa10 in dlltorch_core_stubs.so)
frame #15: caml__16_at_backward + 23 (0x10774e927 in dlltorch_core_stubs.so)
frame #16: caml_interprete + 11292 (0x106f3194c in ocamlrun)
frame #17: caml_main + 1710 (0x106f34a6e in ocamlrun)
frame #18: main + 12 (0x106f5b9ac in ocamlrun)
frame #19: start + 462 (0x115ceb52e in dyld)

Abort trap: 6
nilsbecker commented 1 year ago

the program is very short.

#require "torch.toplevel";;

open Torch

module T = Tensor
let t1 = T.rand [4]
let t2 = T.rand [4]

let _ = T.set_requires_grad t1 ~r:true
let _ = T.set_requires_grad t2 ~r:true

let g = T.dot t1 t2

let _ = T.requires_grad g

let _ = T.backward g

let _ = T.backward g
LaurentMazare commented 1 year ago

Just gave your code a try in an ocaml program and it results in a proper ocaml exception being raised. It's certainly a bit of a verbose one as it contains the whole c++ stacktrace, but that's the intended behavior I guess. A try ... with block properly catches the exception. Maybe the behavior is different in utop, though I have no clue how exceptions are supposed to be handled there.

nilsbecker commented 1 year ago

oh really? there was so much output that i did not see any ocaml exception being reported. i'll check again.

nilsbecker commented 1 year ago

in utop i cannot catch an ocaml exception; the toplevel exits.
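(for comparison, a regular ocaml exception in utop is just reported and the session keeps running:

utop # failwith "boom";;
Exception: Failure "boom".

here the whole process dies with Abort trap: 6 instead.)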

nilsbecker commented 1 year ago

i also get an uncaught exception in compiled native code. output:

dune exec ./lala.exe
Done: 85% (6/7, 1 left) (jobs: 1)libc++abi: terminating with uncaught exception of type c10::Error: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
Exception raised from unpack at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/saved_variable.cpp:135 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 98 (0x10d00c992 in libc10.dylib)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 205 (0x10d00b1cd in libc10.dylib)
frame #2: torch::autograd::SavedVariable::unpack(std::__1::shared_ptr<torch::autograd::Node>) const + 2751 (0x1295da66f in libtorch_cpu.dylib)
frame #3: torch::autograd::generated::DotBackward0::apply(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> >&&) + 152 (0x128704cc8 in libtorch_cpu.dylib)
frame #4: torch::autograd::Node::operator()(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> >&&) + 99 (0x1295acea3 in libtorch_cpu.dylib)
frame #5: torch::autograd::Engine::evaluate_function(std::__1::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::__1::shared_ptr<torch::autograd::ReadyQueue> const&) + 1913 (0x1295a24b9 in libtorch_cpu.dylib)
frame #6: torch::autograd::Engine::thread_main(std::__1::shared_ptr<torch::autograd::GraphTask> const&) + 999 (0x1295a1317 in libtorch_cpu.dylib)
frame #7: torch::autograd::Engine::execute_with_graph_task(std::__1::shared_ptr<torch::autograd::GraphTask> const&, std::__1::shared_ptr<torch::autograd::Node>, torch::autograd::InputBuffer&&) + 900 (0x1295abd64 in libtorch_cpu.dylib)
frame #8: torch::autograd::Engine::execute(std::__1::vector<torch::autograd::Edge, std::__1::allocator<torch::autograd::Edge> > const&, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, bool, bool, bool, std::__1::vector<torch::autograd::Edge, std::__1::allocator<torch::autograd::Edge> > const&) + 2323 (0x1295a9a43 in libtorch_cpu.dylib)
frame #9: torch::autograd::run_backward(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, bool, bool, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, bool, bool) + 2167 (0x129590ef7 in libtorch_cpu.dylib)
frame #10: torch::autograd::backward(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, c10::optional<bool>, bool, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&) + 100 (0x129591454 in libtorch_cpu.dylib)
frame #11: torch::autograd::VariableHooks::_backward(at::Tensor const&, c10::ArrayRef<at::Tensor>, c10::optional<at::Tensor> const&, c10::optional<bool>, bool) const + 455 (0x1295dfdc7 in libtorch_cpu.dylib)
frame #12: at::Tensor::_backward(c10::ArrayRef<at::Tensor>, c10::optional<at::Tensor> const&, c10::optional<bool>, bool) const + 75 (0x12621b00b in libtorch_cpu.dylib)
frame #13: at::Tensor::backward(at::Tensor const&, c10::optional<bool>, bool, c10::optional<c10::ArrayRef<at::Tensor> >) const + 376 (0x10bbdc8f8 in lala.exe)
frame #14: at_backward + 112 (0x10bbdc6d0 in lala.exe)
frame #15: caml__16_at_backward + 23 (0x10bb9b5e7 in lala.exe)
frame #16: camlTorch_core__Torch_generated__anon_fn$5btorch_generated$2eml$3a49725$2c2$2d$2d116$5d_918 + 25 (0x10b87f049 in lala.exe)

Abort trap: 6

program:


open Torch

module T = Tensor
let t1 = T.rand [4]
let t2 = T.rand [4]

let _ = T.set_requires_grad t1 ~r:true
let _ = T.set_requires_grad t2 ~r:true

let g = T.dot t1 t2

let _ = T.requires_grad g

let _ = T.backward g

let _ =
  try
    T.backward g
  with _ -> print_string "exception caught";
    exit 0

dune file:

(executable
 (name lala)
 (libraries torch))
nilsbecker commented 1 year ago

macos 12.6.1, ocaml 4.14.1, libtorch 1.13.0, torch 0.17.

i had previously installed pytorch via homebrew, which gave linking errors when installing ocaml-torch because the homebrew and opam copies of the torch dynamic libraries apparently interfered. i have since removed the homebrew package and was then able to install ocaml-torch via opam install torch.

so i don't think it's a problem with the pytorch installation. not 100% sure though.

LaurentMazare commented 1 year ago

Maybe I'm missing something, but in your ocaml program you have a call to T.backward g that is outside of the try ... with block?

nilsbecker commented 1 year ago

yes, the first one is a correct use of a backward pass; the second one is the one that raises. it appears to be forbidden in pytorch because the tape has already been erased, at least that's how i interpret the first line of the error message. doing multiple backward passes is only allowed if retain_graph=True was specified beforehand at the appropriate place.
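if i read the ocaml bindings right, the retain_graph knob is exposed as an optional ?keep_graph flag on backward, so something like this untested sketch should make the double pass legal (the flag name is my recollection of the api, not verified here):

let _ = T.backward ~keep_graph:true g  (* retain saved values for another pass *)
let _ = T.backward g                   (* second pass should now succeed *)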

nilsbecker commented 1 year ago

the double backward pass is not the only way to trigger an uncaught exception; it also happened when i tried to slice a tensor along a non-existent dimension. if necessary i can try to make a repro case for that too.
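something along these lines triggered it (an untested sketch from memory; i'm assuming the generated narrow binding with ~dim ~start ~length labels):

let t = T.rand [ 4 ]
(* dim 1 does not exist on a rank-1 tensor, so libtorch raises c10::Error *)
let _ = T.narrow t ~dim:1 ~start:0 ~length:2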

nilsbecker commented 1 year ago

can you reproduce the uncaught exception when copy-pasting my example? i would be happy to help with further testing.

LaurentMazare commented 1 year ago

It all seems to work fine on my side; running your code above with the current github tip version, I get the following.

$ dune exec examples/basics/basics.exe
exception caught

Not sure what is going on there; sadly I don't have access to a mac to see whether it's related to the platform or to something else.

nilsbecker commented 1 year ago

damn. so it's really either mac specific or something went wrong with my installation. fwiw, i can successfully run simple correct programs, so pytorch is installed in some sort of working state. possibly something to do with differences in the compiler toolchain between linux and macos, and how exceptions are handled? just guessing now, and i'm definitely no expert. if anyone else reading this is, i'd be happy to test.

LaurentMazare commented 1 year ago

Actually I came across the following ocaml issue which seems related and would explain the different behavior on macos. Maybe you could try adding -Wl,-keep_dwarf_unwind to the c_library_flags in this dune file, and if this helps we could have a look at how to make this the default for macos builds.

nilsbecker commented 1 year ago

i tried this. i modified the dune file in my local clone of ocaml-torch as described. i then created a new 4.14.1+flambda switch and pinned torch to the local directory, followed by opam install .

to my dismay, installation of ctypes already failed with some problems finding ffi.h. annoying. i will report back when i get something installed at least.

nilsbecker commented 1 year ago

aha! i managed to work around the ctypes problems. i can now install the opam-pinned, patched version of ocaml-torch from your last post.

it worked!

dune exec ./lala.exe
Done: 57% (4/7, 3 left) (jobs: 0)exception caught
nilsbecker commented 1 year ago

for reference, the patched dune file is this:

(library
  (name torch_core)
  (public_name torch.core)
  (c_names torch_stubs)
  (cxx_names torch_api)
  (cxx_flags -std=c++14 -fPIC (:include cxx_flags.sexp))
  (c_library_flags :standard -Wl,-keep_dwarf_unwind -lstdc++ (:include c_library_flags.sexp))
  (libraries bigarray ctypes.foreign ctypes.stubs ctypes))
LaurentMazare commented 1 year ago

Thanks, I've just pushed some changes to enable this flag, but only on macos platforms. Could you give the current github tip a quick try to check that it's still all good on macos?
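Roughly, the idea is to make the configurator script emit the flag only when the compiler reports a macos system, along these lines (a sketch rather than the exact committed code):

(* config/discover.ml, sketched: add -Wl,-keep_dwarf_unwind only on macos *)
module C = Configurator.V1

let () =
  C.main ~name:"torch-config" (fun c ->
    (* the real script also computes the libtorch -L/-l flags; elided here *)
    let base_flags = [] in
    let is_macos =
      match C.ocaml_config_var c "system" with
      | Some "macosx" -> true
      | _ -> false
    in
    let flags =
      if is_macos then base_flags @ [ "-Wl,-keep_dwarf_unwind" ] else base_flags
    in
    C.Flags.write_sexp "c_library_flags.sexp" flags)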

nilsbecker commented 1 year ago

on my other machine i was unable to dune build the new tip just now, but that's due to the anaconda-installed pytorch being picked up instead of the vendored one. that's a different issue. i'll be able to test in a clean environment tomorrow.

nilsbecker commented 1 year ago

ok i tried with the github main branch tip and i can successfully catch the exception in the test program. this seems to work now!

nilsbecker commented 1 year ago

about the other problem of installing torch on macos when other libtorch installations are present: should i open another issue for that?

LaurentMazare commented 1 year ago

Great, thanks for confirming. Yes, please open a separate issue for the different libtorch installations bit; hopefully it's not too macos specific and I can help.

LaurentMazare commented 1 year ago

Closing this as the exception issue should be all fixed now; feel free to re-open if there are further issues.