The intent is for the libtorch errors to be converted to ocaml exceptions so that the library users can catch them on the ocaml side. All libtorch errors should automatically be handled this way, but there may well be some bugs in the code that does this. Do you know if there is an ocaml exception being thrown here, e.g. can you try to catch it in a try ... with block?
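For example, something along these lines should show whether the error makes it to the ocaml side as an exception (just a sketch, the tensor operation inside the try is a placeholder and not your actual code):

(* sketch: the call inside the try is only a placeholder that is expected to
   fail inside libtorch; the point is whether the failure surfaces as an
   ocaml exception that the handler can catch *)
open Torch

let () =
  let t = Tensor.rand [ 4 ] in
  try Tensor.backward t with
  | exn -> Printf.printf "caught: %s\n" (Printexc.to_string exn)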
If not, would you have a small repro so that I can dig more into the issue?
hmm, it did not look like an ocaml exception was raised. the full output looked like this:
utop # let _ = T.backward g;;
libc++abi: terminating with uncaught exception of type c10::Error: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
Exception raised from unpack at /tmp/pytorch-20221227-13011-z5fwlt/torch/csrc/autograd/saved_variable.cpp:135 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 81 (0x10727a981 in libc10.dylib)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 197 (0x107279155 in libc10.dylib)
frame #2: torch::autograd::SavedVariable::unpack(std::__1::shared_ptr<torch::autograd::Node>) const + 1699 (0x119413f13 in libtorch_cpu.dylib)
frame #3: torch::autograd::generated::DotBackward0::apply(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> >&&) + 83 (0x118598663 in libtorch_cpu.dylib)
frame #4: torch::autograd::Node::operator()(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> >&&) + 99 (0x1193e5fe3 in libtorch_cpu.dylib)
frame #5: torch::autograd::Engine::evaluate_function(std::__1::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::__1::shared_ptr<torch::autograd::ReadyQueue> const&) + 1977 (0x1193e0009 in libtorch_cpu.dylib)
frame #6: torch::autograd::Engine::thread_main(std::__1::shared_ptr<torch::autograd::GraphTask> const&) + 948 (0x1193dcae4 in libtorch_cpu.dylib)
frame #7: torch::autograd::Engine::execute_with_graph_task(std::__1::shared_ptr<torch::autograd::GraphTask> const&, std::__1::shared_ptr<torch::autograd::Node>, torch::autograd::InputBuffer&&) + 423 (0x1193dc117 in libtorch_cpu.dylib)
frame #8: torch::autograd::Engine::execute(std::__1::vector<torch::autograd::Edge, std::__1::allocator<torch::autograd::Edge> > const&, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, bool, bool, bool, std::__1::vector<torch::autograd::Edge, std::__1::allocator<torch::autograd::Edge> > const&) + 2022 (0x1193db616 in libtorch_cpu.dylib)
frame #9: torch::autograd::run_backward(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, bool, bool, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, bool, bool) + 2129 (0x1193cbec1 in libtorch_cpu.dylib)
frame #10: torch::autograd::backward(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, c10::optional<bool>, bool, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&) + 104 (0x1193cc828 in libtorch_cpu.dylib)
frame #11: torch::autograd::VariableHooks::_backward(at::Tensor const&, c10::ArrayRef<at::Tensor>, c10::optional<at::Tensor> const&, c10::optional<bool>, bool) const + 431 (0x1194172ef in libtorch_cpu.dylib)
frame #12: at::Tensor::_backward(c10::ArrayRef<at::Tensor>, c10::optional<at::Tensor> const&, c10::optional<bool>, bool) const + 75 (0x11605e58b in libtorch_cpu.dylib)
frame #13: at::Tensor::backward(at::Tensor const&, c10::optional<bool>, bool, c10::optional<c10::ArrayRef<at::Tensor> >) const + 376 (0x10778fc38 in dlltorch_core_stubs.so)
frame #14: at_backward + 112 (0x10778fa10 in dlltorch_core_stubs.so)
frame #15: caml__16_at_backward + 23 (0x10774e927 in dlltorch_core_stubs.so)
frame #16: caml_interprete + 11292 (0x106f3194c in ocamlrun)
frame #17: caml_main + 1710 (0x106f34a6e in ocamlrun)
frame #18: main + 12 (0x106f5b9ac in ocamlrun)
frame #19: start + 462 (0x115ceb52e in dyld)
Abort trap: 6
the program is very short.
#require "torch.toplevel";;
open Torch
module T = Tensor
let t1 = T.rand [4]
let t2 = T.rand [4]
let _ = T.set_requires_grad t1 ~r:true
let _ = T.set_requires_grad t1 ~r:true
let g = T.dot t1 t2
let _ = T.requires_grad g
let _ = T.backward g
let _ = T.backward g
Just gave a try to your code in an ocaml program and it results in a proper ocaml exception being raised. It's certainly a bit of a verbose one as it contains the whole C++ stacktrace, but that's the intended behavior I guess. A try .. with block properly catches the exception.
Maybe the behavior is different in utop though I have no clue how exceptions are supposed to be handled there.
oh really? there was so much output that i did not see any ocaml exception being reported. i'll check again.
in utop i cannot catch an ocaml exception; the toplevel exits.
i get an uncaught exception also in compiled native code. output:
dune exec ./lala.exe
Done: 85% (6/7, 1 left) (jobs: 1)libc++abi: terminating with uncaught exception of type c10::Error: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
Exception raised from unpack at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/saved_variable.cpp:135 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 98 (0x10d00c992 in libc10.dylib)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 205 (0x10d00b1cd in libc10.dylib)
frame #2: torch::autograd::SavedVariable::unpack(std::__1::shared_ptr<torch::autograd::Node>) const + 2751 (0x1295da66f in libtorch_cpu.dylib)
frame #3: torch::autograd::generated::DotBackward0::apply(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> >&&) + 152 (0x128704cc8 in libtorch_cpu.dylib)
frame #4: torch::autograd::Node::operator()(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> >&&) + 99 (0x1295acea3 in libtorch_cpu.dylib)
frame #5: torch::autograd::Engine::evaluate_function(std::__1::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::__1::shared_ptr<torch::autograd::ReadyQueue> const&) + 1913 (0x1295a24b9 in libtorch_cpu.dylib)
frame #6: torch::autograd::Engine::thread_main(std::__1::shared_ptr<torch::autograd::GraphTask> const&) + 999 (0x1295a1317 in libtorch_cpu.dylib)
frame #7: torch::autograd::Engine::execute_with_graph_task(std::__1::shared_ptr<torch::autograd::GraphTask> const&, std::__1::shared_ptr<torch::autograd::Node>, torch::autograd::InputBuffer&&) + 900 (0x1295abd64 in libtorch_cpu.dylib)
frame #8: torch::autograd::Engine::execute(std::__1::vector<torch::autograd::Edge, std::__1::allocator<torch::autograd::Edge> > const&, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, bool, bool, bool, std::__1::vector<torch::autograd::Edge, std::__1::allocator<torch::autograd::Edge> > const&) + 2323 (0x1295a9a43 in libtorch_cpu.dylib)
frame #9: torch::autograd::run_backward(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, bool, bool, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, bool, bool) + 2167 (0x129590ef7 in libtorch_cpu.dylib)
frame #10: torch::autograd::backward(std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&, c10::optional<bool>, bool, std::__1::vector<at::Tensor, std::__1::allocator<at::Tensor> > const&) + 100 (0x129591454 in libtorch_cpu.dylib)
frame #11: torch::autograd::VariableHooks::_backward(at::Tensor const&, c10::ArrayRef<at::Tensor>, c10::optional<at::Tensor> const&, c10::optional<bool>, bool) const + 455 (0x1295dfdc7 in libtorch_cpu.dylib)
frame #12: at::Tensor::_backward(c10::ArrayRef<at::Tensor>, c10::optional<at::Tensor> const&, c10::optional<bool>, bool) const + 75 (0x12621b00b in libtorch_cpu.dylib)
frame #13: at::Tensor::backward(at::Tensor const&, c10::optional<bool>, bool, c10::optional<c10::ArrayRef<at::Tensor> >) const + 376 (0x10bbdc8f8 in lala.exe)
frame #14: at_backward + 112 (0x10bbdc6d0 in lala.exe)
frame #15: caml__16_at_backward + 23 (0x10bb9b5e7 in lala.exe)
frame #16: camlTorch_core__Torch_generated__anon_fn$5btorch_generated$2eml$3a49725$2c2$2d$2d116$5d_918 + 25 (0x10b87f049 in lala.exe)
Abort trap: 6
program:
open Torch
module T = Tensor
let t1 = T.rand [4]
let t2 = T.rand [4]
let _ = T.set_requires_grad t1 ~r:true
let _ = T.set_requires_grad t1 ~r:true
let g = T.dot t1 t2
let _ = T.requires_grad g
let _ = T.backward g
let _ =
  try T.backward g
  with _ ->
    print_string "exception caught";
    exit 0
dune file:
(executable
(name lala)
(libraries torch))
macos 12.6.1, ocaml 4.14.1, libtorch 1.13.0, torch 0.17.
i had previously installed pytorch via homebrew, which gave linking errors when installing ocaml-torch because the homebrew and opam installations of the torch dynamic libraries apparently interfered. i have since removed the homebrew package and was then able to install ocaml-torch via opam install torch, so i don't think it's a problem with the pytorch installation. not 100% sure though.
Maybe I'm missing something but in your ocaml program, you have a call to T.backward g that is outside of the try..catch block?
yes, the first one is a correct use of a backward pass; the second one is the one that raises. it appears to be forbidden in pytorch because the tape is already erased, at least that's how i interpret the first line of the error message. doing multiple backward passes is only allowed if retain_graph=True was specified beforehand at the appropriate place.
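if i read the ocaml-torch bindings right, that flag is exposed as an optional keep_graph argument on backward, so the working two-pass version would presumably look something like this (untested sketch; the ~keep_graph flag mapping to retain_graph is my assumption):

(* untested sketch: keep the autograd graph alive so a second backward pass
   is allowed; assumes Tensor.backward exposes a ~keep_graph flag *)
open Torch
module T = Tensor

let () =
  let t1 = T.set_requires_grad (T.rand [ 4 ]) ~r:true in
  let t2 = T.rand [ 4 ] in
  let g = T.dot t1 t2 in
  T.backward ~keep_graph:true g;
  T.backward g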
the double backward pass is not the only way to trigger an uncaught exception; it also happened when i tried to slice a tensor along a non-existent dimension. if necessary i can try to make a repro case for that too.
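roughly, something like the snippet below should trigger the slicing case (hypothetical sketch, i have not re-run this exact code and the labelled arguments of narrow are from memory):

(* hypothetical repro sketch: narrow along a dimension that a 1-d tensor
   does not have; argument labels are from memory *)
open Torch

let t = Tensor.rand [ 4 ]
let _ = Tensor.narrow t ~dim:3 ~start:0 ~length:1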
can you reproduce the uncaught exception when copy-pasting my example? i would be happy to help with further testing.
It all seems to work fine on my side; running your code above with the current github tip version I get the following.
$ dune exec examples/basics/basics.exe
exception caught
Not sure what is going on there, sadly I don't have access to a mac to see if it's related to this or to something else.
damn. so it's really either mac-specific or something went wrong with my installation. fwiw, i can successfully run simple correct programs, so pytorch is installed in some sort of working state. possibly something to do with differences between the linux and macos compiler toolchains and how exceptions are handled? just guessing now, and i'm definitely no expert. if anyone else reading this is, i'd be happy to test.
Actually I came across the following ocaml issue which seems related and would explain the different behavior on macos. Maybe you could give a try to adding -Wl,-keep_dwarf_unwind to the c_library_flags in this dune file, and if this helps we could have a look at how to make this the default for macos builds.
i tried this. i modified the dune file as described in my local clone of ocaml-torch, created a new 4.14.1+flambda switch, pinned torch to the local directory, and ran opam install . there. to my dismay, installation of ctypes already failed with some problems finding ffi.h. annoying. i will report when i get something installed at least.
aha! i managed to work around the ctypes problems. i can now install the opam pinned patched version of ocaml-torch from your last post.
it worked!
dune exec ./lala.exe
Done: 57% (4/7, 3 left) (jobs: 0)exception caught
for reference, the patched dune file is this:
(library
(name torch_core)
(public_name torch.core)
(c_names torch_stubs)
(cxx_names torch_api)
(cxx_flags -std=c++14 -fPIC (:include cxx_flags.sexp))
(c_library_flags :standard -Wl,-keep_dwarf_unwind -lstdc++ (:include c_library_flags.sexp))
(libraries bigarray ctypes.foreign ctypes.stubs ctypes))
Thanks, I've just pushed some changes to enable this flag but only on the macos platforms. Could you give a quick try at the current github tip to check that it's still all good on macos?
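One way to enable this only on macos is to have the discover script that generates c_library_flags.sexp append the flag conditionally; a minimal illustration (not the exact script from the repo, which also emits the libtorch link flags):

(* illustrative dune-configurator sketch, not the actual script from the repo:
   emit -Wl,-keep_dwarf_unwind into c_library_flags.sexp only on macos *)
module C = Configurator.V1

let () =
  C.main ~name:"torch_config" (fun c ->
      let system =
        match C.ocaml_config_var c "system" with
        | Some s -> s
        | None -> ""
      in
      let flags =
        if String.equal system "macosx" then [ "-Wl,-keep_dwarf_unwind" ] else []
      in
      (* the real script would append these to the other link flags it computes *)
      C.Flags.write_sexp "c_library_flags.sexp" flags)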
on my other machine i was unable to dune build the new tip just now, but that's due to the anaconda-installed pytorch being picked up instead of the vendored one. that's a different issue. i'll be able to test in a clean environment tomorrow.
ok i tried with the github main branch tip and i can successfully catch the exception in the test program. this seems to work now!
about the other problem of installing torch on macos when other libtorch installations are present: should i open another issue for that?
Great, thanks for confirming. Yes please open a separate issue for the different libtorch installations bit; hopefully this is not too macos-specific and I can help.
Closing this as the exception issue should be all fixed now, feel free to re-open if further issues.
i was surprised to see that when trying nonsensical gradient operations, the program almost always exits with an uncaught exception. (not used to this at all in ocaml ;) )
an example:
is this intentional / unavoidable?