LaurentMazare / tch-rs

Rust bindings for the C++ api of PyTorch.

Please document how to use models created via tracing Rust-defined networks #756

Open · emchristiansen opened this issue 1 year ago

emchristiansen commented 1 year ago

Please document how to use models created via tracing Rust-defined networks. An end-to-end example showing how to do everything in Rust would be very helpful!

I'm currently attempting to use a Rust-defined network that was saved with CModule::save and reloaded with TrainableCModule::load, and I'm seeing NaNs in the output that don't appear when I run the original network directly (without the save/load round trip).
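
For concreteness, here is a rough sketch of the save half of this workflow (assuming the `CModule::create_by_tracing` helper used by examples/jit-trace and a tch version contemporary with this issue; the network, file name, and variable names are just placeholders):

```rust
use tch::{nn, nn::Module, CModule, Device, Kind, Tensor};

fn main() -> Result<(), tch::TchError> {
    // Define a small network in Rust with tch's nn building blocks.
    let vs = nn::VarStore::new(Device::Cpu);
    let root = vs.root();
    let net = nn::seq()
        .add(nn::linear(&root / "l1", 784, 128, Default::default()))
        .add_fn(|xs| xs.relu())
        .add(nn::linear(&root / "l2", 128, 10, Default::default()));

    // Trace the network on an example input (as in examples/jit-trace)
    // and save the resulting TorchScript module to disk.
    let example = Tensor::zeros(&[1, 784], (Kind::Float, Device::Cpu));
    let mut closure = |inputs: &[Tensor]| vec![net.forward(&inputs[0])];
    let traced = CModule::create_by_tracing("MyNet", "forward", &[example], &mut closure)?;
    traced.save("traced.pt")?;
    Ok(())
}
```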

FYI, if you create a traced model using examples/jit-trace and then try to train it using examples/jit-train, you get the error:

```
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Torch("element 0 of tensors does not require grad and does not have a grad_fn
Exception raised from run_backward at /opt/conda/conda-bld/pytorch_1682343995026/work/torch/csrc/autograd/autograd.cpp:97 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f863938b4d7 in /home/ubuntu/anaconda3/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f863935536b in /home/ubuntu/anaconda3/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x46543b0 (0x7f863da413b0 in /home/ubuntu/anaconda3/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: torch::autograd::backward(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, c10::optional<bool>, bool, std::vector<at::Tensor, std::allocator<at::Tensor> > const&) + 0x5c (0x7f863da4342c in /home/ubuntu/anaconda3/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x46aacae (0x7f863da97cae in /home/ubuntu/anaconda3/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: at::Tensor::_backward(c10::ArrayRef<at::Tensor>, c10::optional<at::Tensor> const&, c10::optional<bool>, bool) const + 0x48 (0x7f863ab20558 in /home/ubuntu/anaconda3/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x2c901 (0x558647e4e901 in ./target/release/examples/jit-train)
frame #7: <unknown function> + 0x22677 (0x558647e44677 in ./target/release/examples/jit-train)
frame #8: <unknown function> + 0x159ee (0x558647e379ee in ./target/release/examples/jit-train)
frame #9: <unknown function> + 0x16143 (0x558647e38143 in ./target/release/examples/jit-train)
frame #10: <unknown function> + 0x16b0d (0x558647e38b0d in ./target/release/examples/jit-train)
frame #11: <unknown function> + 0x4a1ce (0x558647e6c1ce in ./target/release/examples/jit-train)
frame #12: <unknown function> + 0x16135 (0x558647e38135 in ./target/release/examples/jit-train)
frame #13: <unknown function> + 0x29d90 (0x7f863902ad90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #14: __libc_start_main + 0x80 (0x7f863902ae40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #15: <unknown function> + 0x15365 (0x558647e37365 in ./target/release/examples/jit-train)
")', src/wrappers/tensor.rs:300:27
```
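
For reference, the jit-train side looks roughly like the sketch below (modelled loosely on examples/jit-train; exact signatures may differ across tch versions, and the file and variable names are placeholders). The backward step is presumably where the panic above is raised, since the loss computed from the loaded module does not require grad:

```rust
use tch::{nn, nn::OptimizerConfig, Device, Kind, Tensor, TrainableCModule};

fn main() -> Result<(), tch::TchError> {
    // Load the traced module into a VarStore so its parameters can be trained.
    let vs = nn::VarStore::new(Device::Cpu);
    let mut trainable = TrainableCModule::load("traced.pt", vs.root())?;
    trainable.set_train();

    let mut opt = nn::Adam::default().build(&vs, 1e-3)?;

    // Dummy batch; real code would iterate over a dataset.
    let xs = Tensor::zeros(&[8, 784], (Kind::Float, Device::Cpu));
    let ys = Tensor::zeros(&[8], (Kind::Int64, Device::Cpu));

    let logits = trainable.forward_ts(&[xs])?;
    let loss = logits.cross_entropy_for_logits(&ys);
    // With the traced model from examples/jit-trace, this backward pass is what
    // fails: the loss has no grad_fn, so autograd has nothing to differentiate.
    opt.backward_step(&loss);
    Ok(())
}
```
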
emchristiansen commented 1 year ago

More to the point: is it possible to train a model that was defined in Rust, saved, and then re-loaded?

I've found training a Rust-defined model directly in Rust to be very slow, and I was hoping that tracing the model to TorchScript would speed it up.