LaurentMazare / tch-rs

Rust bindings for the C++ api of PyTorch.

Discrepancy in output between same model (with same weights) in tch-rs and PyTorch #409

Closed marc-dlf closed 3 years ago

marc-dlf commented 3 years ago

Hi, I am encountering some difficulties while trying to use, in Rust, a model which was trained and saved from PyTorch. I observe slight differences in the output, which significantly affect the final goal I am trying to achieve.

I have tried to reproduce this with some simpler networks and also observed differences (whenever the network contains convolution layers). Below is the model with Conv2d layers that I use in tch-rs:

Simple network tch-rs (with Conv2d layers):

```rust
use anyhow::Result;
use std::borrow::Borrow;
use std::path::PathBuf;
use tch::{nn, nn::ModuleT, Device, Kind, Tensor};

#[derive(Debug)]
struct Net {
    conv1: nn::Conv2D,
    conv2: nn::Conv2D,
    fc1: nn::Linear,
    fc2: nn::Linear,
}

impl Net {
    fn new<'p, P>(vs: P) -> Net
    where
        P: Borrow<nn::Path<'p>>,
    {
        let p = vs.borrow();
        let conv1 = nn::conv2d(p / "conv1", 1, 32, 5, Default::default());
        let conv2 = nn::conv2d(p / "conv2", 32, 64, 5, Default::default());
        let fc1 = nn::linear(p / "fc1", 1024, 1024, Default::default());
        let fc2 = nn::linear(p / "fc2", 1024, 1, Default::default());
        Net { conv1, conv2, fc1, fc2 }
    }
}

impl nn::ModuleT for Net {
    fn forward_t(&self, xs: &Tensor, _train: bool) -> Tensor {
        xs.apply(&self.conv1)
            .max_pool2d_default(2)
            .apply(&self.conv2)
            .max_pool2d_default(2)
            .view([-1, 1024])
            .apply(&self.fc1)
            .relu()
            .apply(&self.fc2)
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test() -> Result<(), anyhow::Error> {
        let device = Device::cuda_if_available();
        let model_home = PathBuf::from("/path/to/model_dir");
        let model_dir: PathBuf = model_home.into();
        let weights_file = model_dir.join("model.ot");
        let mut vs = nn::VarStore::new(device);
        let simple_net = Net::new(&vs.root());
        vs.load(weights_file)?;
        let sample_input =
            Tensor::ones(&[4, 1, 28, 28], (Kind::Float, device)).set_requires_grad(true);
        let output = simple_net.forward_t(&sample_input, false);
        output.save("/path/to/outputs.pt")?;
        Ok(())
    }
}
```
Simple network PyTorch (with Conv2d layers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 5, 1)
        self.conv2 = nn.Conv2d(32, 64, 5, 1)
        self.fc1 = nn.Linear(1024, 1024)
        self.fc2 = nn.Linear(1024, 1)

    def forward(self, x):
        x = self.conv1(x)
        x = F.max_pool2d(x, 2)
        x = self.conv2(x)
        x = F.max_pool2d(x, 2)
        x = x.view(-1, 1024)
        x = self.fc1(x)
        x = F.relu(x)
        output = self.fc2(x)
        return output

simple_net = Net().to("cuda:0")  # instantiation (missing in the original snippet)
imgs = torch.ones([4, 1, 28, 28], dtype=torch.float32, requires_grad=True).to("cuda:0")
output = simple_net(imgs)
```

When I load and compare the outputs of both models in Python, this yields a difference of

```
tensor([[5.8487e-07],
        [5.8487e-07],
        [5.8487e-07],
        [5.8487e-07]], device='cuda:0', grad_fn=<SubBackward0>)
```
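For reference, here is roughly how I compare the two outputs on the Python side (the path is a placeholder; a tensor saved from tch-rs loads in Python as a container whose state_dict holds it under the key `"0"`):

```python
import torch

# Load the output tensor saved from the Rust side (path is a placeholder).
rust_output = torch.load("/path/to/outputs.pt").state_dict()["0"]

# `output` is the tensor computed by the PyTorch model above.
diff = output - rust_output.to(output.device)
print(diff)
```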

When the network contains no convolutional layers, the difference is exactly 0:

Simple network tch-rs (with only linear layers):

```rust
use anyhow::Result;
use std::borrow::Borrow;
use std::path::PathBuf;
use tch::{nn, nn::ModuleT, Device, Kind, Tensor};

#[derive(Debug)]
struct Net {
    fc1: nn::Linear,
    fc2: nn::Linear,
}

impl Net {
    fn new<'p, P>(vs: P) -> Net
    where
        P: Borrow<nn::Path<'p>>,
    {
        let p = vs.borrow();
        let fc1 = nn::linear(p / "fc1", 1024, 1024, Default::default());
        let fc2 = nn::linear(p / "fc2", 1024, 1, Default::default());
        Net { fc1, fc2 }
    }
}

impl nn::ModuleT for Net {
    fn forward_t(&self, xs: &Tensor, _train: bool) -> Tensor {
        xs.apply(&self.fc1).relu().apply(&self.fc2)
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test() -> Result<(), anyhow::Error> {
        let device = Device::cuda_if_available();
        let model_home = PathBuf::from("/path/to/model_dir");
        let model_dir: PathBuf = model_home.into();
        let weights_file = model_dir.join("model.ot");
        let mut vs = nn::VarStore::new(device);
        let simple_net = Net::new(&vs.root());
        vs.load(weights_file)?;
        let sample_input =
            Tensor::ones(&[4, 1024], (Kind::Float, device)).set_requires_grad(true);
        let output = simple_net.forward_t(&sample_input, false);
        output.save("/path/to/outputs.pt")?;
        Ok(())
    }
}
```
Simple network PyTorch (with only linear layers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(1024, 1024)
        self.fc2 = nn.Linear(1024, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = F.relu(x)
        output = self.fc2(x)
        return output

simple_net = Net().to("cuda:0")  # instantiation (missing in the original snippet)
imgs = torch.ones([4, 1024], dtype=torch.float32, requires_grad=True).to("cuda:0")
output = simple_net(imgs)
```

There, as I said, taking the difference between the Rust and Python outputs yields

```
tensor([[0.],
        [0.],
        [0.],
        [0.]], device='cuda:0', grad_fn=<SubBackward0>)
```

where everything is perfectly equal.

To give some details about the method I use, here is the flow:

1. I initialize the network (with random weights) in PyTorch and save its weights.
2. I convert the weights so they can be loaded in Rust (using the method described in rust-bert: https://github.com/guillaume-be/rust-bert/blob/master/src/convert-tensor.rs).
3. I load them in Rust, compute the output, and save it.
4. I compare the two outputs.

The actual model I use contains a GRU layer, and I notice the same kind of problem there.

I checked that the weights I save and then load in Rust have exactly the same values as the ones in Python, and both results are obtained on the same GPU with torch version '1.9.0+cu102'. This leaves me wondering what can cause this small difference, since, to the best of my understanding, PyTorch and tch-rs call the same C++ backend and I am using the same machine in both cases.
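For completeness, here is a minimal sketch of the kind of weight check I mean, assuming the weights were exported to an npz file as in the rust-bert conversion flow (the path is a placeholder):

```python
import numpy as np
import torch

# Compare the exported weights against the live PyTorch module, key by key.
nps = np.load("/tmp/mymodel.npz")
for k, v in simple_net.state_dict().items():
    max_abs_diff = (v.cpu() - torch.from_numpy(nps[k])).abs().max().item()
    print(k, max_abs_diff)
```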

To be more precise: do you think this kind of difference is inevitable, or is something else going on (maybe the underlying C++ code called in the two cases is not exactly the same as I think it is)?

Thank you for your great work and for helping me!

LaurentMazare commented 3 years ago

I actually gave this a try but got exactly the same values from the Python and the Rust sides. To give more details about what I did, I extracted the initial weights from Python via the following:

```python
import numpy as np

nps = {}
for k, v in simple_net.state_dict().items():
    nps[k] = v.numpy()
np.savez('/tmp/mymodel.npz', **nps)
```

Then I converted this to the format expected by the crate via:

```bash
cargo run --example tensor-tools cp /tmp/mymodel.npz /tmp/mymodel.ot
```

I then ran your Rust code, and finally the following Python code, which printed only zeros.

```python
imgs = torch.ones([4, 1, 28, 28], dtype=torch.float32, requires_grad=True)
output = simple_net(imgs)
(output - torch.load("/tmp/outputs.pt").state_dict()["0"]).pow(2).sum()
```

One thing that might be different is that I used the CPU backend, so you may want to give that a try rather than going through CUDA. If you still see a diff, another thing to try on the Python side is loading the weights from the same file as used on the Rust side, e.g. via the following code.

```python
_mdl = torch.load("/tmp/mymodel.ot")
for k, v in _mdl.state_dict().items():
    _k1, _k2 = k.split("|")
    setattr(getattr(simple_net, _k1), _k2, nn.Parameter(v))
```
marc-dlf commented 3 years ago

Thank you for your quick answer!

You are right: when I run the experiment again on the CPU, it works fine. Oddly, there is still a tiny difference when I run it on my GPU, but only if I have run another model (completely unrelated to this one) earlier in the same IPython notebook. In a freshly restarted notebook, the difference doesn't exist. I still wonder where this residual comes from.
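One (unverified) hypothesis for this: with cuDNN autotuning enabled, the convolution algorithm PyTorch picks can depend on what ran earlier in the process, and different algorithms can differ in the last bits of float32. A sketch of settings that would rule this out on the Python side:

```python
import torch

# Disable cuDNN autotuning so the convolution algorithm choice does not
# depend on what ran earlier in the same process (hypothesis, not confirmed).
torch.backends.cudnn.benchmark = False
# Optionally also force deterministic convolution kernels.
torch.backends.cudnn.deterministic = True
```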

In any case, this helped me notice that the tiny difference was not the cause of my problem, so thank you again and have a nice day!