LaurentMazare / tch-rs

Rust bindings for the C++ api of PyTorch.
Apache License 2.0

Seed fixing in multi threaded context #263

Closed vegapit closed 4 years ago

vegapit commented 4 years ago

Hello,

I have noticed that training reproducibility goes out the window when I use multiple threads to train multiple models simultaneously. Is it because tch::manual_seed is not reliable on child threads, or is there a more fundamental problem?

Cheers

LaurentMazare commented 4 years ago

My guess is that this could be a more fundamental problem: if the random number generation shares some state between threads, then it will be called in some non-deterministic order and reproducibility is likely to be very tough. I'm not sure how this works though, but I would expect the C++ API that we use to have the same problem, so it may be worth searching/asking on PyTorch's GitHub or on the PyTorch forums.
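To illustrate the shared-state concern, here is a minimal sketch in plain Rust that does not use tch or libtorch at all. It assumes the usual workaround for this kind of problem: give every worker thread its own generator, seeded from a base seed plus the worker index, so that no draw depends on how the threads are scheduled. `SplitMix64`, `simulate_training` and `train_all` are hypothetical names made up for this example, and whether libtorch's generator can be isolated per thread in this way is not something this thread confirms.

```rust
use std::thread;

/// SplitMix64, a tiny deterministic generator used here only for illustration.
/// It is a stand-in for a per-thread RNG, not part of tch or libtorch.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical stand-in for "train one model": draws `draws` random numbers
/// from a generator owned by this call and folds them into one value.
fn simulate_training(seed: u64, draws: usize) -> u64 {
    let mut rng = SplitMix64::new(seed);
    let mut acc = 0u64;
    for _ in 0..draws {
        acc = acc.wrapping_add(rng.next_u64());
    }
    acc
}

/// Spawns one thread per model. Each thread gets its own seed derived from
/// `base_seed` and the model index, so threads never share generator state.
fn train_all(base_seed: u64, n_models: usize) -> Vec<u64> {
    let handles: Vec<_> = (0..n_models)
        .map(|i| {
            let seed = base_seed.wrapping_add(i as u64);
            thread::spawn(move || simulate_training(seed, 1_000))
        })
        .collect();
    // Join in spawn order so the result vector is ordered by model index.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Calling `train_all(42, 4)` twice should return identical vectors, because each result depends only on its own seed. If all threads instead drew from one generator behind a lock, the interleaving of draws would depend on scheduling, which is the situation described above.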

vegapit commented 4 years ago

Will do. I have observed inconsistent evaluation calls as well, which got me worried. If the underlying library is unstable in a multi-threaded context, there is a huge incentive to move to a Rust-only neural network library.