Closed vegapit closed 4 years ago
My guess is that this could be a more fundamental problem: if the random number generation shares some state between threads, then it will be called in some non-deterministic order and reproducibility is likely to be very tough. I'm not sure how this works, though, but I would expect that the C++ API that we use has the same problem, so it may be worth searching/asking on PyTorch's GitHub or on the PyTorch forums.
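To make the shared-state concern concrete, here is a plain-Rust sketch (no tch involved; the tiny LCG and the thread layout are purely illustrative). When threads draw from one shared RNG, the values each thread receives depend on scheduling; when each thread owns an independently seeded RNG, its draws are deterministic regardless of scheduling:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Minimal linear congruential generator standing in for a library RNG.
struct Lcg(u64);
impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

// Shared RNG: both threads pull from the same state, so which values
// land in which thread depends on the (non-deterministic) interleaving.
fn shared_rng_draws(seed: u64) -> Vec<u64> {
    let rng = Arc::new(Mutex::new(Lcg(seed)));
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let rng = Arc::clone(&rng);
            thread::spawn(move || {
                (0..3).map(|_| rng.lock().unwrap().next()).collect::<Vec<_>>()
            })
        })
        .collect();
    handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
}

// Per-thread RNG, each with its own seed: every thread's sequence is
// fixed by its seed alone, so runs are reproducible.
fn per_thread_draws(base_seed: u64) -> Vec<u64> {
    let handles: Vec<_> = (0..2)
        .map(|i| {
            thread::spawn(move || {
                // Independent seed per thread (illustrative scheme).
                let mut rng = Lcg(base_seed + i as u64);
                (0..3).map(|_| rng.next()).collect::<Vec<_>>()
            })
        })
        .collect();
    handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
}

fn main() {
    // The per-thread variant gives identical draws on every run.
    assert_eq!(per_thread_draws(42), per_thread_draws(42));
    // The shared variant compiles and runs, but its per-thread values
    // are scheduling-dependent, so no such equality is guaranteed.
    let _ = shared_rng_draws(42);
    println!("per-thread draws: {:?}", per_thread_draws(42));
}
```

Whether the PyTorch C++ backend can be driven in the per-thread style from tch-rs is exactly the open question above; this only shows why a single global seed is insufficient once threads share RNG state.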
Will do. I have observed inconsistent evaluation calls as well, which got me worried. If the underlying library is unstable in a multi-threaded context, there is a huge incentive to move to a Rust-only neural network library.
Hello,
I have noticed that training reproducibility goes out of the window when I use multiple threads to train multiple models simultaneously. Is it because
tch::manual_seed
is not reliable on child threads, or is there a more fundamental problem?

Cheers