Setting seed does not allow to exactly reproduce results

johann-petrak commented 5 years ago

The seed is used, or should be used, for shuffling the dataset and for random weight initialisation. Need to check where setting the seed does not have the proper impact.

johann-petrak commented 5 years ago

Testing this with the gate-lf-tests/cl-sentclass/model-pytorch-multifeat1-l dataset and --seed 123: ./train.sh --seed 123 data/crvd.meta.json data/model

Run1: stopped after 5 epochs, training losses: 103.7733, 85.8920, 78.1588, 70.2469, validation losses: 0.7041, 0.7523, 0.7162, 0.7197
Run2: stopped after 11 epochs, first training losses: 103.2578, 85.2654, 78.1462, 69.4860, validation losses: 0.6954, 0.7300, 0.6756, 0.6776

The converted train and val datasets are identical between runs, so the data shuffling is probably not the cause.
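
A minimal sketch of how one could check that the converted files are byte-identical between two runs, by comparing SHA-256 digests. The helper names `file_digest` and `same_files` are hypothetical, as are the directory and file names passed to them; this is not part of the library.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_files(dir_a, dir_b, names):
    """True if every named file has identical content in both directories."""
    return all(
        file_digest(Path(dir_a) / name) == file_digest(Path(dir_b) / name)
        for name in names
    )
```

Calling `same_files(run1_dir, run2_dir, names)` with the names of the converted train and val files then returns `True` only if both runs saw exactly the same data.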

johann-petrak commented 5 years ago

According to https://pytorch.org/docs/stable/notes/randomness.html the torch.manual_seed() method should set both the CPU and CUDA RNG seeds.

Retry after making sure that we set all global RNGs before each training step, and that we also set the global RNGs in the dataset so that random embeddings are created from a properly seeded numpy RNG:

Copy data to several different dirs, run on CUDA:
Run1-1: stop after 11 epochs val losses: 0.6871, 0.7428, 0.7054, 0.6817, ... 0.6923 Training losses: 103.4056, 84.3498, 76.7454, 68.1419 ... 14.8852
Run2-1: stop after 11, val losses: 0.6871, 0.7428, 0.7054, 0.6806 ... 0.6965 (TINY DIFFERENCE!) Training losses: 103.4056, 84.3498, 76.7454, 68.1504 ... 14.8248 (TINY DIFFERENCES!)
Run3-1: stop after 11, val losses: 0.6871, 0.7428, 0.7054, 0.6778 ... 0.6913 (DIFF!) Training losses: 103.4056, 84.3498, 76.7454, 68.1492 ... 14.7923 (DIFF)
Run1-2: stop after 11, val losses: 0.6871, 0.7427, 0.7132, 0.6863, ... 0.6854
Run1-3: stop after 5, val losses: 0.6871, 0.7428, 0.7135, 0.6922

Run on CPU:

Run1-1: stop after 11, val losses: 0.7683, 0.8173, 0.6716, 0.7072, ... 0.7256
Run2-1: exactly the same
Run1-2: exactly the same

So this seems to be fixed for the CPU, where we get full repeatability. With CUDA the runs are still not quite identical.
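
For reference, a minimal sketch of what "setting all global RNGs" could look like. The function name `seed_everything` is hypothetical and not taken from the library code; the sketch assumes the relevant generators are Python's `random`, numpy's global RNG, and PyTorch's (seeded via `torch.manual_seed()`, as mentioned above), and it skips numpy or torch if they are not installed.

```python
import random


def seed_everything(seed):
    """Seed the Python, numpy and PyTorch global RNGs (those that are installed)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)  # numpy global RNG, e.g. for random embeddings
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # per the PyTorch notes, covers CPU and CUDA
    except ImportError:
        pass
```

Calling this with the same seed before the dataset is created and again before training starts makes both steps draw from the same RNG state in every run.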

johann-petrak commented 5 years ago

Try setting the cuDNN backend mode to deterministic.

Run1-1: 11 epochs, val losses: 0.6871, 0.7427, 0.7131, 0.6855, ... 0.6925
Run1-2: exactly same
Run1-3: exactly same
Run2-1: exactly same

OK, this is fixed now!
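
A minimal sketch of the cuDNN setting referred to above, assuming it is applied once before training starts. The wrapper name `make_cudnn_deterministic` is hypothetical; the two flags are the ones described in the PyTorch randomness notes. Disabling `benchmark` is an addition here, not something confirmed in this issue.

```python
def make_cudnn_deterministic():
    """Ask cuDNN to use deterministic algorithms; returns False if torch is missing."""
    try:
        import torch
    except ImportError:
        return False
    # Select deterministic cuDNN algorithms only.
    torch.backends.cudnn.deterministic = True
    # Disable the auto-tuner, which may pick different algorithms per run.
    torch.backends.cudnn.benchmark = False
    return True
```

Deterministic mode can be slower than the default, so it may be worth enabling only when a seed is given.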

GateNLP / gate-lf-pytorch-json

Setting seed does not allow to exactly reproduce results #34