IBM / mi-prometheus

Enabling reproducible Machine Learning research
http://mi-prometheus.rtfd.io/
Apache License 2.0

Investigate why two validation runs on the same model return slightly different statistics #75

Open tkornuta-ibm opened 5 years ago

tkornuta-ibm commented 5 years ago

Validation with a batch of size 1 works perfectly.

Validation with a batch of size 10 from time to time returns values that differ between runs.

At first I thought the issue was related to the lack of weighted averaging when the last batch is not dropped. Sadly, the issue remained even when dropping the last batch / limiting the size of the set to a single batch.
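To make that suspicion concrete, here is a minimal sketch (plain Python; the names and numbers are mine, not the actual statistics-collector code) of how an unweighted mean over episodes drifts away from the true per-sample mean as soon as the last batch is smaller:

# Per-episode (per-batch) mean losses reported during validation.
batch_losses = [0.20, 0.30, 0.10]
# Sizes of the corresponding batches; the last one is smaller (drop_last: False).
batch_sizes = [10, 10, 4]

# Unweighted mean over episodes: every batch counts equally.
unweighted = sum(batch_losses) / len(batch_losses)

# Size-weighted mean: equals the true mean over all 24 samples.
weighted = sum(l * n for l, n in zip(batch_losses, batch_sizes)) / sum(batch_sizes)

print(unweighted)  # 0.2
print(weighted)    # 0.225 -> the two disagree whenever batch sizes differ

That kind of mismatch disappears with drop_last: True, so the discrepancy observed below must have another source.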

To Reproduce

mip-offline-trainer --c configs/vision/simplecnn_mnist.yaml

Validation problem section:

validation:
  problem:
    name: MNIST
    batch_size: 10
    use_train_data: True
    resize: [32, 32]
  sampler:
    name: SubsetRandomSampler
    indices: [55000, 55010]
  dataloader:
    drop_last: True
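For context, the validation loader described by this section should be roughly equivalent to the plain-PyTorch sketch below; the Resize transform and the interpretation of indices as the half-open range [55000, 55010) are my assumptions, not taken from the mi-prometheus sources:

import torch
from torchvision import datasets, transforms

# Assumption: resize [32, 32] maps to a Resize transform on the MNIST images.
transform = transforms.Compose([transforms.Resize((32, 32)), transforms.ToTensor()])

# use_train_data: True -> the validation samples are carved out of the training split.
mnist_train = datasets.MNIST(root='./data', train=True, download=True, transform=transform)

# Assumption: indices [55000, 55010] denote the 10 samples reported in the logs.
sampler = torch.utils.data.SubsetRandomSampler(list(range(55000, 55010)))

loader = torch.utils.data.DataLoader(mnist_train, batch_size=10, sampler=sampler, drop_last=True)
# -> "10 samples in 1 episodes", matching the trainer output below.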

================================================================================
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> Validating over the entire validation set (10 samples in 1 episodes)
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> episode 000858; episodes_aggregated 000001; loss 0.0016851807; loss_min 0.0016851807; loss_max 0.0016851807; loss_std 0.0000000000; epoch 00; acc 1.0000000000; acc_min 1.0000000000; acc_max 1.0000000000; acc_std 0.0000000000; samples_aggregated 000010 [Full Validation]
[2018-11-08 12:36:07] - INFO - Model >>> Model and statistics exported to checkpoint ./experiments/MNIST/SimpleConvNet/20181108_123547/models/model_best.pt
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>>

[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> Training finished because Epoch limit reached
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> Validating over the entire validation set (10 samples in 1 episodes)
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> episode 000858; episodes_aggregated 000001; loss 0.0016851809; loss_min 0.0016851809; loss_max 0.0016851809; loss_std 0.0000000000; epoch 00; acc 1.0000000000; acc_min 1.0000000000; acc_max 1.0000000000; acc_std 0.0000000000; samples_aggregated 000010 [Full Validation]

tkornuta-ibm commented 5 years ago

Validation set of 10 samples: OK.

================================================================================

[2018-11-08 12:55:28] - INFO - OfflineTrainer >>> Starting next epoch: 0
[2018-11-08 12:55:28] - INFO - OfflineTrainer >>> loss 2.3090281487; episode 000000; epoch 00; acc 0.0312500000; batch_size 000064
[2018-11-08 12:55:30] - INFO - OfflineTrainer >>> episode 000084; episodes_aggregated 000085; loss 0.6594894528; loss_min 0.1157590970; loss_max 2.3090281487; loss_std 0.5856138468; epoch 00; acc 0.7722426653; acc_min 0.0312500000; acc_max 0.9687500000; acc_std 0.2274699211; samples_aggregated 005440 [Epoch 0]
[2018-11-08 12:55:30] - INFO - OfflineTrainer >>> Validating over the entire validation set (10 samples in 1 episodes)
[2018-11-08 12:55:30] - INFO - OfflineTrainer >>> episode 000084; episodes_aggregated 000001; loss 0.0505415723; loss_min 0.0505415723; loss_max 0.0505415723; loss_std 0.0000000000; epoch 00; acc 1.0000000000; acc_min 1.0000000000; acc_max 1.0000000000; acc_std 0.0000000000; samples_aggregated 000010 [Full Validation]
[2018-11-08 12:55:30] - INFO - Model >>> Model and statistics exported to checkpoint ./experiments/MNIST/SimpleConvNet/20181108_125528/models/model_best.pt
[2018-11-08 12:55:30] - INFO - OfflineTrainer >>>

[2018-11-08 12:55:30] - INFO - OfflineTrainer >>> Training finished because Epoch limit reached
[2018-11-08 12:55:30] - INFO - OfflineTrainer >>> Validating over the entire validation set (10 samples in 1 episodes)
[2018-11-08 12:55:30] - INFO - OfflineTrainer >>> episode 000084; episodes_aggregated 000001; loss 0.0505415723; loss_min 0.0505415723; loss_max 0.0505415723; loss_std 0.0000000000; epoch 00; acc 1.0000000000; acc_min 1.0000000000; acc_max 1.0000000000; acc_std 0.0000000000; samples_aggregated 000010 [Full Validation]
[2018-11-08 12:55:30] - INFO - OfflineTrainer >>> Experiment finished!

tkornuta-ibm commented 5 years ago

Batch size of 10: NOT OK!!

loss 0.0698996484 vs loss 0.0698996559

================================================================================

[2018-11-08 12:57:14] - INFO - OfflineTrainer >>> Starting next epoch: 0
[2018-11-08 12:57:14] - INFO - OfflineTrainer >>> loss 2.3131735325; episode 000000; epoch 00; acc 0.1406250000; batch_size 000064
[2018-11-08 12:57:16] - INFO - OfflineTrainer >>> episode 000084; episodes_aggregated 000085; loss 0.5389420390; loss_min 0.0600712560; loss_max 2.3131735325; loss_std 0.5660359263; epoch 00; acc 0.8187500238; acc_min 0.0468750000; acc_max 1.0000000000; acc_std 0.2081174403; samples_aggregated 005440 [Epoch 0]
[2018-11-08 12:57:16] - INFO - OfflineTrainer >>> Validating over the entire validation set (10 samples in 1 episodes)
[2018-11-08 12:57:16] - INFO - OfflineTrainer >>> episode 000084; episodes_aggregated 000001; loss 0.0698996484; loss_min 0.0698996484; loss_max 0.0698996484; loss_std 0.0000000000; epoch 00; acc 1.0000000000; acc_min 1.0000000000; acc_max 1.0000000000; acc_std 0.0000000000; samples_aggregated 000010 [Full Validation]
[2018-11-08 12:57:16] - INFO - Model >>> Model and statistics exported to checkpoint ./experiments/MNIST/SimpleConvNet/20181108_125714/models/model_best.pt
[2018-11-08 12:57:16] - INFO - OfflineTrainer >>>

[2018-11-08 12:57:16] - INFO - OfflineTrainer >>> Training finished because Epoch limit reached
[2018-11-08 12:57:16] - INFO - OfflineTrainer >>> Validating over the entire validation set (10 samples in 1 episodes)
[2018-11-08 12:57:16] - INFO - OfflineTrainer >>> episode 000084; episodes_aggregated 000001; loss 0.0698996559; loss_min 0.0698996559; loss_max 0.0698996559; loss_std 0.0000000000; epoch 00; acc 1.0000000000; acc_min 1.0000000000; acc_max 1.0000000000; acc_std 0.0000000000; samples_aggregated 000010 [Full Validation]
[2018-11-08 12:57:16] - INFO - OfflineTrainer >>> Experiment finished!
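For scale: the two full-validation losses above (0.0698996484 vs 0.0698996559) differ by roughly one float32 ulp at this magnitude, i.e. the smallest step two distinct single-precision losses of this size can take. This does not identify the source of the discrepancy, but it bounds its size; a quick check (plain NumPy, not mi-prometheus code):

import numpy as np

a = 0.0698996484   # full-validation loss, first run
b = 0.0698996559   # full-validation loss, second run (same model, same 10 samples)

print(b - a)                      # ~7.5e-09
print(np.spacing(np.float32(a)))  # ~7.45e-09 -> one float32 ulp at this magnitude
# The gap between the two reported losses is almost exactly one float32 ulp,
# so the two runs produce adjacent single-precision values rather than equal ones.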