Open tkornuta-ibm opened 5 years ago
validation set: 10 OK
================================================================================
[2018-11-08 12:55:30] - INFO - OfflineTrainer >>> Training finished because Epoch limit reached
[2018-11-08 12:55:30] - INFO - OfflineTrainer >>> Validating over the entire validation set (10 samples in 1 episodes)
[2018-11-08 12:55:30] - INFO - OfflineTrainer >>> episode 000084; episodes_aggregated 000001; loss 0.0505415723; loss_min 0.0505415723; loss_max 0.0505415723; loss_std 0.0000000000; epoch 00; acc 1.0000000000; acc_min 1.0000000000; acc_max 1.0000000000; acc_std 0.0000000000; samples_aggregated 000010 [Full Validation]
[2018-11-08 12:55:30] - INFO - OfflineTrainer >>> Experiment finished!
batch size: 10 NOT OK!!
loss 0.0698996484 vs loss 0.0698996559
================================================================================
[2018-11-08 12:57:16] - INFO - OfflineTrainer >>> Training finished because Epoch limit reached
[2018-11-08 12:57:16] - INFO - OfflineTrainer >>> Validating over the entire validation set (10 samples in 1 episodes)
[2018-11-08 12:57:16] - INFO - OfflineTrainer >>> episode 000084; episodes_aggregated 000001; loss 0.0698996559; loss_min 0.0698996559; loss_max 0.0698996559; loss_std 0.0000000000; epoch 00; acc 1.0000000000; acc_min 1.0000000000; acc_max 1.0000000000; acc_std 0.0000000000; samples_aggregated 000010 [Full Validation]
[2018-11-08 12:57:16] - INFO - OfflineTrainer >>> Experiment finished!
validation with batch of size 1 - works perfectly
validation with batch of size 10 - from time to time returns values that differ
At first I thought the issue was related to the lack of weighted averaging when we are not dropping the last batch. Sadly, the issue remained even when dropping the last batch or limiting the size of the set to a single batch.
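To make the weighted-averaging hypothesis concrete, here is a minimal sketch of the effect when the last batch is smaller than the others. The function names and the numbers are hypothetical and are not taken from the trainer code; the sketch only assumes that each batch reports the mean loss over its own samples.

```python
from typing import Sequence


def naive_mean_loss(batch_losses: Sequence[float]) -> float:
    """Average the per-batch mean losses, ignoring batch sizes."""
    return sum(batch_losses) / len(batch_losses)


def weighted_mean_loss(batch_losses: Sequence[float],
                       batch_sizes: Sequence[int]) -> float:
    """Average the per-batch mean losses, weighting each by its batch size."""
    total_samples = sum(batch_sizes)
    weighted_sum = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return weighted_sum / total_samples


# Hypothetical example: 6 samples split into a full batch of 4 and a last batch of 2.
per_sample_losses = [1.0, 1.0, 1.0, 1.0, 4.0, 4.0]
batch_sizes = [4, 2]
batch_losses = [1.0, 4.0]  # mean loss within each batch

true_mean = sum(per_sample_losses) / len(per_sample_losses)
print("true mean:    ", true_mean)
print("naive mean:   ", naive_mean_loss(batch_losses))
print("weighted mean:", weighted_mean_loss(batch_losses, batch_sizes))
```

With a single batch, or with drop_last set so that all batches have the same size, both averages coincide, which matches the observation that this alone does not explain the difference.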
To Reproduce
mip-offline-trainer --c configs/vision/simplecnn_mnist.yaml
Validation problem section:
validation:
  problem:
    name: MNIST
    batch_size: 10
    use_train_data: True
    resize: [32, 32]
  sampler:
    name: SubsetRandomSampler
    indices: [55000, 55010]
  dataloader:
    drop_last: True
================================================================================
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> Validating over the entire validation set (10 samples in 1 episodes)
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> episode 000858; episodes_aggregated 000001; loss 0.0016851807; loss_min 0.0016851807; loss_max 0.0016851807; loss_std 0.0000000000; epoch 00; acc 1.0000000000; acc_min 1.0000000000; acc_max 1.0000000000; acc_std 0.0000000000; samples_aggregated 000010 [Full Validation]
[2018-11-08 12:36:07] - INFO - Model >>> Model and statistics exported to checkpoint ./experiments/MNIST/SimpleConvNet/20181108_123547/models/model_best.pt
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>>
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> Training finished because Epoch limit reached
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> Validating over the entire validation set (10 samples in 1 episodes)
[2018-11-08 12:36:07] - INFO - OfflineTrainer >>> episode 000858; episodes_aggregated 000001; loss 0.0016851809; loss_min 0.0016851809; loss_max 0.0016851809; loss_std 0.0000000000; epoch 00; acc 1.0000000000; acc_min 1.0000000000; acc_max 1.0000000000; acc_std 0.0000000000; samples_aggregated 000010 [Full Validation]
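To put the size of the mismatch in perspective, the sketch below measures how many single-precision steps (ULPs) separate the two losses in each pair reported above. It assumes the losses are computed in float32, which the logs do not state; the helper names float32_bits and ulp_distance are made up for this illustration.

```python
import struct


def float32_bits(x: float) -> int:
    """Round x to the nearest float32 and return its raw 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def ulp_distance(a: float, b: float) -> int:
    """Number of float32 steps separating a and b.

    Valid for finite values of the same sign, which holds for the
    positive losses compared here.
    """
    return abs(float32_bits(a) - float32_bits(b))


# Loss pairs copied from the logs in this report.
pairs = [
    (0.0698996484, 0.0698996559),
    (0.0016851807, 0.0016851809),
]

for first, second in pairs:
    steps = ulp_distance(first, second)
    print(f"{first:.10f} vs {second:.10f}: {steps} float32 step(s) apart")
```

If the distances come out as only a step or two, that would be consistent with rounding differences, for example from summing the per-sample losses in a different order (SubsetRandomSampler draws the indices in random order), rather than with an error in the averaging logic. This is a hypothesis only and is not confirmed by the logs.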