deephealthproject / eddl

European Distributed Deep Learning (EDDL) library. A general-purpose library initially developed to cover deep learning needs in healthcare use cases within the DeepHealth project.
https://deephealthproject.github.io/eddl/

Loss value increases and gets to -nan with increasing batch size #243

Closed (simleo closed this issue 3 years ago)

simleo commented 3 years ago

CC: @giobus75

On the develop branch (b64f2b3191c56d90a32aed7cf046dd31b5823a83), I got the following results by running the mnist_mlp_regularizers C++ example while changing only the batch size (a sketch of the setup is shown after the results):

BS = 100 (original example)

5 epochs of 600 batches of size 100
Epoch 1
Batch 600 softmax4 ( loss[softmax_cross_entropy]=0.2438 metric[categorical_accuracy]=0.9271 ) -- 0.0072 secs/batch
4.3278 secs/epoch
Epoch 2
Batch 600 softmax4 ( loss[softmax_cross_entropy]=0.1108 metric[categorical_accuracy]=0.9666 ) -- 0.0045 secs/batch
2.7062 secs/epoch
Epoch 3
Batch 600 softmax4 ( loss[softmax_cross_entropy]=0.0875 metric[categorical_accuracy]=0.9730 ) -- 0.0041 secs/batch
2.4814 secs/epoch
Epoch 4
Batch 600 softmax4 ( loss[softmax_cross_entropy]=0.0705 metric[categorical_accuracy]=0.9789 ) -- 0.0051 secs/batch
3.0775 secs/epoch
Epoch 5
Batch 600 softmax4 ( loss[softmax_cross_entropy]=0.0609 metric[categorical_accuracy]=0.9814 ) -- 0.0042 secs/batch
2.5113 secs/epoch
Evaluate with batch size 100
Batch 100 softmax4 ( loss[softmax_cross_entropy]=0.0961 metric[categorical_accuracy]=0.9704 ) -- 

BS = 300

5 epochs of 200 batches of size 300
Epoch 1
Batch 200 softmax4 ( loss[softmax_cross_entropy]=0.3831 metric[categorical_accuracy]=0.8881 ) -- 0.0105 secs/batch
2.1010 secs/epoch
Epoch 2
Batch 200 softmax4 ( loss[softmax_cross_entropy]=0.1045 metric[categorical_accuracy]=0.9683 ) -- 0.0057 secs/batch
1.1315 secs/epoch
Epoch 3
Batch 200 softmax4 ( loss[softmax_cross_entropy]=0.0792 metric[categorical_accuracy]=0.9758 ) -- 0.0059 secs/batch
1.1708 secs/epoch
Epoch 4
Batch 200 softmax4 ( loss[softmax_cross_entropy]=0.0594 metric[categorical_accuracy]=0.9814 ) -- 0.0057 secs/batch
1.1349 secs/epoch
Epoch 5
Batch 200 softmax4 ( loss[softmax_cross_entropy]=0.0550 metric[categorical_accuracy]=0.9825 ) -- 0.0054 secs/batch
1.0866 secs/epoch
Evaluate with batch size 100
Batch 100 softmax4 ( loss[softmax_cross_entropy]=0.0785 metric[categorical_accuracy]=0.9776 ) -- 

BS = 450

5 epochs of 133 batches of size 450
Epoch 1
Batch 133 softmax4 ( loss[softmax_cross_entropy]=0.5394 metric[categorical_accuracy]=0.8446 ) -- 0.0074 secs/batch
0.9820 secs/epoch
Epoch 2
Batch 133 softmax4 ( loss[softmax_cross_entropy]=0.1202 metric[categorical_accuracy]=0.9635 ) -- 0.0078 secs/batch
1.0439 secs/epoch
Epoch 3
Batch 133 softmax4 ( loss[softmax_cross_entropy]=0.0844 metric[categorical_accuracy]=0.9741 ) -- 0.0084 secs/batch
1.1114 secs/epoch
Epoch 4
Batch 133 softmax4 ( loss[softmax_cross_entropy]=0.0626 metric[categorical_accuracy]=0.9804 ) -- 0.0094 secs/batch
1.2486 secs/epoch
Epoch 5
Batch 133 softmax4 ( loss[softmax_cross_entropy]=0.0638 metric[categorical_accuracy]=0.9807 ) -- 0.0094 secs/batch
1.2445 secs/epoch
Evaluate with batch size 100
Batch 100 softmax4 ( loss[softmax_cross_entropy]=0.1177 metric[categorical_accuracy]=0.9684 ) --

BS = 500

5 epochs of 120 batches of size 500
Epoch 1
Batch 120 softmax4 ( loss[softmax_cross_entropy]=-nan metric[categorical_accuracy]=0.1182 ) -- 0.0105 secs/batch
1.2639 secs/epoch
Epoch 2
Batch 120 softmax4 ( loss[softmax_cross_entropy]=-nan metric[categorical_accuracy]=0.0976 ) -- 0.0075 secs/batch
0.9027 secs/epoch
Epoch 3
Batch 120 softmax4 ( loss[softmax_cross_entropy]=-nan metric[categorical_accuracy]=0.0997 ) -- 0.0093 secs/batch
1.1220 secs/epoch
Epoch 4
Batch 120 softmax4 ( loss[softmax_cross_entropy]=-nan metric[categorical_accuracy]=0.0981 ) -- 0.0094 secs/batch
1.1308 secs/epoch
Epoch 5
Batch 120 softmax4 ( loss[softmax_cross_entropy]=-nan metric[categorical_accuracy]=0.0995 ) -- 0.0076 secs/batch
0.9071 secs/epoch
Evaluate with batch size 100
Batch 100 softmax4 ( loss[softmax_cross_entropy]=-nan metric[categorical_accuracy]=0.0980 ) -- 

The metric value is also affected: with BS = 500, categorical accuracy stays around 0.10, i.e. chance level for the 10 MNIST classes.
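
For reference, the only difference between the runs above is the batch size passed to `fit`. Below is a minimal sketch of the setup; layer sizes, regularizer coefficients and optimizer settings are recalled from the mnist examples rather than copied from the exact example source, so treat those values as assumptions:

```cpp
// Sketch of an mnist_mlp_regularizers-style setup; only batch_size varies between runs.
#include <eddl/apis/eddl.h>
using namespace eddl;

int main() {
    download_mnist();

    int epochs = 5;
    int batch_size = 500;  // 100 (original), 300 and 450 train normally; 500 gives -nan

    // MLP with regularized dense layers (coefficients are assumed, not copied verbatim)
    layer in = Input({784});
    layer l = in;
    l = ReLu(L2(Dense(l, 1024), 0.0001f));
    l = ReLu(L1(Dense(l, 1024), 0.0001f));
    l = ReLu(L1L2(Dense(l, 1024), 0.00001f, 0.0001f));
    layer out = Softmax(Dense(l, 10));
    model net = Model({in}, {out});

    build(net,
          sgd(0.01f, 0.9f),            // optimizer values assumed
          {"softmax_cross_entropy"},   // loss shown in the logs above
          {"categorical_accuracy"},    // metric shown in the logs above
          CS_CPU());

    // MNIST tensors as shipped with the eddl examples
    Tensor* x_train = Tensor::load("mnist_trX.bin");
    Tensor* y_train = Tensor::load("mnist_trY.bin");
    Tensor* x_test  = Tensor::load("mnist_tsX.bin");
    Tensor* y_test  = Tensor::load("mnist_tsY.bin");
    x_train->div_(255.0f);
    x_test->div_(255.0f);

    fit(net, {x_train}, {y_train}, batch_size, epochs);  // the only argument changed per run
    evaluate(net, {x_test}, {y_test});                   // evaluation runs with batch size 100
    return 0;
}
```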

RParedesPalacios commented 3 years ago

It's solved.