deephealthproject / eddl

European Distributed Deep Learning (EDDL) library. A general-purpose library initially developed to cover deep learning needs in healthcare use cases within the DeepHealth project.
https://deephealthproject.github.io/eddl/
MIT License

Dice metric poor results #223

Closed MicheleCancilla closed 3 years ago

MicheleCancilla commented 3 years ago

Describe the bug The Dice metric always shows poor results on the training set. On the other hand, the Dice computed by use_case_pipeline gives good results on the validation set.

...
Validation - Epoch 0/149 - volume 1/1 - batch 15/15 - Load time: 0.0258366 - - Dice: 0.598712 - Dice: 1 - Dice: 1 - Dice: 0.448295 - Dice: 1 - Dice: 1 - Dice: 1 - Dice: 1.25e-08 - Dice: 3.84615e-08 - Dice: 1 - Dice: 1 - Dice: 1 - Dice: 0.678279 - Dice: 0.714657 - Dice: 1 - Dice: 0.438819  - Validation time: 0.521612
----------------------------
Mean Dice Coefficient: 0.723106
----------------------------
Saving weights...
Epoch 1/149 - volume 0/9 - batch 0/15 - Load time: 0.123144 - Batch 0 sigmoid11 ( loss[cross_entropy]=176.189 metric[dice]=0.015 ) -- Train time: 1.13445
Epoch 1/149 - volume 0/9 - batch 1/15 - Load time: 0.0637485 - Batch 1 sigmoid11 ( loss[cross_entropy]=152.330 metric[dice]=0.010 ) -- Train time: 1.15347
Epoch 1/149 - volume 0/9 - batch 2/15 - Load time: 0.0650308 - Batch 2 sigmoid11 ( loss[cross_entropy]=139.343 metric[dice]=0.021 ) -- Train time: 1.14352

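For reference, the Dice values in these logs measure the overlap between a predicted mask A and a ground-truth mask B, Dice = 2|A∩B| / (|A| + |B|). Below is a minimal standalone sketch of that formula for binary masks (a hypothetical helper for illustration only, not EDDL's internal metric implementation, which may use different thresholding and smoothing):

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Dice = 2*|A ∩ B| / (|A| + |B|), with a small epsilon to avoid division by zero.
float dice_coefficient(const std::vector<float>& pred,
                       const std::vector<float>& target,
                       float threshold = 0.5f,
                       float eps = 1e-8f) {
    float intersection = 0.f, pred_sum = 0.f, target_sum = 0.f;
    for (std::size_t i = 0; i < pred.size(); ++i) {
        float p = pred[i] > threshold ? 1.f : 0.f;    // binarize the sigmoid output
        float t = target[i] > threshold ? 1.f : 0.f;  // binarize the ground-truth mask
        intersection += p * t;
        pred_sum += p;
        target_sum += t;
    }
    return (2.f * intersection + eps) / (pred_sum + target_sum + eps);
}

int main() {
    std::vector<float> pred   = {0.9f, 0.2f, 0.8f, 0.1f};
    std::vector<float> target = {1.0f, 0.0f, 1.0f, 1.0f};
    // 2*2 / (2 + 3) = 0.8
    std::cout << "Dice: " << dice_coefficient(pred, target) << std::endl;
    return 0;
}
```
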
To Reproduce Steps to reproduce the behavior:

  1. Clone and build use_case_pipeline.
  2. Launch MS_SEGMENTATION_TRAINING script.

Expected behavior The Dice on the training set should be greater than or equal to the one on the validation set.

RParedesPalacios commented 3 years ago

In some cases the metric on the training set can be lower than on validation. This usually happens when strong data augmentation is applied or the training set is noisy. It could be interesting, for instance, to also compute the Dice on the original (non-augmented) training set after training, to see what is happening.

Also remember that the metric reported while training is an average over all the batches seen so far, and the first batches are normally very low, so the training metric tends to be pessimistic. The metric you get when you run the evaluation over the validation dataset, on the other hand, uses the final (better) model and no data augmentation.
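
A toy numeric illustration of that point (the per-batch values below are hypothetical, not taken from the run above):

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    // Hypothetical per-batch Dice values improving over one epoch (not real data).
    std::vector<float> batch_dice = {0.01f, 0.05f, 0.15f, 0.35f, 0.55f, 0.70f};
    float running_sum = 0.f;
    for (std::size_t i = 0; i < batch_dice.size(); ++i) {
        running_sum += batch_dice[i];
        // Running mean over the batches seen so far, as described above.
        std::cout << "batch " << i << " running mean Dice: "
                  << running_sum / (i + 1) << std::endl;
    }
    // The final running mean is only ~0.30 even though the last batch reached 0.70,
    // whereas validation is evaluated only with the end-of-epoch weights.
    return 0;
}
```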

@MicheleCancilla could you please run the following check (a rough sketch follows the list)?

  1. Run one epoch of training
    1. Evaluate validation
    2. Evaluate training

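A rough sketch of that check with the EDDL C++ API, assuming the data is available as in-memory tensors. The paths, shapes, and the toy model below are placeholders; the actual use_case_pipeline builds a U-Net and streams volumes through ECVL, so the calls would have to be adapted to it:

```cpp
#include "eddl/apis/eddl.h"

using namespace eddl;

int main() {
    // Placeholder data files: images and binary masks (hypothetical paths).
    Tensor* x_train = Tensor::load("train_images.bin");
    Tensor* y_train = Tensor::load("train_masks.bin");
    Tensor* x_val   = Tensor::load("val_images.bin");
    Tensor* y_val   = Tensor::load("val_masks.bin");

    // Toy stand-in for the segmentation network (the real pipeline uses a U-Net).
    layer in  = Input({1, 256, 256});
    layer out = Sigmoid(Conv(ReLu(Conv(in, 16, {3, 3})), 1, {3, 3}));
    model net = Model({in}, {out});
    build(net,
          adam(0.0001),        // optimizer
          {"cross_entropy"},   // same loss as in the training log
          {"dice"},            // same metric as in the training log
          CS_GPU({1}));

    // 1. Run one epoch of training.
    fit(net, {x_train}, {y_train}, /*batch_size=*/8, /*epochs=*/1);

    // 2. Evaluate on the validation set with the end-of-epoch weights.
    evaluate(net, {x_val}, {y_val});

    // 3. Evaluate on the (non-augmented) training set with the same weights.
    evaluate(net, {x_train}, {y_train});

    return 0;
}
```
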
Thanks