Open lcaronson opened 3 years ago
Hey @lcaronson,
thanks for your interest in using MIScnn!
I am a little unclear on how you came up with the final 0.9544 Dice coefficient value in the published MIScnn paper. Is there some kind of additional function that can be used to compare the test data to predictions? Or is that value returned during the cross-validation phase?
The DSC of 0.9544 for the kidney segmentation was automatically computed with our cross-validation function (https://github.com/frankkramer-lab/MIScnn/blob/master/miscnn/evaluation/cross_validation.py). With default parameters (i.e. without any callbacks), no validation monitoring is performed, so the returned cross-validation folds can be used as testing sets.
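In code, this boils down to a call roughly like the following (a minimal sketch; the exact parameter names may differ slightly between MIScnn versions, so please check the linked module):

```python
from miscnn.evaluation.cross_validation import cross_validation

# sample_list: list of sample IDs loaded via Data_IO, model: a MIScnn Neural_Network.
# Performs a k-fold split, trains one model per fold and scores it on that fold's test set.
cross_validation(sample_list, model, k_fold=3,
                 epochs=300,                    # illustrative training length
                 evaluation_path="evaluation")  # results are written to this directory
```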
However, you can always run a prediction call yourself and then compute the associated DSCs. You can find an example of this approach in our CellTracking example with 2D microscopy images: https://github.com/frankkramer-lab/MIScnn/blob/master/examples/CellTracking.ipynb
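A rough sketch of that manual route (assuming a trained `model` and its `data_io` object set up as in the examples; attribute names like `seg_data`/`pred_data` and the exact loader signature may differ between MIScnn versions):

```python
import numpy as np

test_samples = ["case_00100", "case_00101"]   # hypothetical testing sample IDs

# Run inference; MIScnn stores the predictions back via the Data_IO interface
model.predict(test_samples)

def dice_coefficient(truth, pred, class_id=1):
    """Plain Dice similarity coefficient for one class."""
    gt, pd = (truth == class_id), (pred == class_id)
    return 2.0 * np.logical_and(gt, pd).sum() / (gt.sum() + pd.sum() + 1e-8)

# Reload each sample together with its ground truth and prediction, then score it
for sample_id in test_samples:
    sample = data_io.sample_loader(sample_id, load_seg=True, load_pred=True)
    print(sample_id, "DSC:", dice_coefficient(sample.seg_data, sample.pred_data))
```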
If the cross-validation phase is also doing the testing of the data, then how do we define the ratio of train/validate/test data? For example, my understanding is that of the roughly 300 studies in the KiTS19 dataset, you used an 80/90/40 ratio? I am just trying to figure out how you set these parameters in the code.
For KiTS19, we didn't use any validation set and computed our scores purely on the testing sets from the 3-fold cross-validation. Also, we only used a subset of 120 samples, i.e. 3x (80 train & 40 test). We did this to demonstrate a default approach without any more advanced validation monitoring techniques.
However, in our more recent COVID-19 segmentation study based on limited data, we used a cross-validation (train/val) plus testing strategy: https://www.sciencedirect.com/science/article/pii/S2352914821001660?via%3Dihub
In this study, we performed a 5-fold cross-validation on only 20 samples and obtained 5 models (each fold returning one model). Then, we computed predictions on a completely separate hold-out set of 100 samples (from another source). We computed 5 predictions for each sample (one from each fold model) and then averaged these 5 predictions into a single one (= ensemble learning). Afterwards, we computed the DSC on the ensembled/final predictions. In the paper, we also did some more fancy analyses to show that the ensemble learning strategy is highly efficient and that MIScnn is capable of producing robust models with it, even from such a low number of samples as 20. Here is the complete COVID-19 study code: https://github.com/frankkramer-lab/covid19.MIScnn
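Just to illustrate the ensembling step itself (this is not the study code linked above, only a sketch assuming the softmax outputs of the 5 fold models were stored as NumPy arrays):

```python
import numpy as np

def ensemble_prediction(prob_maps):
    """Pixel-wise unweighted mean of the fold models' softmax maps,
    followed by argmax to obtain the final segmentation mask."""
    mean_probs = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return np.argmax(mean_probs, axis=-1)

# Hypothetical usage for one hold-out sample and 5 fold models:
# prob_maps = [np.load(f"fold_{i}/sample_042.npy") for i in range(5)]  # (x, y, z, classes)
# final_mask = ensemble_prediction(prob_maps)
# ...then compute the DSC between final_mask and the ground truth, as usual.
```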
As a final question for you, if I have a dataset of 60 studies, would a decent train/validate/test ratio be 30/15/15?
Sadly, there is no clear answer to this question. Personally, I would highly recommend an 80/20 percentage split into train/test and then running a 3-fold or 5-fold cross-validation on the 80% training data, followed by ensemble learning techniques for testing. This is the state-of-the-art approach and will yield great performance. Otherwise, I'm a personal fan of running a 65/15/20 split for train/val/test. It highly depends on how much data you have. 60 samples is quite low in terms of neural networks (even if it's a very good dataset from a medical perspective, given the complexity of generating annotated medical imaging datasets!!), which is why I'm a big fan of cross-validation.
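As a sketch of that recommended 80/20 + cross-validation strategy with 60 hypothetical sample IDs (using scikit-learn just for the splitting; the per-fold training itself would be done with MIScnn as usual):

```python
from sklearn.model_selection import train_test_split, KFold

samples = ["sample_%02d" % i for i in range(60)]     # hypothetical sample IDs

# 80/20 hold-out split: 48 samples for training, 12 kept untouched for final testing
train_set, test_set = train_test_split(samples, test_size=0.2, random_state=1)

# 5-fold cross-validation on the training portion -> 5 models
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
for fold, (train_idx, val_idx) in enumerate(kfold.split(train_set)):
    fold_train = [train_set[i] for i in train_idx]
    fold_val = [train_set[i] for i in val_idx]
    print("Fold %d: %d train / %d val" % (fold, len(fold_train), len(fold_val)))
    # ...train one MIScnn model per fold here...

# Finally: predict on test_set with all 5 fold models, ensemble the predictions
# (e.g. pixel-wise mean) and compute the DSCs on the ensembled masks.
```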
Hope that I was able to give you some insights/feedback! :)
Cheers, Dominik
then averaged these 5 predictions into a single one
Hi, could you please explain how you averaged these predictions into one? Did you just average the metrics computed from the 5 predictions for each sample?
Hey @emmanuel-nwogu,
correct. In this study, we just averaged the predictions pixelwise via mean.
Cheers, Dominik
Thanks for the reply. From my understanding, you average the predicted binary masks to generate a final prediction mask. Is there a common name for this in the literature?
Happy to help! :)
Absolutely correct!
Sadly, to my knowledge, there is no community-accepted name for the functions used to combine predictions originating from ensemble learning. Most of the time, when you read about ensemble learning in biomedical image classification or segmentation, the authors applied averaging via the mean and simply call it averaging.
Last year, we published an experimental analysis of ensemble learning in medical image classification, in which I called the combination methods for merging multiple predictions "pooling functions" and the averaging "mean" (which can be either unweighted or weighted). Check it out here: https://ieeexplore.ieee.org/document/9794729/
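To make the terminology a bit more concrete, here is a tiny sketch of the two mean pooling variants (unweighted vs. weighted); this is an illustration only, not taken from the paper's code:

```python
import numpy as np

def mean_pooling(predictions, weights=None):
    """Combine the softmax outputs of several ensemble members.

    predictions: array-like of shape (n_members, ..., n_classes)
    weights:     optional per-member weights, e.g. derived from validation scores
    """
    predictions = np.asarray(predictions, dtype=float)
    if weights is None:                                 # unweighted mean
        return predictions.mean(axis=0)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                   # normalize the weights
    return np.tensordot(weights, predictions, axes=1)   # weighted mean

# Hypothetical example: 3 members predicting 2 classes for a single pixel
preds = [[0.6, 0.4], [0.7, 0.3], [0.2, 0.8]]
print(mean_pooling(preds))                       # unweighted averaging
print(mean_pooling(preds, weights=[2, 1, 1]))    # weighted averaging
```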
Hope this explains/helps a little bit on general ensemble learning in biomedical image analysis.
Best Regards, Dominik
Thanks, I'll check it out. :)