MIC-DKFZ / nnDetection

nnDetection is a self-configuring framework for 3D (volumetric) medical object detection which can be applied to new data sets without manual intervention. It includes guides for 12 data sets that were used to develop and evaluate the performance of the proposed method.
Apache License 2.0

[Question] Low prediction scores #51

Closed manonvanerp closed 2 years ago

manonvanerp commented 2 years ago

Question

Hi, I've been using nnDetection for quite some time now, and in my experiments the prediction scores I get in test_predictions are usually quite low (max 0.45 most of the time). With previous detection networks I would get scores around 0.9 or even higher. Is it common to have such scores in nnDetection? Should I train for more epochs? Maybe you have a bit more insight into this than me. Thanks in advance. The results are good btw, thanks for sharing this approach.

mibaumgartner commented 2 years ago

Dear @manonvanerp ,

That is indeed quite low; usually there are a bunch of predictions with high probabilities (0.9+). The exact distribution depends on the underlying dataset, e.g. for most tasks we see a U shape (i.e. many predictions with high and low probabilities and fewer predictions around 0.5). Some very difficult datasets only have very few predictions with high probabilities, though.

One thing to check: to get a better understanding of what exactly causes the low predictions, looking at the 5-fold CV results might be a good start. If the test score is low, this could be due to severe disagreement within the ensemble (i.e. only 2/5 models predicting the object) or to all models agreeing on a low score. The latter could be an indication that the config needs to be adjusted for your dataset (e.g. switching to a balanced sampler if there is a severe class imbalance or if many scans/patients are empty). The former could be quite tricky (maybe some kind of distribution shift between the folds? If multiple classes are present and there is a large class imbalance, the folds might not be balanced well, since the default is a simple random split).
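A minimal sketch of such a per-fold check (assuming the default results layout with one pickled `*_boxes.pkl` per case containing a `pred_scores` array under `fold{X}/val_results/boxes`; the paths and key names are assumptions and may differ in your version):

```python
# Sketch: summarise validation prediction scores per fold to see whether all
# models agree on low scores or only some folds detect the objects at all.
# The model directory, boxes layout, and the 'pred_scores' key are assumptions.
import pickle
from pathlib import Path

import numpy as np

model_dir = Path("models/Task000_Example/RetinaUNetV001_D3V001_3d")  # hypothetical model directory

for fold in range(5):
    scores = []
    for case_file in sorted((model_dir / f"fold{fold}" / "val_results" / "boxes").glob("*_boxes.pkl")):
        with open(case_file, "rb") as fh:
            case = pickle.load(fh)
        scores.append(np.asarray(case["pred_scores"]).ravel())
    scores = np.concatenate(scores) if scores else np.array([])
    if scores.size:
        print(f"fold {fold}: n={scores.size}, mean={scores.mean():.3f}, "
              f"max={scores.max():.3f}, frac>0.5={(scores > 0.5).mean():.3f}")
```

If all folds show uniformly low scores, the config is the more likely culprit; if one or two folds stand out, the split itself deserves a closer look.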

Best, Michael

manonvanerp commented 2 years ago

Hi Michael,

Thanks a lot for the quick and in-depth response. I took a quick look at the prediction scores on the validation set for the different folds. It looks like there are 2 folds whose average prediction scores (0.0598 and 0.0506) are about double those of the other 3 folds (0.0285, 0.0291 and 0.0327). If I zoom in on the 20 highest scores, these 2 folds get average prediction scores of 0.152 and 0.234; the other folds get 0.119, 0.124 and 0.112. So it could indeed be that the models in the ensemble disagree too much. I actually only trained with one class for now, but my data does have potential for different classes, so I will take a closer look at how the cases are split over the different folds. I didn't include empty scans in this case.

Again thanks for the response! Let me know if you are interested in a follow up or would prefer I don't bother you anymore :).

Best, Manon

mibaumgartner commented 2 years ago

Dear Manon,

Yes, I would be happy to hear about the follow-up; it is always interesting to hear from other projects using nnDetection. The average scores are a good quantitative measure, but you could also check the histograms in the [val/test]_results/boxes folder for a qualitative analysis.
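If those pre-generated histograms are not at hand, a quick sketch for plotting the score distribution of one fold yourself (same layout assumptions as above: per-case `*_boxes.pkl` files with a `pred_scores` entry; adapt the path to your setup):

```python
# Sketch: plot the score distribution of a single fold's validation predictions.
# The boxes directory and the 'pred_scores' key are assumptions about the layout.
import pickle
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np

boxes_dir = Path("models/Task000_Example/RetinaUNetV001_D3V001_3d/fold0/val_results/boxes")  # hypothetical
scores = np.concatenate([
    np.asarray(pickle.load(open(f, "rb"))["pred_scores"]).ravel()
    for f in sorted(boxes_dir.glob("*_boxes.pkl"))
])

plt.hist(scores, bins=50, range=(0, 1))
plt.xlabel("predicted probability")
plt.ylabel("number of predicted boxes")
plt.title("fold 0: validation score distribution")
plt.savefig("fold0_score_hist.png")
```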

Best, Michael

manonvanerp commented 2 years ago

Hi Michael,

I think it's indeed a class imbalance problem in the sampling. I've now trained a model with multiple classes and it gets about the same prediction scores. What really stands out to me is the FROC curve in the val_results of a single fold: it seems the model is totally unable to detect cases of class 0. How can this happen? I've looked at splits.pkl and counted the different classes: the train set is split 33%/66% class 0/class 1, and the validation set 35%/65% class 0/class 1. Not a nice 50/50 split, but also not so bad that I would expect all metrics for class 0 to be 0. I've looked into the different data samplers inside nnDetection and it's probably better to use "DataLoader3DBalanced". Does this influence the sampling of batches or of the entire train/validation split? Could you possibly provide some more information about the parameter 'oversample_foreground_percent'? I don't really understand how I should set it properly for my dataset.
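For reference, this is roughly how I counted the classes per fold (a small sketch assuming splits.pkl is a pickled list of {'train': [...], 'val': [...]} dicts and that every case in labelsTr has a {case}.json with an "instances" mapping from instance id to class id; the paths are placeholders):

```python
# Sketch: count how the object classes are distributed over train/val of each fold.
# The splits.pkl structure and the {case}.json "instances" convention are assumptions.
import json
import pickle
from collections import Counter
from pathlib import Path

LABELS_DIR = Path("Task000_Example/raw_splitted/labelsTr")  # hypothetical task name
SPLITS = Path("Task000_Example/preprocessed/splits.pkl")    # hypothetical location

with open(SPLITS, "rb") as fh:
    splits = pickle.load(fh)

def class_counts(case_ids):
    counts = Counter()
    for cid in case_ids:
        with open(LABELS_DIR / f"{cid}.json") as fh:
            counts.update(json.load(fh)["instances"].values())
    return counts

for i, split in enumerate(splits):
    print(f"fold {i}: train={dict(class_counts(split['train']))} "
          f"val={dict(class_counts(split['val']))}")
```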

Best, Manon

mibaumgartner commented 2 years ago

Dear Manon,

To answer your question in full depth we need to clarify the different phases/parts:


Phases:


There are quite a lot of caveats regarding how patients and patches are sampled, and I hope to improve this in a future release. For now, it is probably best to use DataLoader3DBalanced for multi-class problems (assuming the "difficulty" of the classes is roughly comparable).
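To illustrate the idea only (this is not the actual nnDetection dataloader code): oversample_foreground_percent forces a fraction of each batch to be sampled around a foreground object, and a class-balanced sampler additionally draws the class of that forced object uniformly instead of proportionally to its frequency. A toy sketch:

```python
# Conceptual sketch (NOT the real nnDetection implementation) of how a batch is
# assembled: the last share of the batch is forced to contain a foreground
# object; the balanced variant picks the class of that object uniformly.
import random

def sample_batch(cases, batch_size, oversample_foreground_percent=0.5, balanced=True):
    """cases: {case_id: set of classes present in that case} -- toy representation."""
    all_classes = sorted({c for present in cases.values() for c in present})
    batch = []
    for i in range(batch_size):
        # the last `oversample_foreground_percent` fraction of the batch must contain foreground
        force_fg = i >= round(batch_size * (1 - oversample_foreground_percent))
        if force_fg:
            if balanced:
                cls = random.choice(all_classes)  # uniform over classes, ignoring their frequency
                pool = [cid for cid, present in cases.items() if cls in present]
            else:
                pool = [cid for cid, present in cases.items() if present]  # any foreground case
            batch.append((random.choice(pool), "patch centred on a foreground object"))
        else:
            batch.append((random.choice(list(cases)), "random patch"))
    return batch

# toy example: class 1 dominates, yet the balanced foreground patches still cover class 0
toy = {"case_a": {1}, "case_b": {1}, "case_c": {0, 1}, "case_d": {0}}
print(sample_batch(toy, batch_size=4))
```

This only affects how patches are drawn during training; it does not change the train/validation split itself.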

Other experiments which might generate additional insights: a potentially interesting one would be to train a network for each class individually, e.g. create a dataset that only contains objects of class 0 (make sure to use the same split though) and check the performance. If this experiment works fine, something else might be off.
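A rough sketch of how such a single-class copy of the raw data could be created (assuming the raw_splitted convention with {case}.nii.gz instance masks plus {case}.json "instances" mappings; task names and paths are placeholders, and imagesTr/dataset.json would still need to be copied separately):

```python
# Sketch: build a label copy in which only instances of one class are kept,
# so a single-class network can be trained on the same split.
# Paths, task names and the "instances" json convention are assumptions.
import json
from pathlib import Path

import numpy as np
import SimpleITK as sitk

SRC = Path("Task000_Example/raw_splitted/labelsTr")        # hypothetical source task
DST = Path("Task001_ExampleClass0/raw_splitted/labelsTr")  # hypothetical single-class copy
KEEP_CLASS = 0
DST.mkdir(parents=True, exist_ok=True)

for js in SRC.glob("*.json"):
    meta = json.loads(js.read_text())
    keep = {iid: cls for iid, cls in meta["instances"].items() if cls == KEEP_CLASS}

    # zero out all instances that belong to other classes in the instance mask
    img = sitk.ReadImage(str(js.with_suffix(".nii.gz")))
    arr = sitk.GetArrayFromImage(img)
    arr[~np.isin(arr, [int(i) for i in keep])] = 0
    out = sitk.GetImageFromArray(arr)
    out.CopyInformation(img)

    sitk.WriteImage(out, str(DST / js.with_suffix(".nii.gz").name))
    (DST / js.name).write_text(json.dumps({"instances": keep}))
```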

Best, Michael