MIC-DKFZ / nnDetection

nnDetection is a self-configuring framework for 3D (volumetric) medical object detection which can be applied to new data sets without manual intervention. It includes guides for 12 data sets that were used to develop and evaluate the performance of the proposed method.
Apache License 2.0

[Question] Low prediction scores #51

Closed manonvanerp closed 2 years ago

manonvanerp commented 2 years ago

Question

Hi, I've been using nnDetection for quite some time now, and in my experiments the prediction scores I get in test_predictions are usually quite low (max 0.45 most of the time). With previous detection networks I would get scores around 0.9 or even higher. Is it common to have such scores in nnDetection? Should I train for more epochs? Maybe you have a bit more insight into this than me. Thanks in advance. The results are good btw, thanks for sharing this approach.

mibaumgartner commented 2 years ago

Dear @manonvanerp ,

That is indeed quite low; usually there are a bunch of predictions with high probabilities (0.9+). The exact distribution depends on the underlying dataset, e.g. for most tasks we see a U shape (i.e. many predictions with high and low probabilities and fewer predictions around 0.5). Some very difficult datasets only have very few predictions with high probabilities, though.

One thing to check: to get a better understanding of what exactly causes the low predictions, looking at the 5-fold CV results might be a good start. If the test score is low, this could be due to severe disagreement within the ensemble (i.e. only 2/5 models predicting the object) or to all models agreeing on a low score. The latter could be an indication that the config needs to be adjusted for your dataset (e.g. switching to a balanced sampler if there is a severe class imbalance or if many scans/patients are empty). The former could be quite tricky (maybe some kind of distribution shift between the folds? If multiple classes are present and there is a large class imbalance, the folds might not be balanced well, since the default is a simple random split).
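A minimal sketch of such a per-fold check (assuming the default results layout with one pickled `*_boxes.pkl` per case containing a `pred_scores` array under `fold{X}/val_results/boxes`; the paths and key names are assumptions and may differ in your version):

```python
# Sketch: summarise validation prediction scores per fold to see whether all
# models agree on low scores or only some folds detect the objects at all.
# The model directory, boxes layout, and the 'pred_scores' key are assumptions.
import pickle
from pathlib import Path

import numpy as np

model_dir = Path("models/Task000_Example/RetinaUNetV001_D3V001_3d")  # hypothetical model directory

for fold in range(5):
    scores = []
    for case_file in sorted((model_dir / f"fold{fold}" / "val_results" / "boxes").glob("*_boxes.pkl")):
        with open(case_file, "rb") as fh:
            case = pickle.load(fh)
        scores.append(np.asarray(case["pred_scores"]).ravel())
    scores = np.concatenate(scores) if scores else np.array([])
    if scores.size:
        print(f"fold {fold}: n={scores.size}, mean={scores.mean():.3f}, "
              f"max={scores.max():.3f}, frac>0.5={(scores > 0.5).mean():.3f}")
```

If all folds show uniformly low scores, the config is the more likely culprit; if one or two folds stand out, the split itself deserves a closer look.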

Best, Michael

manonvanerp commented 2 years ago

Hi Michael,

Thanks a lot for the quick and in-depth response. I took a quick look at the prediction scores on the validation set for the different folds. It looks like there are 2 folds whose average prediction scores (0.0598 and 0.0506) are about double those of the other 3 folds (0.0285, 0.0291 and 0.0327). If I zoom in on the 20 highest scores, these 2 folds get average prediction scores of 0.152 and 0.234; the other folds get 0.119, 0.124 and 0.112. So it could indeed be that the models in the ensemble disagree too much. I actually only trained with one class for now, but my data does have potential for different classes, so I will take a closer look at how the cases are split over the different folds. I didn't include empty scans in this case.

Again thanks for the response! Let me know if you are interested in a follow up or would prefer I don't bother you anymore :).

Best, Manon

mibaumgartner commented 2 years ago

Dear Manon,

Yes, I would be happy to hear about the follow-up; it is always interesting to hear from other projects using nnDetection. The average scores are a good quantitative measure, but you could also check the histograms in the [val/test]_results/boxes folder for a qualitative analysis.
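If those pre-generated histograms are not at hand, a quick sketch for plotting the score distribution of one fold yourself (same layout assumptions as above: per-case `*_boxes.pkl` files with a `pred_scores` entry; adapt the path to your setup):

```python
# Sketch: plot the score distribution of a single fold's validation predictions.
# The boxes directory and the 'pred_scores' key are assumptions about the layout.
import pickle
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np

boxes_dir = Path("models/Task000_Example/RetinaUNetV001_D3V001_3d/fold0/val_results/boxes")  # hypothetical
scores = np.concatenate([
    np.asarray(pickle.load(open(f, "rb"))["pred_scores"]).ravel()
    for f in sorted(boxes_dir.glob("*_boxes.pkl"))
])

plt.hist(scores, bins=50, range=(0, 1))
plt.xlabel("predicted probability")
plt.ylabel("number of predicted boxes")
plt.title("fold 0: validation score distribution")
plt.savefig("fold0_score_hist.png")
```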

Best, Michael

manonvanerp commented 2 years ago

Hi Michael,

I think it's indeed a class imbalance problem in the sampling. I've now trained a model with multiple classes and it gets about the same prediction scores. What really stands out to me is the FROC curve in the val_results of a single fold: it seems the model is totally unable to detect cases of class 0. How can this happen? I've looked at splits.pkl and counted the different classes: the train set is split 33%/66% class 0/class 1, and the validation set 35%/65% class 0/class 1. Not a nice 50/50 split, but also not so bad that I would expect all metrics for class 0 to be 0. I've looked into the different data samplers inside nnDetection and it's probably better to use "DataLoader3DBalanced". Does this influence the sampling of batches or of the entire train/validation split? Could you possibly provide some more information about the parameter 'oversample_foreground_percent'? I don't really understand how I should set it properly for my dataset.
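For reference, this is roughly how I counted the classes per fold (a small sketch assuming splits.pkl is a pickled list of {'train': [...], 'val': [...]} dicts and that every case in labelsTr has a {case}.json with an "instances" mapping from instance id to class id; the paths are placeholders):

```python
# Sketch: count how the object classes are distributed over train/val of each fold.
# The splits.pkl structure and the {case}.json "instances" convention are assumptions.
import json
import pickle
from collections import Counter
from pathlib import Path

LABELS_DIR = Path("Task000_Example/raw_splitted/labelsTr")  # hypothetical task name
SPLITS = Path("Task000_Example/preprocessed/splits.pkl")    # hypothetical location

with open(SPLITS, "rb") as fh:
    splits = pickle.load(fh)

def class_counts(case_ids):
    counts = Counter()
    for cid in case_ids:
        with open(LABELS_DIR / f"{cid}.json") as fh:
            counts.update(json.load(fh)["instances"].values())
    return counts

for i, split in enumerate(splits):
    print(f"fold {i}: train={dict(class_counts(split['train']))} "
          f"val={dict(class_counts(split['val']))}")
```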

Best, Manon

mibaumgartner commented 2 years ago

Dear Manon,

To answer your question in full depth we need to clarify the different phases/parts:


Phases:


There are quite a lot of caveats regarding how patients and patches are sampled, and I hope to improve this in a future release. For now, it is probably best to use DataLoader3DBalanced for multi-class problems (assuming the "difficulty" of the classes is roughly comparable).
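To illustrate the idea only (this is not the actual nnDetection dataloader code): oversample_foreground_percent forces a fraction of each batch to be sampled around a foreground object, and a class-balanced sampler additionally draws the class of that forced object uniformly instead of proportionally to its frequency. A toy sketch:

```python
# Conceptual sketch (NOT the real nnDetection implementation) of how a batch is
# assembled: the last share of the batch is forced to contain a foreground
# object; the balanced variant picks the class of that object uniformly.
import random

def sample_batch(cases, batch_size, oversample_foreground_percent=0.5, balanced=True):
    """cases: {case_id: set of classes present in that case} -- toy representation."""
    all_classes = sorted({c for present in cases.values() for c in present})
    batch = []
    for i in range(batch_size):
        # the last `oversample_foreground_percent` fraction of the batch must contain foreground
        force_fg = i >= round(batch_size * (1 - oversample_foreground_percent))
        if force_fg:
            if balanced:
                cls = random.choice(all_classes)  # uniform over classes, ignoring their frequency
                pool = [cid for cid, present in cases.items() if cls in present]
            else:
                pool = [cid for cid, present in cases.items() if present]  # any foreground case
            batch.append((random.choice(pool), "patch centred on a foreground object"))
        else:
            batch.append((random.choice(list(cases)), "random patch"))
    return batch

# toy example: class 1 dominates, yet the balanced foreground patches still cover class 0
toy = {"case_a": {1}, "case_b": {1}, "case_c": {0, 1}, "case_d": {0}}
print(sample_batch(toy, batch_size=4))
```

This only affects how patches are drawn during training; it does not change the train/validation split itself.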

Other experiments which might generate additional insights: a potentially interesting one would be to train a network for each class individually, e.g. create a dataset that only contains objects of class 0 (make sure to use the same split though) and check the performance. If this experiment works fine, something else might be off.
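A rough sketch of how such a single-class copy of the raw data could be created (assuming the raw_splitted convention with {case}.nii.gz instance masks plus {case}.json "instances" mappings; task names and paths are placeholders, and imagesTr/dataset.json would still need to be copied separately):

```python
# Sketch: build a label copy in which only instances of one class are kept,
# so a single-class network can be trained on the same split.
# Paths, task names and the "instances" json convention are assumptions.
import json
from pathlib import Path

import numpy as np
import SimpleITK as sitk

SRC = Path("Task000_Example/raw_splitted/labelsTr")        # hypothetical source task
DST = Path("Task001_ExampleClass0/raw_splitted/labelsTr")  # hypothetical single-class copy
KEEP_CLASS = 0
DST.mkdir(parents=True, exist_ok=True)

for js in SRC.glob("*.json"):
    meta = json.loads(js.read_text())
    keep = {iid: cls for iid, cls in meta["instances"].items() if cls == KEEP_CLASS}

    # zero out all instances that belong to other classes in the instance mask
    img = sitk.ReadImage(str(js.with_suffix(".nii.gz")))
    arr = sitk.GetArrayFromImage(img)
    arr[~np.isin(arr, [int(i) for i in keep])] = 0
    out = sitk.GetImageFromArray(arr)
    out.CopyInformation(img)

    sitk.WriteImage(out, str(DST / js.with_suffix(".nii.gz").name))
    (DST / js.name).write_text(json.dumps({"instances": keep}))
```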

Best, Michael