MIC-DKFZ / nnDetection

nnDetection is a self-configuring framework for 3D (volumetric) medical object detection which can be applied to new data sets without manual intervention. It includes guides for 12 data sets that were used to develop and evaluate the performance of the proposed method.
Apache License 2.0

[Question] Post-processing #144

Closed: machur closed this issue 1 year ago

machur commented 1 year ago

Hi @mibaumgartner, I'm running experiments to compare nnDetection with the original nnUNet. I've analyzed nnDetection predictions against the volumetric ROIs generated from nnUNet segmentations and I have some questions. There is a huge discrepancy between the smallest objects detected by nnUNet and by nnDetection - do you perform any type of post-processing? Do you remove the smallest blobs?

I've calculated blob-based statistics for both sets of results: the smallest blob found by nnDetection is 18mm3, while the smallest blob found by nnUNet is 0.12mm3 (probably a single voxel), so the difference is significant. Maybe it's a result of the resampling of the predicted bounding boxes or the data resolution? Also, we already talked about the bounding box indices in another issue, so maybe it's related as well. Do you have any advice?

mibaumgartner commented 1 year ago

Yes, nnDetection performs some postprocessing to discard very small predictions; the exact threshold is determined during the post-processing sweep and can be found in the plan_inference.pkl file in the training folder. A similar post-processing sweep over parameters (including a minimal size) was also performed for the nnU-Net predictions in our paper.

Maybe the discrepancy happens due to the bounding box convention?
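
For reference, the value can be inspected with a plain pickle load (the path below is only illustrative, and since the plan stores references to nnDetection functions, nndet has to be importable when loading):

```python
import pickle
from pathlib import Path

# Illustrative path: point this at the plan_inference.pkl in your training folder.
plan_path = Path("/path/to/training_folder/plan_inference.pkl")

with plan_path.open("rb") as f:
    # Loading needs nndet importable because the plan pickles function references.
    plan = pickle.load(f)

print(plan["inference_plan"]["remove_small_boxes"])
```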

machur commented 1 year ago

It looks like the post-processing is the major culprit here. Unfortunately, 25% of the lesion instances in our dataset are smaller than 10mm3. I don't see an easy way to disable all post-processing from the command line, though. Is there a workaround for predict.py, e.g., by overriding some steps via the "-o" parameter? What would you advise?

I've analyzed the plan_inference.pkl, and under the "inference_plan" key there is "remove_small_boxes": 3.0. If it's in mm3, it's surprising that the smallest predicted bounding box is 18mm3. I would still expect some boxes between 3mm3 and 18mm3, but I guess it's a starting point for further investigation.

```python
print(pkl["postprocessing"])
# {}

print(pkl["inference_plan"])
# {'model_iou': 1e-05, 'model_nms_fn': <function batched_weighted_nms_model at 0x14bb5ddc6af0>,
#  'model_score_thresh': 0.0, 'model_topk': 1000, 'model_detections_per_image': 100,
#  'ensemble_iou': 0.30000000000000004, 'ensemble_nms_fn': <function batched_wbc_ensemble at 0x14bb5d1b5790>,
#  'ensemble_topk': 1000, 'remove_small_boxes': 3.0, 'ensemble_score_thresh': 0.0}
```

mibaumgartner commented 1 year ago

I think setting remove_small_boxes=0 would be a great starting point.

The threshold is currently applied in pixel space (which is probably suboptimal), meaning that the minimal box size in pixel space needs to be 3 -> given the nnDetection box convention, this results in a smallest predictable object size of 2x2x2 pixels, which might explain the 18mm3?
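
To make that concrete, here is a minimal sketch of such a pixel-space size filter (illustrative only, analogous in spirit to torchvision.ops.remove_small_boxes; it is not the actual nnDetection code, and the exact box convention can shift the effective cutoff by a pixel):

```python
import numpy as np

def filter_boxes_by_min_side(boxes: np.ndarray, min_size: float) -> np.ndarray:
    """Illustrative: keep only boxes whose every side (in pixels) is >= min_size.

    Assumes an (x1, y1, x2, y2, z1, z2) layout with extents measured as max - min;
    the real nnDetection convention may differ slightly.
    """
    extents = np.stack([
        boxes[:, 2] - boxes[:, 0],
        boxes[:, 3] - boxes[:, 1],
        boxes[:, 5] - boxes[:, 4],
    ], axis=1)
    keep = (extents >= min_size).all(axis=1)
    return boxes[keep]

boxes = np.array([
    [10.0, 10.0, 12.0, 12.0, 5.0, 7.0],   # ~2 px per side -> dropped at min_size=3
    [10.0, 10.0, 14.0, 14.0, 5.0, 9.0],   # ~4 px per side -> kept
])
print(filter_boxes_by_min_side(boxes, min_size=3.0))
```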

machur commented 1 year ago

I ran the inference one more time with remove_small_boxes=0, but the results are not very different; only a couple of new boxes have been found. It looks like the main reason is the "indices-related" filtering of the smallest objects of size 2^3.

Since the threshold is expressed in pixel space instead of world coordinates and was introduced to resolve a technical issue, I assume that the same step is also performed before training as part of pre-processing? @mibaumgartner would you please confirm? If that's true, I'm not surprised that the network cannot recall any small boxes on unseen data. As the next step I'm planning to translate the threshold to mm somehow, to better understand which boxes were affected.

One more thing about the hyperparameter selection that seems suspicious: during the parameter sweep, the following values for small-box removal were found for the 5 models: [2.0, 3.0, 0.01, 3.0, 3.0]. The value 3.0 was selected for the inference of the ensemble. It looks like one of the models was aware of smaller boxes (hence the 0.01), but that aspect got lost during the consolidation. I was not able to quickly follow the mechanism of the parameter sweep, but I guess it may be expected behavior.

mibaumgartner commented 1 year ago

Thanks for your detailed update, I'll give it another thought as well and try to reproduce this with a toy dataset containing tiny objects.

The filtering is only applied during the post-processing of the detections; no filtering is done on the ground-truth labels or during pre-processing.

The parameter sweep simply tries a set of different parameters on the validation set and selects the best value (in terms of the AP metric) for test-set inference. So it is interesting that removing the small predictions seems to benefit the AP value.
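
Schematically, the selection boils down to an argmax over the candidate values (toy numbers below, purely for illustration; this is not the actual consolidation code):

```python
import numpy as np

# Toy illustration: invented validation AP values for each candidate threshold.
candidate_min_sizes = np.array([0.01, 1.0, 2.0, 3.0, 5.0])
val_ap = np.array([0.41, 0.42, 0.44, 0.45, 0.40])

selected = candidate_min_sizes[np.argmax(val_ap)]
print(f"selected remove_small_boxes={selected}")  # -> 3.0 in this toy example
```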

One thing I was wondering about: are there many objects which have a size of one pixel? This might become a problem during resampling, both during preprocessing and online data augmentation, where single-pixel objects can get lost. While nnU-Net might tolerate this behaviour (since only a single pixel is missing from its loss computation), it has quite a large influence on nnDetection, where an entire object is missing.
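
A quick way to see this effect (toy example with scipy.ndimage.zoom standing in for the actual resampling; not the nnDetection preprocessing code):

```python
import numpy as np
from scipy.ndimage import zoom

# A 20^3 mask containing a single positive voxel.
mask = np.zeros((20, 20, 20), dtype=np.float32)
mask[10, 10, 10] = 1.0

# Downsample by 2x with linear interpolation and re-binarize, roughly what can
# happen when the target spacing is coarser than the original spacing.
down = zoom(mask, 0.5, order=1) > 0.5

print(mask.sum(), down.sum())  # -> 1.0 0: the single-voxel object disappears
```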

machur commented 1 year ago

Thanks @mibaumgartner, it would be great to double-check the sweeping.

Great point about the resampling; I'm going to check which blobs were removed during that step. Now I recall that I've seen warnings about lost instances in the logs.

From my side, I'm going to look closely at the bounding box removal in the code. I suspect that even though I disabled the post-processing, the 2^3 boxes were removed anyway. All statistics I've generated so far are volume-based; I will regenerate them as voxel-based metrics to track the nnDetection thresholds more easily. Thanks!
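
For the voxel-based statistics, something along these lines should be enough (generic connected-component counting with scipy; the spacing is only used to convert back to mm3 and is set to the base target spacing reported further below):

```python
import numpy as np
from scipy.ndimage import label

def blob_sizes(mask: np.ndarray, spacing=(1.0, 0.9765625, 1.0)):
    """Return per-blob sizes of a binary mask in voxels and in mm^3."""
    labelled, _ = label(mask > 0)
    voxel_counts = np.bincount(labelled.ravel())[1:]  # drop the background bin
    return voxel_counts, voxel_counts * float(np.prod(spacing))

# Toy usage: one single-voxel blob and one 2x2x3 blob.
mask = np.zeros((16, 16, 16), dtype=np.uint8)
mask[1, 1, 1] = 1
mask[8:10, 8:10, 8:11] = 1
print(blob_sizes(mask))  # -> (array([ 1, 12]), array([ 0.9765625, 11.71875 ]))
```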

machur commented 1 year ago

I stumbled upon the following pre-processing step:

[screenshot of the box-filtering step in the anchor-planning code]

It looks like something that will affect the training; the log output is as follows:

2023-02-15 13:58:54.993 | INFO | nndet.planning.architecture.boxes.c002:_plan_anchors:258 - Filtered 46 boxes, 1805 boxes remaining for anchor planning.

It's especially tricky for us since we are aware that we have some outliers in the dataset, but we distributed them evenly between the training set and the test set. Do you remember why this step was added? Was it just a precaution?

mibaumgartner commented 1 year ago

The function is only used to determine the anchors of the network (i.e. the "templates" which are regressed and classified). Since we want the anchors to ignore outliers, we filter them before running the planning. The network is still trained on all data/objects regardless of that function, so it should still be able to detect all objects, irrespective of their size.
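
Just to illustrate the idea (a hypothetical percentile-based filter; the actual criterion in nndet.planning.architecture.boxes.c002 may differ):

```python
import numpy as np

def filter_outliers_for_anchor_planning(box_sizes: np.ndarray,
                                        lower_pct: float = 0.5,
                                        upper_pct: float = 99.5) -> np.ndarray:
    """Hypothetical sketch: drop boxes whose largest side is an outlier before
    planning anchors. Not the real nnDetection criterion."""
    largest_side = box_sizes.max(axis=1)
    lo, hi = np.percentile(largest_side, [lower_pct, upper_pct])
    keep = (largest_side >= lo) & (largest_side <= hi)
    return box_sizes[keep]

sizes = np.array([[2, 2, 3], [3, 3, 2], [40, 38, 25], [400, 380, 250]])
print(filter_outliers_for_anchor_planning(sizes, upper_pct=75.0))
```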

machur commented 1 year ago

Ok, thanks for the explanation. I think I understand now what's going on with our predictions.

nnDetection calculated the following anchors:

nndet.planning.architecture.boxes.c002:_plan_anchors:272 - Determined Anchors: {'width': [3.0, 2.0, 5.0], 'height': [3.0, 2.0, 5.0], 'depth': [2.0, 3.0, 5.0]}; Results in params: {'width': [(3.0, 2.0, 5.0), (6.0, 4.0, 10.0), (12.0, 8.0, 20.0), (12.0, 8.0, 20.0)], 'height': [(3.0, 2.0, 5.0), (6.0, 4.0, 10.0), (12.0, 8.0, 20.0), (24.0, 16.0, 40.0)], 'depth': [(2.0, 3.0, 5.0), (4.0, 6.0, 10.0), (8.0, 12.0, 20.0), (8.0, 12.0, 20.0)]}

The target spacing is:

nndet.planning.experiment.base:plan_base:166 - Base target spacing is [1. 0.9765625 1. ]

If I understand correctly, the smallest predicted bounding boxes will be regressed to sizes of about 12mm3 and 18mm3 in this case, and that's consistent with our results. I made the mistake of comparing the template-based bounding boxes of nnDetection with boxes generated artificially from segmentation masks (as 3D ROIs cropped to the masks). I guess there is no way to achieve smaller bounding boxes, even after resampling to the original spacing (with these particular nnDetection parameters).
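
As a quick sanity check of those numbers (assuming the per-axis anchor sizes are combined index-wise, i.e. the smallest anchors are 3x3x2 and 2x2x3 pixels, together with the base target spacing above):

```python
import numpy as np

spacing = np.array([1.0, 0.9765625, 1.0])   # base target spacing from the planning log
anchors_px = np.array([[3.0, 3.0, 2.0],     # (width, height, depth), combined index-wise
                       [2.0, 2.0, 3.0],
                       [5.0, 5.0, 5.0]])

volumes_mm3 = np.prod(anchors_px * spacing, axis=1)
print(volumes_mm3)  # -> approx. [17.58, 11.72, 122.07], i.e. the ~18 and ~12 mm3 above
```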

From what I've seen, the target spacing is selected in the same way as in nnUNet, but the smallest output of nnUNet is a single voxel whereas here we have 2x2x3 in pixel space, hence the discrepancy in the smallest instances.

machur commented 1 year ago

I've analyzed my issue further, and it looks like nnDetection does predict a lot of bounding boxes that are much smaller than the smallest anchors (initially I thought that this was the issue), but almost all of them are false positives with really low prediction scores. From what I've observed, the automatic post-processing step that you incorporated has had only a positive impact on our results.

There might be some areas in the nnDetection architecture to fine-tune for this, but I'm afraid that smaller blobs are simply under-represented in our dataset in comparison to the big ones (maybe the counts are similar, but the overall appearances differ a lot), and this is the main issue.

@mibaumgartner I think you may close this issue. Thank you for your time.

mibaumgartner commented 1 year ago

Hi @machur ,

thank you for the detailed update! Indeed, nnDetection performs a regression of the anchors and thus it can predict arbitrarily sized bounding boxes; the anchors are only intended as an initial "template". While nnDetection also performs some balancing on the object level, it is really hard to counteract imbalances, especially if "object difficulty" plays an additional role (I would assume that single-pixel objects are really difficult to detect anyway). Potentially, adjusting the sampling of the dataloader to prefer smaller objects might be one way to counteract this even more.
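
For illustration, such a size-aware sampling could look roughly like this (a sketch based on PyTorch's WeightedRandomSampler; the per-sample object sizes are a hypothetical lookup, not something nnDetection exposes directly):

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical: one representative object size (in voxels) per training sample.
object_sizes = np.array([1, 4, 12, 250, 4000, 12000], dtype=np.float64)

# Inverse-size weights so that samples containing small objects are drawn more often.
weights = 1.0 / np.sqrt(object_sizes)
sampler = WeightedRandomSampler(torch.as_tensor(weights),
                                num_samples=len(weights), replacement=True)

print(list(sampler))  # indices of small-object samples appear more frequently
```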

Best, Michael