facebookresearch / detr

End-to-End Object Detection with Transformers

Need suggestions to reduce the "bleeding" of predicted masks #255

Open danielfennhagencab opened 3 years ago

danielfennhagencab commented 3 years ago

Hi again!

I have had some success using detr; however, I run into an issue with the panoptic segmentation which I don't encounter using other models like Mask RCNN.

First, let's get some things out of the way:

Anyway, let's have a look at this image (which is from the same source as some of the generated training data, but is not in the training set): [image: carimage]

Firstly, the prediction is very good, and I would not complain if all my images were this good, but let's ignore that for now. What I'd like to focus on is the dark blue part at the far right.

We'll get back to this, but let's first have a look at what the object detection head has predicted for this image: [image: carimage_bbox]

There are a lot of bounding boxes here, but what I'd like to focus on is the rear_door: it clearly does not contain the far-right parts that we see in the previous image.

Alright, let's now get into a real example where the effect is more potent: [image: 045]

Boxes, just to make sure nothing is wrong:
Some labels are overlapping, but from experience I can say it's front_door and not_door overlapping.

The front fender I have no problem with, and I think it's obvious that the rear_door has some issues. I would like to add that this effect is not found in all images; I'd say that half of them are good predictions and half contain this "bleeding" effect.

The question I'd like to solve is: why is this occurring in detr and not in other models? I have trained a Mask RCNN model using the same dataset, and this effect never occurs in that model. Since Mask RCNN performs well on this dataset, I'd like to think the problem is not in the dataset. I understand that Mask RCNN's output is limited by the RoI, which keeps the masks from "bleeding" outside the detected bounding box; however, I thought the transformer would work in a similar way?
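To make my mental model concrete, here is a toy, shape-level sketch of the structural difference (the numbers are illustrative assumptions, not values from either codebase):

```python
import torch

# Mask RCNN predicts each mask inside its RoI (e.g. a 28x28 grid) and then
# pastes it back into the predicted box, so a mask can never extend past
# the box that produced it:
roi_masks = torch.rand(5, 28, 28)           # one small mask per detected box

# DETR's mask head instead emits one full-image (downsampled) heatmap per
# object query, so the mask and the box are only coupled through training,
# not by construction -- nothing structurally stops a mask from "bleeding":
query_masks = torch.rand(1, 100, 200, 267)  # [batch, num_queries, h, w]
```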

Mask RCNN output, in case the above statement is not believable: [image]

When detr predicts without the bleeding, the masks are consistently better than our other models'. However, the bleeding in about half of the images ruins the results.

Have you guys encountered this problem, and if so, what was your solution?

Thanks!

alcinos commented 3 years ago

Hi @DanielFennhagen, thank you for your interest in DETR.

I'm trying to understand precisely what you are trying to do, as that will guide us towards possible solutions. You mention that you are using DETR in a panoptic segmentation way, is this correct? If so, I presume you are also using our panoptic post-processor?

Panoptic segmentation is a particular task because it requires 1) non-overlapping masks and 2) that all pixels are annotated. I see that you are comparing with MaskRCNN, which does NOT do panoptic segmentation but classic instance segmentation (and indeed you can see that the masks are overlapping in your sample from Mask-RCNN). Could you clarify whether you indeed need the panoptic property (in which case your annotations should also respect the two conditions I listed) or not?

For your red car example, I don't understand why the mask image contains only 3 boxes while you plot 5 in the box-only image. What happened to the overlapping front_door and not_door boxes?

Also, it seems that the green car image is synthetic while the red one is real, is it the case that the model suffers from data distribution shift? In that case, I'd advise looking into more aggressive data-augmentation strategies.

Finally, last but not least, a bug was recently spotted in the mask head (see #247), it could reduce performance a bit in your case as well.

Best of luck

danielfennhagencab commented 3 years ago

Hi @alcinos, thanks for the quick reply!

You mention that you are using DETR in a panoptic segmentation way, is this correct?

Yes, I train detr using the coco_panoptic flag and feed the network panoptic images generated from our object_detection json file. (The network is, of course, also fed the normal images along with the object_detection json.)
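Concretely, the training invocation is along the lines of the repo's documented segmentation recipe (the paths here are placeholders for our setup):

```bash
# Fine-tune the mask head on panoptic annotations, starting from a trained
# box model, following the README's panoptic recipe; paths are placeholders.
python main.py --masks \
  --epochs 25 --lr_drop 15 \
  --dataset_file coco_panoptic \
  --coco_path /path/to/coco \
  --coco_panoptic_path /path/to/coco_panoptic \
  --frozen_weights /path/to/box_model/checkpoint.pth \
  --output_dir /path/to/segm_model
```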

An example panoptic training image: [image]
All pixels are colored, and again, this is generated using the segment data from our object_detection json file. There is a check at the generation stage which asserts that no segments overlap.
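That check is essentially the following (a minimal sketch with hypothetical names, assuming segments arrive as binary masks):

```python
import numpy as np

def build_panoptic_id_map(segment_masks):
    """Paint per-segment binary masks onto one id map, asserting no overlap.

    segment_masks: list of (segment_id, HxW bool array). Hypothetical format,
    just to illustrate the non-overlap check described above.
    """
    h, w = segment_masks[0][1].shape
    id_map = np.zeros((h, w), dtype=np.int32)  # 0 = unassigned
    for seg_id, mask in segment_masks:
        # A pixel already owned by another segment means overlapping annotations.
        assert not (id_map[mask] != 0).any(), f"segment {seg_id} overlaps another"
        id_map[mask] = seg_id
    return id_map
```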

If so, I presume you are also using our panoptic post-processor?

At inference we use the post-processor.
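Roughly, the flow follows the official panoptic colab; a sketch (the pretrained hub model stands in here for our own checkpoint):

```python
import io
import torch
import torchvision.transforms as T
from PIL import Image

# Pretrained panoptic model + matching post-processor from the hub (our real
# runs load our own fine-tuned checkpoint instead).
model, postprocessor = torch.hub.load(
    'facebookresearch/detr', 'detr_resnet101_panoptic',
    pretrained=True, return_postprocessor=True, num_classes=250)
model.eval()

transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
img = transform(Image.open('example.jpg').convert('RGB')).unsqueeze(0)

with torch.no_grad():
    out = model(img)

# The post-processor needs the (h, w) the masks should be rendered at; it
# merges the per-query masks into a single non-overlapping panoptic PNG.
result = postprocessor(out, torch.as_tensor(img.shape[-2:]).unsqueeze(0))[0]
panoptic_seg = Image.open(io.BytesIO(result['png_string']))
```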

Could you clarify if you indeed need the panoptic property ... or not?

We do not need the panoptic property; however, during some testing we found that the masks had clearer boundaries when we trained using the panoptic setting. Is this wrong?

For your red car example, I don't understand why the mask image contains only 3 boxes, while you plot 5 in the box-only image? What happened to the overlapping front_door and not_door boxes?

I believe the front_door mask had too low confidence and was culled from the result. I ran inference in your colab notebook and received this result: [image: 045]

And for the not_door, I remove that mask in my overlay function; sorry for not specifying that. I have not looked into the post-processor, so I don't know how customizable it is, but just in case, I displayed the results before the post-processor:

[images: results at 70% and 30% confidence-keep thresholds]
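That dump is just a thresholded look at the raw outputs; a minimal sketch, assuming `out` from the inference snippet above and mirroring the 0.7/0.3 thresholds:

```python
# Inspect raw per-query outputs before the panoptic merge.
scores = out['pred_logits'].softmax(-1)[0, :, :-1]  # drop the no-object class
for thresh in (0.7, 0.3):
    keep = scores.max(-1).values > thresh
    raw_masks = out['pred_masks'][0, keep].sigmoid()  # one soft mask per kept query
    print(f'threshold {thresh}: kept {int(keep.sum())} queries, '
          f'mask tensor shape {tuple(raw_masks.shape)}')
```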

Also, it seems that the green car image is synthetic while the red one is real, is it the case that the model suffers from data distribution shift? In that case, I'd advise looking into more aggressive data-augmentation strategies.

This is definitely something to keep in mind; however, I believe that the distribution is currently fair, considering that other models converge with decent results.

Finally, last but not least, a bug was recently spotted in the mask head (see #247), it could reduce performance a bit in your case as well.

I'll modify my fork to include the code from #247, thanks!

Thanks again, and looking forward to your response :)

alcinos commented 3 years ago

The panoptic post-processor is optimized to maximize the PQ metric, but I don't think that is very well aligned with what you want. In your example, since your front_door box is dropped, the remaining masks have to "fight" to fill the corresponding pixels (to ensure exactly one class per pixel). But from your raw mask plot it seems that you have a good front_door mask; it's just unfortunate that the confidence is low.

Here are a bunch of suggestions, some of which are mutually exclusive while others are additive; feel free to pick what makes sense:
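To make the "fight" concrete: the merge step effectively assigns each pixel to the best-scoring surviving mask, so culling a query hands its pixels to the next-best neighbor. A toy sketch of that assignment rule (not the repo's actual post-processor code):

```python
import torch

# Toy pixel-assignment rule: every pixel goes to the argmax over the masks
# that survived the confidence cut. Dropping a query (e.g. a low-confidence
# front_door) hands its pixels to whichever remaining mask scores next best,
# which shows up as neighboring masks "bleeding" into that region.
mask_logits = torch.randn(4, 50, 50)            # 4 query heatmaps, 50x50 image
keep = torch.tensor([True, True, False, True])  # query 2 culled by the threshold
assignment = mask_logits[keep].argmax(dim=0)    # pixels formerly won by query 2
                                                # are now claimed by the others
```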

Best of luck

Dicko87 commented 2 years ago

Hi there @DanielFennhagen, @alcinos, I am wondering if the above problem was ever solved, because I am running into exactly the same problem. My bounding boxes are perfect and the masks were looking good, but the panoptic png image is rubbish. Here are the details:

Lovely looking masks (although they all land on the same image, when there should be a mask for each stick), and the bounding boxes are good also. I just cannot show the colour image for confidentiality purposes. [image]

There are two bounding boxes with good confidence around the detected sticks, too. It seems it's the panoptic part that is looking bad, and something is also wrong in that all masks are being shown in one image rather than separately.

My process is this: [screenshot of the model and post-processor setup], which comes from here: [screenshot]

I then put the model in evaluation mode and feed it an image: [screenshot]

Then define result to be: [screenshot]

Then use it to give me the png image, which is rubbish: [screenshot]
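In case it helps with debugging this step: the `png_string` stores segment ids encoded as RGB, so the raw PNG is expected to look like colourful noise when viewed directly. A rough decoding sketch along the lines of the official colab (`result` is assumed to be the dict returned by the panoptic post-processor, as in the inference sketch earlier in the thread):

```python
import io

import numpy as np
from PIL import Image
from panopticapi.utils import rgb2id  # pip install git+https://github.com/cocodataset/panopticapi.git

def decode_panoptic(result):
    """Turn the post-processor's PNG into an HxW integer segment-id map.

    The 'png_string' encodes segment ids as RGB, which is why the raw PNG
    looks like noise; rgb2id recovers the integer ids.
    """
    panoptic_seg = np.array(Image.open(io.BytesIO(result['png_string'])),
                            dtype=np.uint8)
    segment_ids = rgb2id(panoptic_seg.copy())
    # Per-segment sanity check: id, class, and pixel count of each mask.
    for seg in result['segments_info']:
        print(seg['id'], seg['category_id'],
              int((segment_ids == seg['id']).sum()), 'pixels')
    return segment_ids
```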