What is the difference between SOLOv2 and CondInst (https://arxiv.org/pdf/2003.05664.pdf)?

chuong98 commented 4 years ago

Hello, thanks for the paper. I read the two papers from your lab, SOLOv2 and CondInst, but I can barely see the difference between the two methods, apart from Matrix NMS.

If I understand correctly, the output of the mask branch is no longer indexed by cell location categories as in SOLOv1 (HxWxS^2), so it no longer inherits the key idea of SOLO.

Would you help point out their differences and compare their performance in terms of speed? Thank you.

WXinlong commented 4 years ago

@chuong98 Thanks for your question. SOLOv2 follows the core design of SOLOv1. I suggest reading Section 3 of the SOLOv2 paper for the step-by-step derivation. It still segments objects by their cell location categories, and goes a step further by predicting the segmenters by location. Both the Decoupled SOLO head and the Dynamic SOLO head are variants of the SOLO idea (see Fig. 2 in our paper).
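To make "cell location categories" concrete, here is a minimal sketch, assuming the row-major mapping k = i*S + j between grid cell (i, j) and mask channel k described in the SOLO paper. The function name `cell_channel` and the example sizes are hypothetical and not taken from the SOLO codebase, and the sketch only handles a single center point rather than a center region.

```python
def cell_channel(cx, cy, img_w, img_h, S):
    """Map an object center (cx, cy), in pixels, to its grid cell and mask channel.

    Returns (i, j, k): row index, column index, and channel k = i * S + j.
    Hypothetical helper for illustration only.
    """
    j = min(int(cx / img_w * S), S - 1)  # column of the S x S grid
    i = min(int(cy / img_h * S), S - 1)  # row of the S x S grid
    return i, j, i * S + j


# An object centered at (300, 120) in a 640x480 image, with a 12x12 grid.
i, j, k = cell_channel(cx=300, cy=120, img_w=640, img_h=480, S=12)
print(i, j, k)  # 3 5 41
```

Under this mapping, each of the S^2 channels (or, in the dynamic variant, each of the S^2 predicted kernels) is tied to one absolute grid location.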

About the comparison between SOLOv2 and CondInst:

CondInst relies on relative position to distinguish instances, as in AdaptIS, while SOLOv2 uses absolute position, as in SOLOv1.
CondInst uses bounding box detection in training and inference, while SOLOv2 takes an image as input and directly outputs instance masks and the corresponding class probabilities. For example, CondInst has 4 or 5 loss terms and SOLOv2 has 2.
For more detailed differences in design choices, please refer to the papers.
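The first point can be illustrated with a rough sketch of the two kinds of coordinate maps. It assumes a CoordConv-style normalized absolute map for SOLO and a per-instance offset map for CondInst; the function names, the sign convention, and the lack of scaling are assumptions made for illustration, not either paper's exact formulation.

```python
def absolute_coords(h, w):
    """Normalized absolute (x, y) in [-1, 1] for every pixel.

    The same map is shared by all instances.
    """
    xs = [(-1.0 + 2.0 * x / (w - 1)) if w > 1 else 0.0 for x in range(w)]
    ys = [(-1.0 + 2.0 * y / (h - 1)) if h > 1 else 0.0 for y in range(h)]
    return [[(xs[x], ys[y]) for x in range(w)] for y in range(h)]


def relative_coords(h, w, cx, cy):
    """Offset (x - cx, y - cy) of every pixel from one instance's location.

    A different map is built for each instance location (cx, cy).
    """
    return [[(x - cx, y - cy) for x in range(w)] for y in range(h)]


print(absolute_coords(3, 3)[1][1])        # (0.0, 0.0): image center
print(relative_coords(3, 3, 0, 0)[1][1])  # (1, 1): offset from instance at (0, 0)
print(relative_coords(3, 3, 2, 2)[1][1])  # (-1, -1): offset from instance at (2, 2)
```

The same pixel gets one fixed absolute coordinate but a different relative coordinate for every instance, which is the distinction drawn above.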

To me, both are good works that explore instance segmentation from different viewpoints.

chuong98 commented 4 years ago

Thanks for your reply,

> It still segments objects by their cell location categories, and goes a step further by predicting the segmenters by locations

I just want to be clear. From my understanding:

For the Vanilla or Decoupled head, the output is a single mask tensor of size HxWxS^2, with S^2 categories. This is the unique idea of dual classification. I like it.
In contrast, the Dynamic head outputs several binary masks, each of size HxW. A binary mask is obtained by checking whether cell (i,j) is an object's center, then taking the kernel coefficients G of that cell and convolving them with the shared mask feature. This is identical to CondInst's idea and no longer performs the dual classification.
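The dynamic step described here can be sketched as follows, assuming a 1x1 kernel with no bias and plain Python lists in place of tensors. The function name `dynamic_mask`, the threshold, and the toy values are hypothetical; this illustrates the idea and is not the SOLOv2 implementation.

```python
import math


def dynamic_mask(kernel, features, threshold=0.5):
    """Apply one cell's predicted 1x1 kernel to the shared mask features.

    kernel:   list of C weights predicted for a single grid cell (i, j)
    features: C x H x W nested lists, shared by all grid cells
    Returns an H x W binary mask (nested lists of 0/1).
    """
    C = len(kernel)
    H, W = len(features[0]), len(features[0][0])
    mask = []
    for y in range(H):
        row = []
        for x in range(W):
            # A 1x1 convolution is a dot product over channels at each pixel.
            logit = sum(kernel[c] * features[c][y][x] for c in range(C))
            prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
            row.append(1 if prob > threshold else 0)
        mask.append(row)
    return mask


# Toy shared features with C=2 channels on a 2x2 map.
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
print(dynamic_mask([4.0, -4.0], features))  # [[1, 0], [0, 1]]
print(dynamic_mask([-4.0, 4.0], features))  # [[0, 1], [1, 0]]
```

Two different kernels applied to the same features yield two different masks, so each instance mask is determined by which cell's kernel is used.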

WXinlong commented 4 years ago

@chuong98 The dynamic scheme is somewhat similar, as both are inspired by Dynamic Filter Networks. But the methodology is different, as stated above. You could say that all roads lead to Rome, and we chose the simplest one.

WXinlong / SOLO

What the difference between SoLo V2 and CondInst (https://arxiv.org/pdf/2003.05664.pdf) #13