facebookresearch / MaskFormer

Per-Pixel Classification is Not All You Need for Semantic Segmentation (NeurIPS 2021, spotlight)

How is the label constructed? #13

Closed. github-luffy closed this issue 3 years ago.

github-luffy commented 3 years ago

The first step is to train the binary masks, so what do the labels in the loss look like? How is the N value set?

bowenc0221 commented 3 years ago

The binary masks are the foreground masks for every category present in the semantic segmentation ground truth, of shape N_gt x H x W, where each value is either 0 or 1.

N is set to 100 for all models.
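
For illustration, here is a minimal PyTorch sketch (not the repository's exact code) of how such binary masks can be built from a semantic label map; the function name and the `ignore_label` default are assumptions:

```python
import torch

def semantic_to_binary_masks(sem_seg: torch.Tensor, ignore_label: int = 255):
    """sem_seg: H x W tensor of category ids.
    Returns (classes, masks):
      classes: N_gt tensor of the category ids present in the image,
      masks:   N_gt x H x W float tensor with values 0 or 1."""
    classes = torch.unique(sem_seg)
    classes = classes[classes != ignore_label]  # drop ignored pixels
    # One binary foreground mask per category present in the ground truth
    masks = torch.stack([(sem_seg == c).float() for c in classes])
    return classes, masks
```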

github-luffy commented 3 years ago

Thanks for your reply, but there is another question: if N_gt is 20 and N is 100, how is the loss calculated?

bowenc0221 commented 3 years ago

We first perform a bipartite matching between the predictions (N = 100, for example) and the ground truths (N_gt = 20, for example) based on a classification cost and a mask cost. Only 20 out of the 100 predictions are assigned positive labels, with the corresponding matched ground truth as the target when calculating the loss; the remaining 80 predictions are assigned to the background (no mask loss).
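
A minimal sketch of this matching step, assuming a softmax classification cost and a simple L1 mask cost (the actual matcher in this repository also uses focal and dice mask costs, so treat this as illustrative only):

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_masks, gt_classes, gt_masks):
    """pred_logits: N x (K+1) class logits (last class = "no object"),
    pred_masks:  N x H x W mask logits,
    gt_classes:  N_gt category ids,
    gt_masks:    N_gt x H x W binary masks.
    Returns matched (pred_idx, gt_idx) pairs; the N - N_gt unmatched
    predictions are supervised with the "no object" label only."""
    prob = pred_logits.softmax(-1)                      # N x (K+1)
    cost_class = -prob[:, gt_classes]                   # N x N_gt
    cost_mask = torch.cdist(pred_masks.sigmoid().flatten(1),
                            gt_masks.flatten(1), p=1)   # N x N_gt L1 cost
    cost = cost_class + cost_mask                       # total matching cost
    # Hungarian algorithm on the rectangular cost matrix
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx
```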

github-luffy commented 3 years ago

> We first perform a bipartite matching between the predictions (N = 100, for example) and the ground truths (N_gt = 20, for example) based on a classification cost and a mask cost. Only 20 out of the 100 predictions are assigned positive labels, with the corresponding matched ground truth as the target when calculating the loss; the remaining 80 predictions are assigned to the background (no mask loss).

Great! But I am confused: why not directly set N to N_gt (N == N_gt)?

bowenc0221 commented 3 years ago

We found it performs better.

[Screenshot of Table 4 from the paper: bipartite vs. fixed matching ablation]

Even when N == N_gt, bipartite matching performs better than fixed matching (Table 4).

Please check our paper for more details.

github-luffy commented 3 years ago

Thanks.

xianshunw commented 2 years ago

> We first perform a bipartite matching between the predictions (N = 100, for example) and the ground truths (N_gt = 20, for example) based on a classification cost and a mask cost. Only 20 out of the 100 predictions are assigned positive labels, with the corresponding matched ground truth as the target when calculating the loss; the remaining 80 predictions are assigned to the background (no mask loss).

Hi, I am confused about why you don't assign the ground truths to the predictions by some pre-defined rule. Is it possible that the bipartite matching makes the network converge more slowly? For example, if the 1st prediction is assigned the 1st ground truth at some iteration but a different ground truth at some other iteration, the supervision signal is not consistent, which could result in slower training.

However, if you assigned the ground truths by some pre-defined rule such as instance size, then no matter the iteration, you could always ensure that the same prediction is assigned the same ground truth.