About ablation study on alpha_0

DISAPPEARED13 commented 2 years ago

Hi there, I notice that you ablate alpha_0 in Tab.7 of paper, showed that 20% is the best setting for your work. Firstly, the top-20% low confidence for negative samples and the top-20% high confidence pixels are positive samples. Is that correct?

you said:

Small alpha_0 will introduce incorrect pseudo labels for supervision, and large alpha_0 will make the information of some high-confidance samples underutilized.

I can understand why the larger alpha_0 will miss some useful information: for not apparently distinctions between the reliable pixels and the other ones. but why the smaller one will introduce the wrong information?

My thinking: I think the information from teacher is probably incorrect, but getting the low-confident prediction from teacher is much easier than getting the high ones. For the high-confident prediction, if teacher provide a large contiguous component of wrong predictions, will smaller alpha_0 make better sense?

My experiment: For example, it's some classes(but not all) in a patch image, but the teacher predict class A as class B, in the B-object which was predicted as A-object, it can still get high-confident pixels(inside the object as we all known just like Fig.2's motobike look), while compute the low_mask_all, but this metrics will provide wrong information, wrong predictions is the root cause. Although the number of iterations increasing, but how to make sure that everything will go as we expect from the "wrong begining"?

When I adjust alpha_0 form 20% to 15%, this wrong component is gone, the confusing B-object not belongs to high-confident set, and low-confident set, either. It seems that a smaller alpha_0 can filter the wrong prediction for stricter restrictions. But maybe patch-specific, I think.

Thanks a lot!

Haochen-Wang409 commented 2 years ago

Hi, thanks for your detailed explanation and even experiments!

the top-20% low confidence for negative samples and the top-20% high confidence pixels are positive samples?

Yes, this is totally correct. But I would like to explain the differences between positive samples and pseudo-labels first.

Positive samples are positive keys for a given query (or anchor) in contrastive learning. But pseudo-labels are considered as pixel-level ground-truths for unlabeled data and will participate in computing cross-entropy loss.

The reasons why small alpha_0 will bring extra incorrect pseudo-labels are summarized as follows.

Assume that the training is at epoch 0 (i.e., alpha_t = alpha_0 at this time). Refer to Eq. (6) and the definition of gamma_t in Eq. (6), when alpha_0 = 10%, that means top-90% predictions are regarded as pseudo-labels, while when alpha_0 = 50%, only top-50% predictions are left to be pseudo-labels. Obviously, the former introduce more noise than the latter.

By the way, I am a little bit confused about your experiment part, could you please provide a further explanation? It may be better to take Fig. 2 as an instance for better understanding.

DISAPPEARED13 commented 2 years ago

Oh, sure, thanks for reminding me of the difference for pseudo and positive keys.

I forget the Eq.(6) so that I think all day for the influnce on the unsupervise loss and contrast loss(the progress of computing these losses are toooooooo similar! :stuck_out_tongue_closed_eyes: ).

Thanks a lot! As in unsupervised loss, we can filter the pseudo labels by alpha_t, it's fine. And in contrast loss situation, we use the same setting of alpha_t, this means that there are some overlapping in positive samples and pseudo-labels. espacially in low_valid_pixel and high_valid_pixel, but clearly, they are not play the pseudo-label role in computing contrast loss.

I am just curious that what if using different alpha while getting the entropy_mask? For example, using 15% to filter the high-confident ones while 35% for the low-confident one. Is it reasonable?

My further explaination: Let me give an example of prediction mask.(The objects inside are not the same classes.)

This is the 20% filter to get the low_entropy_mask in unlabeled prediction from teacher.

And here is the 15% filter.

The red color means the low entropy, and I didn't plot the high entropy mask situation to keep it simple. The circle prediction in the center is correct, skip it. while the irregular prediction(uncorrect! not belong to this class), that's where I care much about, you can find the difference in it, for the different alpha's sake. and the smaller alpha restricted it for less pixels filtered.

To some extent, smaller alpha can provide from considering some incorrect predictions as positive candidates.

If the irregular prediction brings wrong information, it will degenerate when we sampling positive keys in that class(though might be a very small component).

Haochen-Wang409 commented 2 years ago

Did you mean using different alpha for unsupervised loss (Eq. (6)) and contrastive loss (Eq. (10))? Yes, of course you can. But these two hyper-parameters may need being further tuned.

If the irregular prediction brings wrong information, it will degenerate when we sampling positive keys in that class.

Yes, I agree with that. But it may be a little bit hard to exactly filter all incorrect predictions by a simple entropy threshold. Other advanced techniques should be involved to be the metric.

DISAPPEARED13 commented 2 years ago

Did you mean using different alpha for unsupervised loss (Eq. (6)) and contrastive loss (Eq. (10))?

Yes, sure. But might be just let the supervise loss play a main-guided role at the first begining will reduce this influence. Hard key seems still the most easy and efficient way to filter pixels.

Haochen-Wang409 / U2PL

About ablation study on alpha_0 #90