Vibashan / irg-sfda

Official PyTorch codebase for Instance Relation Graph Guided Source-Free Domain Adaptive Object Detection [CVPR 2023]
https://viudomain.github.io/irg-sfda-web/

The Teacher Network Update #4

Closed: kinredon closed this issue 2 years ago

kinredon commented 2 years ago

I noticed that the weights of the teacher network are updated every epoch. Usually, the teacher model is updated every iteration. Why did the authors choose this strategy?

https://github.com/Vibashan/irg-sfda/blob/5a640001cfc38ef3ac416935677a5b15d3918602/tools/train_st_sfda_net.py#L313

Also, the paper states that the keep_rate for the teacher update is 0.99, but the code here sets it to 0.9.

Vibashan commented 2 years ago

Hi @kinredon, as this is a source-free setup, we have no access to source data; we only have access to a source-trained model. To learn target-specific representations, we need to train the model on the unlabelled target domain using pseudo-labels. However, due to domain shift, the generated pseudo-labels are noisy, and self-training on top of noisy pseudo-labels leads to catastrophic forgetting. Hence we opted for the student-teacher framework.

During our initial experiments, we updated the teacher model every iteration; however, because the pseudo-labels are so noisy, the student model easily overfits to the noise. Further, with EMA updates, the noise from the student network is transferred to the teacher network at every iteration. Moreover, there is no supervision for the teacher network, as we have no access to any labelled data. Thus, after a few iterations, more and more noise gets transferred to the teacher network, and on some datasets the performance drops below the source-only baseline. To avoid this, we experimentally observed that updating the teacher once per epoch works best: no noise is transferred from the student network for an entire epoch, and in the meantime the student network learns a robust target representation.
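For reference, a minimal sketch of the EMA teacher update being discussed (function and argument names are illustrative, not the exact code in train_st_sfda_net.py):

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, keep_rate=0.99):
    # Exponential moving average: teacher <- keep_rate * teacher + (1 - keep_rate) * student.
    # Calling this every iteration transfers student noise to the teacher at every step;
    # the strategy described above calls it once per epoch instead.
    student_params = dict(student.named_parameters())
    for name, t_param in teacher.named_parameters():
        t_param.data.mul_(keep_rate).add_(student_params[name].data, alpha=1 - keep_rate)
```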

Thanks for pointing out the typo; I will update it.

kinredon commented 2 years ago

@Vibashan Thanks for your quick response. All my questions have been addressed. Thanks again.

Vibashan commented 2 years ago

Thanks @kinredon , if you have any more concerns, please feel free to contact me.

Thanks.

kinredon commented 2 years ago

@Vibashan I have another question about contrastive loss, which is implemented here:

https://github.com/Vibashan/irg-sfda/blob/5a640001cfc38ef3ac416935677a5b15d3918602/detectron2/modeling/meta_arch/losses.py#L60

I carefully read the code and the statement in the paper, and I find the implementation differs from the paper. Eq. (8) in the paper shows that the denominator is a sum over A(i), which excludes proposal i itself.
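For reference, a minimal sketch of the two denominator choices in question (hypothetical names, not the actual code in losses.py):

```python
import torch

def contrastive_denominator(feats, i, include_anchor, tau=0.07):
    # feats: (N, D) L2-normalized proposal features; i: index of the anchor proposal.
    sims = torch.exp(feats @ feats[i] / tau)   # exp-similarities of every proposal to the anchor
    if include_anchor:
        return sims.sum()                      # sums over A(i) plus proposal i itself
    mask = torch.ones(sims.shape[0], dtype=torch.bool)
    mask[i] = False
    return sims[mask].sum()                    # sums over A(i) only, excluding i, as in Eq. (8)
```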

Vibashan commented 2 years ago

Yes, "it is A(i), including proposal i "

Thanks a lot.

kinredon commented 2 years ago

@Vibashan I was also confused by the construction of the graph.

https://github.com/Vibashan/irg-sfda/blob/5a640001cfc38ef3ac416935677a5b15d3918602/detectron2/modeling/meta_arch/GCN.py#L78

Why is adj the L1 norm of the squared dot product of qx and kx? Does it have some advantage over Eq. (5) in the paper?

Vibashan commented 2 years ago

Our motivation is to utilize the graph network to understand the relationship between proposals: for a given proposal, we need to find its positive/similar proposals, so we need to model the relationship between positive pairs. In order to achieve this, we use the L1 norm, which provides sparsity while constructing the graph. In other words, the sparsifying property of the L1 norm prunes out relationships with non-correlated/negative proposals and focuses more on positive proposals as training proceeds.
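As a rough illustration of the adjacency construction being discussed (a sketch with assumed names, not the exact GCN.py code):

```python
import torch
import torch.nn.functional as F

def build_adjacency(qx, kx):
    # qx, kx: (N, D) projected proposal features.
    dot_mat = qx @ kx.t()                              # pairwise similarity scores, shape (N, N)
    adj = F.normalize(dot_mat.square(), p=1, dim=-1)   # row-wise L1 normalization sparsifies weak links
    return adj
```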

kinredon commented 2 years ago

@Vibashan Yes, the L1 norm provides sparsity, but dot_mat.square() will destroy the relationship. For example, if proposals i and j have similarity -2 while proposals k and v have similarity 2, their adj values become the same, i.e., 4, after squaring.

Vibashan commented 2 years ago

Hi @kinredon , I am not able to understand your question. Can you please explain it a bit more?

kinredon commented 2 years ago

@Vibashan The dot_mat is the matmul of qx and kx:

https://github.com/Vibashan/irg-sfda/blob/5a640001cfc38ef3ac416935677a5b15d3918602/detectron2/modeling/meta_arch/GCN.py#L77

The value of dot_mat represents their relationship (similarity), but dot_mat.square() destroys that relationship. Suppose there are vectors v1 = [-1, -1], v2 = [1, 1], v3 = [1, 1]. The dot_mat values for v1 & v3 and v2 & v3 are -2 and 2, respectively. Obviously, v2 and v3 are quite similar with a strong relationship, while v1 and v3 point in opposite directions. However, dot_mat.square() makes the similarity of v1 & v3 and v2 & v3 both 4, which destroys the relationship. Hopefully this clarifies my question.
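A quick numerical check of this example (a standalone snippet, not from the repo):

```python
import torch

v1 = torch.tensor([-1., -1.])
v2 = torch.tensor([1., 1.])
v3 = torch.tensor([1., 1.])

print(v1 @ v3, v2 @ v3)                        # tensor(-2.) tensor(2.): opposite vs. similar
print((v1 @ v3).square(), (v2 @ v3).square())  # tensor(4.) tensor(4.): the sign is lost after squaring
```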

Vibashan commented 2 years ago

Hi @kinredon, I am sorry for the delayed response; I had an exam and was caught up with some work. For a given anchor/proposal, we utilize the IRG model to mine the corresponding positive proposals for contrastive learning. By performing dot_mat.square(), the model is constrained to learn a better correlation that differentiates positive proposals from negative ones. Thus the negative proposal similarity scores are pushed toward zero and the positive proposal similarity scores toward one.