facebookresearch / adaptive_teacher

This repo provides the source code for "Cross-Domain Adaptive Teacher for Object Detection".

The experiments of vgg16 without BN layers? #16

Open kinredon opened 2 years ago

kinredon commented 2 years ago

Hi, thanks for your awesome project!

When I dug into the details of adaptive_teacher, I found that the vgg16 backbone has BN layers by default.

https://github.com/facebookresearch/adaptive_teacher/blob/cba3c59cadfc9f1a3a676a82bf63d76579ab552b/adapteacher/modeling/meta_arch/vgg.py#L62

As far as I know, the vgg16 backbone used in prior cross-domain detection work does not contain BN layers. Adding BN improves the cross-domain detection baseline, which makes the comparison unfair, since previous works use the plain vgg16 backbone without BN layers. I also observed that the proposed method outperforms previous methods by a large margin. Did the authors run experiments with vgg16 without BN layers?
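For context, the difference is easy to verify with torchvision's two VGG-16 variants. The sketch below (assuming torchvision >= 0.13 for the weights= argument) is only an illustration, not the repo's backbone builder:

```python
# Illustration only (not the repo's vgg.py): torchvision ships both VGG-16
# variants, so the presence or absence of BN layers is easy to check.
import torchvision.models as models
from torch import nn

vgg_plain = models.vgg16(weights=None)   # conv + ReLU only, as in prior DA detection work
vgg_bn = models.vgg16_bn(weights=None)   # conv + BatchNorm2d + ReLU

def count_bn(model):
    return sum(isinstance(layer, nn.BatchNorm2d) for layer in model.features)

print("BN layers in vgg16:   ", count_bn(vgg_plain))  # 0
print("BN layers in vgg16_bn:", count_bn(vgg_bn))     # 13
```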

Thanks a lot!

yujheli commented 2 years ago

Hi @kinredon. Thank you for raising the batch normalization issue. In our experiments we found that batch normalization does not make much difference (1~2%) in this setting, but we will add results without BN to the readme soon.

We also see large improvements from our framework with Resnet-101 on the other setting. In addition, we found that setting the unsupervised loss weight to 0.5 or 0.25 brings at least ~3% more gain than the results reported in the paper. That is, the effectiveness of our framework is worth further study.

kinredon commented 2 years ago

@yujheli Thanks for your reply. Yes, I see the large improvements on the Clipart and Watercolor settings with Resnet-101 as the backbone. Since previous works use vgg16 without BN layers, it is important to keep the same backbone for a fair comparison. If possible, revising the paper accordingly would also be worthwhile.

Besides, I also have some questions about the implementation:

  1. Why are the instance distributions different between Cityscapes and Foggy Cityscapes? [image]
  2. Did you run the experiment with a smaller batch size? 8 V100 GPUs are beyond my resources.

kinredon commented 2 years ago

@yujheli There is a small bug in: https://github.com/facebookresearch/adaptive_teacher/blob/cba3c59cadfc9f1a3a676a82bf63d76579ab552b/adapteacher/modeling/meta_arch/rcnn.py#L246

The supervised_target branch is never executed. It does not affect the results, though, because the adversarial loss weight is quite small.
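To make the pattern concrete, here is a hypothetical sketch (generic names, not the actual rcnn.py code):

```python
# Hypothetical sketch of the dead-branch pattern, not the repo's forward():
# if every caller only passes branch="supervised", the elif below is never
# reached, so the target-side discriminator loss it returns is never used.
def forward(batched_inputs, branch="supervised"):
    if branch == "supervised":
        return {"loss_cls": 0.0, "loss_box_reg": 0.0, "loss_dis_source": 0.0}
    elif branch == "supervised_target":  # dead unless some caller passes this string
        return {"loss_dis_target": 0.0}
```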

kinredon commented 2 years ago

@yujheli Figure 2 in the paper shows that the student network takes strongly augmented data for the adversarial training. However, in the code:

https://github.com/facebookresearch/adaptive_teacher/blob/cba3c59cadfc9f1a3a676a82bf63d76579ab552b/adapteacher/engine/trainer.py#L612

The adversarial loss is computed on weakly augmented images, not the strongly augmented images described in the paper.
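Schematically, the discrepancy looks like this (hypothetical names, not the actual trainer code):

```python
# Schematic only (hypothetical names, not the repo's trainer.py): it contrasts
# feeding the domain discriminator the weak view vs. the strong view of the
# target images, which is the discrepancy described above.
def target_discriminator_loss(model, weak_images, strong_images, use_weak=True):
    # Current behaviour per the line linked above: the weak view is used.
    # Figure 2 of the paper suggests the strong view instead.
    images = weak_images if use_weak else strong_images
    return model(images, branch="domain")
```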

tmp12316 commented 2 years ago

I am also looking forward to the results without BN.

I noticed the very high source-only results earlier, and I wonder whether the updated BN layers (with the very large batch size) caused them?

https://github.com/facebookresearch/adaptive_teacher/issues/4#issue-1229286700

kinredon commented 2 years ago

@tmp12316 I think there are several reasons for the high source-only results:

  1. As you mentioned, the model with BN may benefit from the large batch size.
  2. Detectron2-based implementations usually show better results, as mentioned in https://github.com/facebookresearch/adaptive_teacher/issues/4#issuecomment-1121206200
  3. I note that the model is trained on both strongly and weakly augmented images during the burn-in stage.

tmp12316 commented 2 years ago

@tmp12316 I think there are several reasons for the high source-only results:

  1. As you mentioned, the model with BN may benefit from the large batch size.
  2. Detectron2-based implementations usually show better results, as mentioned in Question about Figure 4 #4 (comment)
  3. I note that the model is trained on both strongly and weakly augmented images during the burn-in stage.

Thanks! I agree with you.

yujheli commented 2 years ago

@kinredon Thank you for your patience. I swept the parameters (12 experiments, each on 8 GPUs, so around 100 V100 GPUs in total) for the network without BN and still get comparable performance (the best run reaches 50 mAP@50).

[image]

I will answer the other questions later.

kinredon commented 2 years ago

@yujheli Thanks for your reply. I also tried to reproduce the results with a vgg16 that does not contain BN layers, but I hit an error when the code enters the mutual learning stage:

[image]

I followed the default config and only removed the BN layers by setting batch_norm=False:

https://github.com/facebookresearch/adaptive_teacher/blob/cba3c59cadfc9f1a3a676a82bf63d76579ab552b/adapteacher/modeling/meta_arch/vgg.py#L62

I think this may be caused by gradient explosion, so I added gradient clipping, but the error above still occurs. Do you have any suggestions, or could you share the config for your sweep?
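For reference, the clipping I added was along these lines (standard detectron2 SOLVER.CLIP_GRADIENTS keys; whether the custom trainer here routes its optimizer through detectron2's maybe_add_gradient_clipping is something I have not verified):

```python
# Standard detectron2 gradient-clipping config keys (sketch; this only takes
# effect if the trainer builds its optimizer via detectron2's build_optimizer,
# which applies maybe_add_gradient_clipping).
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.SOLVER.CLIP_GRADIENTS.ENABLED = True
cfg.SOLVER.CLIP_GRADIENTS.CLIP_TYPE = "norm"  # clip by total gradient norm
cfg.SOLVER.CLIP_GRADIENTS.CLIP_VALUE = 1.0
cfg.SOLVER.CLIP_GRADIENTS.NORM_TYPE = 2.0
```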

yujheli commented 2 years ago

@kinredon Sorry for the late reply. You cannot simply set batch_norm to False for VGG16, because the architecture is hard-coded to use batch normalization. I ran the sweep with my internal code and will update this version later. The adversarial loss uses weak augmentation because we want the discriminator to focus on the domain shift rather than be affected by strong augmentation. The small bug is there to stabilize a distributed-training issue (you can find it in the issues section); my original code did not have this bug.
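For reference, a feature builder that actually switches on the flag would follow the standard torchvision pattern, roughly like this sketch (not the current vgg.py):

```python
# Sketch of a VGG-16 feature builder that genuinely respects a batch_norm flag
# (standard torchvision-style make_layers; not the repo's vgg.py).
import torch.nn as nn

VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]

def make_layers(cfg, batch_norm=False, in_channels=3):
    layers = []
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers.append(nn.Conv2d(in_channels, v, kernel_size=3, padding=1))
            if batch_norm:
                layers.append(nn.BatchNorm2d(v))
            layers.append(nn.ReLU(inplace=True))
            in_channels = v
    return nn.Sequential(*layers)

features = make_layers(VGG16_CFG, batch_norm=False)  # plain VGG-16, no BN
```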

kinredon commented 2 years ago

@yujheli Thanks for your reply. I am looking forward to seeing your update.

kinredon commented 2 years ago

@yujheli Hi, I find that the number of target samples is larger than the number of source samples for Cityscapes to Foggy Cityscapes. I guess you use all the foggy data instead of only the worst fog level, as many previous works do. There are three fog levels in Foggy Cityscapes, i.e., 0.005, 0.01, and 0.02. Existing works (DA-Faster RCNN, Strong-Weak Distribution Alignment, and so on) use only the worst fog level (i.e., 0.02) rather than the whole foggy dataset.
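For what it's worth, restricting to the worst fog level is just a file-name filter, assuming the standard Foggy Cityscapes naming (..._leftImg8bit_foggy_beta_0.02.png); the dataset root below is only an example, and the repo's dataset registration may consume the list differently:

```python
# Sketch: keep only the beta=0.02 images of Foggy Cityscapes, assuming the
# standard "..._leftImg8bit_foggy_beta_0.02.png" naming. The root path is an
# example; adapt it to wherever the dataset is registered.
from pathlib import Path

def worst_fog_only(root, beta="0.02"):
    suffix = f"_leftImg8bit_foggy_beta_{beta}.png"
    return sorted(p for p in Path(root).rglob("*.png") if p.name.endswith(suffix))

files = worst_fog_only("datasets/cityscapes_foggy/leftImg8bit_foggy/train")
print(len(files))  # 2975 for the train split if the full dataset is present
```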

yujheli commented 2 years ago

@kinredon Thank you for pointing this out, and sorry for the late reply. I have been busy, but I will report the results using only 0.02 soon.

onkarkris commented 2 years ago

@yujheli Thanks for your reply. Yes, I see the large improvements on the Clipart and Watercolor settings with Resnet-101 as the backbone. Since previous works use vgg16 without BN layers, it is important to keep the same backbone for a fair comparison. If possible, revising the paper accordingly would also be worthwhile.

Besides, I also have some questions about the implementation:

  1. Why are the instance distributions different between Cityscapes and Foggy Cityscapes? [image]
  2. Did you run the experiment with a smaller batch size? 8 V100 GPUs are beyond my resources.

@kinredon Were you able to reproduce the Cityscapes → Foggy Cityscapes results with a smaller batch size on fewer GPUs? If so, could you please share your config here? I only get 41% with batch size 4 on 4 GPUs.

kinredon commented 2 years ago

@onkarkris With a small batch size I also cannot reproduce the results. Someone suggested accumulating gradients to emulate a larger batch size.
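In case it helps, the usual accumulation trick looks roughly like this generic PyTorch sketch (not the repo's trainer); note that it does not change BatchNorm statistics, which still only see the small per-step batch:

```python
# Generic gradient-accumulation sketch (not the repo's trainer): with a
# per-step batch of 4 and accum_steps=4, each weight update sees an effective
# batch of 16. BatchNorm running stats are NOT affected by this trick; they
# still see only the small per-step batch.
def train_with_accumulation(model, optimizer, data_loader, accum_steps=4):
    optimizer.zero_grad()
    for it, batch in enumerate(data_loader):
        loss_dict = model(batch)           # detectron2-style dict of losses
        losses = sum(loss_dict.values())
        (losses / accum_steps).backward()  # average gradients over the window
        if (it + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```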

alvinti commented 1 year ago

@yujheli Hi, I find that the number of target samples is larger than the number of source samples for Cityscapes to Foggy Cityscapes. I guess you use all the foggy data instead of only the worst fog level, as many previous works do. There are three fog levels in Foggy Cityscapes, i.e., 0.005, 0.01, and 0.02. Existing works (DA-Faster RCNN, Strong-Weak Distribution Alignment, and so on) use only the worst fog level (i.e., 0.02) rather than the whole foggy dataset.

Hi! I'm wondering whether you have trained the model using only the worst fog level (0.02). If so, could you please share the AP50 of that result?