Closed rohit-gupta closed 1 year ago
We follow the implementations of TimeSformer and ViViT here. As I remember, disabling this dropout caused a small performance drop in our ablation. My guess is that the newly added module overfits much more easily, hence the need for extra dropout.
In the STAN head code, a dropout layer is applied to the residual branch coming in from CLIP. Is there any particular reason for this?
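For readers who want to see the pattern under discussion, here is a minimal PyTorch sketch of dropout applied to a residual branch before it is added to a new module's output. It is not the actual STAN code: the class name `ResidualWithDropout`, the `nn.Linear` stand-in for the added module, and the dropout rate are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ResidualWithDropout(nn.Module):
    """Hypothetical block: dropout on the incoming (residual) features,
    then summed with the output of a newly added module."""

    def __init__(self, dim: int, p: float = 0.1):
        super().__init__()
        # Stand-in for the newly added module (assumption, not the STAN layer).
        self.new_module = nn.Linear(dim, dim)
        # Dropout applied to the residual branch only.
        self.residual_dropout = nn.Dropout(p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x plays the role of the features coming in from the backbone.
        return self.residual_dropout(x) + self.new_module(x)


if __name__ == "__main__":
    block = ResidualWithDropout(dim=8, p=0.5)
    x = torch.randn(2, 4, 8)

    block.train()   # dropout active: residual entries are randomly zeroed
    y_train = block(x)

    block.eval()    # dropout is the identity at inference time
    y_eval = block(x)
    print(y_train.shape, y_eval.shape)
```

Since `nn.Dropout` is the identity in eval mode, the extra regularization only affects training, which fits the overfitting explanation given in the reply.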