Hi!
You can take a look at this paper: https://arxiv.org/abs/1907.11065. The idea is to drop some of the attention weights obtained from self-attention, in the same way standard dropout drops features, thus regularizing the training.
Chiara
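
A minimal sketch of that idea in PyTorch (this is not the repository's code; the function name, tensor shapes, and the 0.5 drop probability are only for illustration):

```python
import torch
import torch.nn.functional as F

def self_attention_with_weight_drop(q, k, v, drop_prob=0.5, training=True):
    """q, k, v: (batch, heads, nodes, channels)."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    weights = F.softmax(scores, dim=-1)          # (batch, heads, nodes, nodes)
    if training and drop_prob > 0:
        # Randomly zero attention weights, analogous to standard dropout
        # on features (no 1/(1-p) rescaling, mirroring the snippet in the
        # question below).
        keep = torch.bernoulli(torch.full_like(weights, 1.0 - drop_prob))
        weights = weights * keep
    return weights @ v
```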
Hi,
In spatial_transformer.py, line 131:

```python
if (self.drop_connect and self.training):
    mask = torch.bernoulli((0.5) * torch.ones(B * self.Nh * V, device))
    mask = mask.reshape(B, self.Nh, V).unsqueeze(2).expand(B, self.Nh, V, V)
    weights = weights * mask
```

Why does multiplying the weights by the mask implement drop connect and help avoid overfitting?
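
For what it's worth, here is a small standalone demo of what that mask does, assuming `weights` has shape `(B, Nh, V, V)` with the last dimension indexing the joints being attended to (toy shapes and random values, not the repository's tensors). The mask is drawn once per (batch, head, joint) and then expanded over the query dimension, so the multiplication zeroes entire columns of the attention map: the value vectors of the dropped joints never reach the output on that forward pass. Randomly severing these attention connections during training is the drop-connect regularization, since the model cannot rely on any single joint-to-joint link:

```python
import torch

torch.manual_seed(0)
B, Nh, V = 1, 1, 5                       # batch, heads, joints (toy sizes)
weights = torch.softmax(torch.randn(B, Nh, V, V), dim=-1)

# Same mask construction as in the snippet above.
mask = torch.bernoulli(0.5 * torch.ones(B * Nh * V))
mask = mask.reshape(B, Nh, V).unsqueeze(2).expand(B, Nh, V, V)

print(mask[0, 0])              # identical rows: whole columns are kept or dropped
print((weights * mask)[0, 0])  # dropped columns are zero for every query joint
```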