facebookresearch / moco

PyTorch implementation of MoCo: https://arxiv.org/abs/1911.05722
MIT License

Does MoCo collapse under simpler augmentation? #79

Open a411919924 opened 3 years ago

a411919924 commented 3 years ago

Recently I have been applying your implementation to simpler datasets such as CIFAR10. Strong augmentation hurts MoCo's convergence on a smaller dataset, so I simplified the augmentation when training MoCo on CIFAR10: transforms.RandomCrop(32, padding=4), transforms.RandomHorizontalFlip(), transforms.ToTensor(), normalize.
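For reference, a minimal sketch of that simplified pipeline (the CIFAR10 normalization statistics are an assumption, since they are not given above; in this repo the transform would still be wrapped in `TwoCropsTransform` so each image yields two views):

```python
import torchvision.transforms as transforms
from moco.loader import TwoCropsTransform  # provided by this repo

# CIFAR10 mean/std are assumed here; they are not specified in the comment above.
normalize = transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
                                 std=[0.2470, 0.2435, 0.2616])

simple_aug = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])

# Each sample still needs two augmented views for the query/key encoders.
train_transform = TwoCropsTransform(simple_aug)
```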

Under this simpler setting, the training top-1 and top-5 accuracies go up to nearly 100% by about the 20th epoch. The other hyperparameters are almost unchanged, except: batch size = 512, lr = 0.015, arch = ResNet-18, tau = 0.1, k = 4096.
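If it helps to reproduce, these settings would map onto the repo's training script roughly as follows; this is a hypothetical invocation, since main_moco.py targets ImageNet-style folders and needs small modifications for CIFAR10's resolution and loader:

```bash
python main_moco.py \
  -a resnet18 \
  --lr 0.015 --batch-size 512 \
  --moco-t 0.1 --moco-k 4096 \
  --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 \
  [your cifar10-folder with train and val folders]
```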

Is this a sign of collapse?

Thanks.

triangleCZH commented 3 years ago

Hi, I am just curious whether it is normal to see collapse in these approaches. I thought contrastive learning would naturally avoid collapse thanks to the negative samples?

a411919924 commented 3 years ago

I think these approaches are less likely to suffer from collapse compared to other models such as GANs. The InfoNCE loss pulls an input's representation toward that of its other augmented view while pushing it away from the views of other inputs. Hence, the latent representations are encouraged to be scattered across a high-dimensional latent space.
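For concreteness, here is a minimal sketch of the InfoNCE loss as used in MoCo (following the pseudocode in the paper; tensor names are illustrative). The positive pair supplies the single correct "class", and all queue entries act as negatives:

```python
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, queue, tau=0.1):
    """q: (N, C) queries, k_pos: (N, C) positive keys, queue: (C, K) negatives.
    All features are assumed to be L2-normalized."""
    l_pos = torch.einsum('nc,nc->n', q, k_pos).unsqueeze(-1)  # (N, 1) similarity to own positive
    l_neg = torch.einsum('nc,ck->nk', q, queue)               # (N, K) similarities to negatives
    logits = torch.cat([l_pos, l_neg], dim=1) / tau           # (N, 1+K)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=q.device)  # positive is index 0
    return F.cross_entropy(logits, labels)
```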

Horizon2333 commented 3 years ago

> I think these approaches are less likely to suffer from collapse compared to other models such as GANs. The InfoNCE loss pulls an input's representation toward that of its other augmented view while pushing it away from the views of other inputs. Hence, the latent representations are encouraged to be scattered across a high-dimensional latent space.

I think you are right. When the data augmentation is simple, it cannot create enough variation between the different views of each instance. So the trained model can push every instance far away from the others in latent space by over-fitting to them, and thus gets very high training (contrastive) accuracy.
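One way to tell over-fitting to instances apart from true collapse is to look at the spread of the normalized features. A hypothetical diagnostic (not part of this repo) is sketched below, assuming an `encoder` and a CIFAR10 `loader`: a per-dimension std near zero suggests collapsed features, while a value close to 1/sqrt(dim) suggests they are well spread over the unit sphere.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def feature_spread(encoder, loader, device='cuda'):
    """Hypothetical collapse check: per-dimension std of L2-normalized features."""
    feats = []
    for images, _ in loader:
        feats.append(F.normalize(encoder(images.to(device)), dim=1).cpu())
    feats = torch.cat(feats)                 # (num_samples, dim)
    std = feats.std(dim=0).mean().item()     # ~0 indicates collapse
    reference = 1.0 / feats.size(1) ** 0.5   # expected value for features uniform on the sphere
    return std, reference
```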

Euphoria16 commented 2 years ago

Did you test the downstream classification accuracy?