KaimingHe opened 4 years ago
Hi,
Thanks for your comment! Which specific result are you referring to? Or are you suggesting that an EMA of Z could potentially improve all InsDis, MoCo and CMC with NCE loss?
You reported a low number for MoCo with the NCE loss. This is because your implementation of NCE is problematic; correcting it should give a more reasonable MoCo w/ NCE number.
@KaimingHe, yeah, the current NCE implementation is probably less suitable for MoCo, and I am happy to rectify it. What momentum multiplier for updating Z would you suggest?
0.99 for updating Z works well. In ImageNet-1K, MoCo with NCE is ~2% worse than MoCo with InfoNCE, similar to the case of the memory bank counterpart.
Thanks for your input! I have temporarily removed the NCE numbers from the README to avoid any confusion, and will leave them blank until I get a chance to look into it.
Is it necessary to fix or EMA-update `Z`? Maybe it is unstable if we recompute `Z = out.mean() * self.outputSize` on every batch? Also, I couldn't find any statement about this approximation of `Z` in the paper, or maybe I missed it. Could you point me to a reference for this?
Later I found the statement in InsDis: "Empirically, we find the approximation derived from initial batches sufficient to work well in practice."
https://github.com/HobbitLong/CMC/blob/0f72b18a99e35bf2c2f0001656c2b33365b50cf6/NCE/NCEAverage.py#L189
This one-time estimation is problematic, especially if the dictionary is not random noise. Computing Z as a moving average of the per-batch estimate should give a more reasonable result.
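Putting the thread's suggestion together, here is a minimal sketch in plain Python (not the repo's `NCEAverage`; the function and argument names are illustrative) of replacing the one-time estimate of Z with a momentum-0.99 moving average:

```python
def update_Z(Z, out_mean, output_size, momentum=0.99):
    """Running estimate of the NCE normalization constant Z.

    Z           -- current estimate, or None before the first batch
    out_mean    -- mean of the exp(similarity) scores for this batch
    output_size -- number of instances (the n in the NCE approximation)
    """
    z_batch = out_mean * output_size   # per-batch estimate, as in the linked code
    if Z is None:
        # original behavior: the first batch alone fixes Z for all of training
        return z_batch
    # suggested fix: blend in every batch's estimate with momentum 0.99
    return momentum * Z + (1.0 - momentum) * z_batch


# usage: Z starts uninitialized and is refreshed on each batch
Z = None
for out_mean in (1.0, 2.0):            # toy batch statistics
    Z = update_Z(Z, out_mean, output_size=10)
```

In a PyTorch module, Z would typically live in a non-parameter buffer so it is checkpointed but not trained.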