HobbitLong / CMC

[arXiv 2019] "Contrastive Multiview Coding", also contains implementations for MoCo and InstDis
BSD 2-Clause "Simplified" License

Using dot product as a proxy for probability in NCEAverage #48

Closed vinsis closed 4 years ago

vinsis commented 4 years ago

Hi, it seems that you are using the dot product between vectors from two views as a proxy for the unknown distribution denoted as p_d in your paper here. In other words, your h_θ is the dot product. Theoretically any h_θ can work, so it's all good.

But doesn't it force the two representations to be similar? I understand the two representations should have high mutual information. But that is not the same as having the two vectors point in similar directions.

Obviously it worked out pretty well. But do you think a parameterized NCEAverage loss would have allowed for representations that point in less similar directions while still having high MI?
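For concreteness, here is a minimal sketch of what I mean by the dot product playing the role of h_θ. The names (dot_product_scores, contrastive_loss, the temperature value) are my own illustration of an in-batch InfoNCE-style objective, not the repo's memory-bank NCEAverage:

    import torch
    import torch.nn.functional as F

    # Sketch: score every pair of view embeddings with a (temperature-scaled)
    # dot product, then treat the matching pair as the positive among the
    # in-batch negatives.
    def dot_product_scores(z1, z2, temperature=0.07):
        # z1, z2: (batch, dim) embeddings from the two views
        z1 = F.normalize(z1, dim=1)
        z2 = F.normalize(z2, dim=1)
        return z1 @ z2.t() / temperature   # (batch, batch) score matrix

    def contrastive_loss(z1, z2):
        logits = dot_product_scores(z1, z2)
        targets = torch.arange(z1.size(0), device=z1.device)  # positives lie on the diagonal
        return F.cross_entropy(logits, targets)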

Thank you again!

HobbitLong commented 4 years ago

Hi, @vinsis ,

For the ImageNet experiment, there is a linear projection layer between the representation and the contrastive loss.

You are absolutely right! MI is not the same as similar directions. But here is how I think about it: you are maximizing mutual information between the representations before the projection. The projection is similar to a reparameterization, and it works in such a way that the inner product can estimate MI (though the estimate is almost certainly biased).
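Roughly, the setup I have in mind looks like the sketch below (hypothetical names and dimensions, not code from this repo): the representation whose MI matters is the backbone output, and a linear projection maps it into the space where the dot product is taken.

    import torch.nn as nn

    # Hedged sketch: h is the representation being maximized for MI and kept
    # for downstream evaluation; z is the projected embedding seen by the
    # dot-product contrastive loss.
    class EncoderWithProjection(nn.Module):
        def __init__(self, backbone, backbone_dim=2048, feat_dim=128):
            super().__init__()
            self.backbone = backbone                        # e.g. a CNN trunk
            self.projection = nn.Linear(backbone_dim, feat_dim)

        def forward(self, x):
            h = self.backbone(x)      # representation kept for linear evaluation
            z = self.projection(h)    # embedding fed to the contrastive loss
            return h, z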

vinsis commented 4 years ago

I am sorry, I am not sure I understand what you mean by a linear projection layer between the representation and the contrastive loss. Do you mean something like the snippet below in the model specification?

        # projection head: maps the 2048-d (4096 // 2) backbone features to feat_dim
        self.fc8 = nn.Sequential(
            nn.Linear(4096 // 2, feat_dim)
        )

One more thing I noticed is that NCE learns a distribution by learning to classify samples as real or fake. In this case, a sample is the dot product between v1_i and v2_j. In other words, the classifier is trying to learn which dot products are real (i.e., come from the same image) and which ones are fake. We could extend this to any inner product space, not just the dot product, and possibly get more diverse representations while preserving high MI.
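For example, one could swap the plain dot product for a learned bilinear score. This is a hypothetical sketch, not code from this repo:

    import torch
    import torch.nn as nn

    # Learned bilinear score s(z1, z2) = z1^T W z2, which reduces to the plain
    # dot product when W is the identity. (It is a proper inner product only
    # when W is symmetric positive-definite.)
    class BilinearCritic(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.W = nn.Parameter(torch.eye(dim))   # start as the ordinary dot product

        def forward(self, z1, z2):
            # returns a (batch, batch) score matrix; the diagonal holds the "real" pairs
            return z1 @ self.W @ z2.t()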

HobbitLong commented 4 years ago

I originally meant this one. But yours is also a good example.

Though I don't know which inner product you want to try, I agree there should be more possibilities.

Should I close this?

vinsis commented 4 years ago

Thanks again @HobbitLong. Closing it now.