Using Noise-Contrastive Estimation (NCE) for Efficient Learning.
In this story, Learning Word Embeddings Efficiently with Noise-Contrastive Estimation (NCE), by DeepMind, is briefly reviewed.
In this paper, word embeddings are learned efficiently by training with Noise-Contrastive Estimation (NCE) instead of the expensive Softmax normalization. This is a paper in 2013 NeurIPS with over 600 citations. NCE is another basic concept of Contrastive Learning in self-supervised learning. The probability of a word $w$ given a context $h$ is modeled with a Softmax over scores $s_{\theta}(w,h)$:
$$ P_{\theta}^{h}(w)=\frac{\exp(s_{\theta}(w,h))}{\sum_{w'}\exp(s_{\theta}(w',h))}. $$
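As a rough illustration of why this is expensive, here is a minimal NumPy sketch (not from the paper; `vocab_size`, `word_vecs`, and the dot-product score are illustrative assumptions): the denominator requires a sum over the entire vocabulary for every prediction.

```python
import numpy as np

# Toy sizes; a real vocabulary is much larger, which is the whole problem.
vocab_size, dim = 10_000, 50
rng = np.random.default_rng(0)
word_vecs = rng.normal(size=(vocab_size, dim))   # target-word embeddings
context_vec = rng.normal(size=dim)               # context representation h

# s_theta(w', h) for every word w' in the vocabulary.
scores = word_vecs @ context_vec

# P_theta^h(w): the denominator sums exp(.) over *all* vocab_size words,
# so every maximum-likelihood update costs O(vocab_size).
probs = np.exp(scores - scores.max())            # subtract max for stability
probs /= probs.sum()
print(probs[42])                                 # P_theta^h(w = word 42)
```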
NCE is based on the reduction of density estimation to probabilistic binary classification.
The basic idea is to train a logistic regression classifier to discriminate between samples from the data distribution and samples from some “noise” distribution, based on the ratio of probabilities of the sample under the model and the noise distribution.
$$ P_{h;\theta}(D=1|w)=\frac{P_{h;\theta}(w)}{P_{h;\theta}(w)+kP_{n}(w)}=\sigma(\Delta s_{\theta}(w,h)), $$

where $\Delta s_{\theta}(w,h)=s_{\theta}(w,h)-\log(kP_{n}(w))$, $P_{n}$ is the noise distribution, and $k$ is the number of noise samples per observed word.
By using NCE, the summation in the denominator of the Softmax function can be skipped, because NCE does not require the model to be normalized.
NCE training time is linear in the number of noise samples and independent of the vocabulary size.
As we increase the number of noise samples $k$, this estimate approaches the likelihood gradient of the normalized model.
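Below is a minimal sketch of what one NCE objective evaluation looks like under these assumptions (the dot-product score, the uniform noise distribution `noise_probs`, and the toy sizes are all illustrative, not the paper's code): the observed word contributes one positive logistic term and the $k$ sampled noise words contribute $k$ negative terms, so the cost per example is linear in $k$ and never touches the full vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k = 10_000, 50, 25                  # k = number of noise samples
word_vecs = rng.normal(size=(vocab_size, dim))       # target-word embeddings
context_vec = rng.normal(size=dim)                   # context representation h
noise_probs = np.full(vocab_size, 1.0 / vocab_size)  # toy (uniform) P_n

def log_sigmoid(x):
    return -np.logaddexp(0.0, -x)                    # log sigma(x), numerically stable

def delta_s(w):
    """Delta s_theta(w, h) = s_theta(w, h) - log(k * P_n(w))."""
    return word_vecs[w] @ context_vec - np.log(k * noise_probs[w])

def nce_loss(observed_word):
    """Negative NCE objective for one (word, context) pair."""
    loss = -log_sigmoid(delta_s(observed_word))      # observed word -> D = 1
    noise_words = rng.choice(vocab_size, size=k, p=noise_probs)
    for x in noise_words:
        loss -= log_sigmoid(-delta_s(x))             # noise word -> D = 0
    return loss

print(nce_loss(observed_word=42))
```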
Experimental results (captions of the original tables):
- Accuracy in percent on word similarity tasks.
- Accuracy in percent on word similarity tasks for large models.
- Results for various models trained for 20 epochs on the 47M-word Gutenberg dataset using NCE5 with AdaGrad; (D) and (I) denote models with and without position-dependent weights, respectively.
- Accuracy on the MSR Sentence Completion Challenge dataset.
Another work, “Distributed Representations of Words and Phrases and their Compositionality” (2013 NeurIPS), proposes negative sampling. The idea is very similar to NCE and will be discussed later.
[2013 NeurIPS] [NCE] Learning Word Embeddings Efficiently with Noise-Contrastive Estimation
Sik-Ho Tsang. Review: Learning Word Embeddings Efficiently with Noise-Contrastive Estimation (NCE).