NorbertZheng / read-papers

My paper reading notes.

Sik-Ho Tsang | Review: Learning Word Embeddings Efficiently with Noise-Contrastive Estimation (NCE). #62

Closed NorbertZheng closed 1 year ago

NorbertZheng commented 1 year ago

Sik-Ho Tsang. Review: Learning Word Embeddings Efficiently with Noise-Contrastive Estimation (NCE).

NorbertZheng commented 1 year ago

Overview

Using Noise-Contrastive Estimation (NCE) for Efficient Learning.

In this story, Learning Word Embeddings Efficiently with Noise-Contrastive Estimation (NCE), by DeepMind, is briefly reviewed.

In this paper, word embeddings are learned with simple log-bilinear language models trained using Noise-Contrastive Estimation instead of the expensive softmax normalization.

This is a 2013 NeurIPS paper with over 600 citations. Noise-Contrastive Estimation (NCE) is another basic concept of Contrastive Learning in self-supervised learning.

NorbertZheng commented 1 year ago

Softmax in Word2Vec

$$ P_{\theta}^{h}(w)=\frac{\exp(s_{\theta}(w,h))}{\sum_{w'}\exp(s_{\theta}(w',h))}. $$
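To make the cost of this normalization concrete, here is a minimal NumPy sketch (illustrative code, not from the paper; the sizes and variable names are assumptions): evaluating $P_{\theta}^{h}(w)$ requires scoring and summing over the entire vocabulary for every context.

```python
import numpy as np

# Hypothetical sizes: a 100k-word vocabulary, 100-dimensional embeddings.
vocab_size, dim = 100_000, 100
rng = np.random.default_rng(0)
target_embeddings = rng.normal(size=(vocab_size, dim))  # q_w for every word w
context_repr = rng.normal(size=dim)                      # \hat{q}(h) for one context h

# s_theta(w, h) for every word w: one dot product per vocabulary entry.
scores = target_embeddings @ context_repr                # shape: (vocab_size,)

# Softmax normalization: the denominator sums over all |V| words,
# so every training step pays O(|V|) just to normalize.
log_Z = np.logaddexp.reduce(scores)
log_prob = scores - log_Z                                # log P_theta^h(w) for all w
```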

NorbertZheng commented 1 year ago

Noise-Contrastive Estimation (NCE)

Definition of Noise-Contrastive Estimation (NCE)

NCE is based on the reduction of density estimation to probabilistic binary classification.

The basic idea is to train a logistic regression classifier to discriminate between samples from the data distribution and samples from some “noise” distribution, based on the ratio of probabilities of the sample under the model and the noise distribution.

$$ P_{h;\theta}(D=1|w)=\frac{P_{h;\theta}(w)}{P_{h;\theta}(w)+kP_{n}(w)}=\sigma(\Delta s_{\theta}(w,h)), $$
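where, following the paper's definition, $\Delta s_{\theta}(w,h)$ is the difference between the model score and the log of the scaled noise probability:

$$ \Delta s_{\theta}(w,h)=s_{\theta}(w,h)-\log(kP_{n}(w)). $$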

By using NCE, the summation in the denominator of the softmax can be skipped, because NCE works with the unnormalized model directly.

NCE training time is linear in the number of noise samples and independent of the vocabulary size.

As we increase the number of noise samples $k$, this estimate approaches the likelihood gradient of the normalized model.
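To illustrate why training cost is linear in $k$, here is a minimal sketch of the per-example NCE objective (my own illustrative code, not the paper's implementation; the function and argument names are assumptions): one observed word plus $k$ noise samples are classified, so only $k+1$ scores are evaluated per context instead of $|V|$.

```python
import numpy as np

def nce_loss(score_fn, target_word, context, noise_words, noise_logprob, k):
    """Per-example NCE objective for one (context, target) pair.

    score_fn(w, h)   -- unnormalized model score s_theta(w, h)
    noise_logprob(w) -- log P_n(w) under the noise distribution
    noise_words      -- k samples drawn from P_n
    """
    def log_sigmoid(x):
        # Numerically stable log(sigma(x)).
        return -np.logaddexp(0.0, -x)

    # Data term: classify the observed word as coming from the data (D = 1).
    delta_data = score_fn(target_word, context) - (np.log(k) + noise_logprob(target_word))
    loss = -log_sigmoid(delta_data)

    # Noise terms: classify each of the k noise samples as noise (D = 0).
    for x in noise_words:
        delta_noise = score_fn(x, context) - (np.log(k) + noise_logprob(x))
        loss += -log_sigmoid(-delta_noise)  # log(1 - sigma(z)) = log(sigma(-z))

    return loss  # only k + 1 score evaluations, independent of |V|
```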


NorbertZheng commented 1 year ago

Log-Bilinear Language (LBL) Models
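A brief recap of the models used in the paper (paraphrased from the original formulation, details may differ slightly): the vector log-bilinear model (vLBL) predicts a representation for the target word from the context word vectors and scores a candidate word with a dot product,

$$ \hat{q}(h)=\sum_{i=1}^{n-1}c_{i}\odot q_{w_{i}}, \qquad s_{\theta}(w,h)=\hat{q}(h)^{\top}q_{w}+b_{w}, $$

where $q_{w_{i}}$ are the context word vectors, $c_{i}$ are position-dependent weight vectors ($\odot$ is element-wise multiplication), and $b_{w}$ is a per-word bias. Dropping the $c_{i}$ gives the position-independent variant, and ivLBL is the inverse model that predicts the context words from the current word.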

NorbertZheng commented 1 year ago

Experimental Results

Accuracy in percent on word similarity tasks.

Accuracy in percent on word similarity tasks for large models.

Results for various models trained for 20 epochs on the 47M-word Gutenberg dataset using NCE5 with AdaGrad ((D) and (I) denote models with and without position-dependent weights respectively).

Accuracy on the MSR Sentence Completion Challenge dataset.

NorbertZheng commented 1 year ago

Another work, Distributed Representations of Words and Phrases and their Compositionality (2013 NeurIPS), proposes negative sampling. The idea is very similar to NCE and will be talked about later.
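Roughly speaking (this comparison is my own note, not part of the reviewed paper), negative sampling keeps the same binary classification setup but drops the $\log(kP_{n}(w))$ correction term, modeling

$$ P(D=1|w,h)=\sigma(s_{\theta}(w,h)) $$

instead of $\sigma(\Delta s_{\theta}(w,h))$; this simplification no longer yields normalized probability estimates but works well for learning embeddings.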

NorbertZheng commented 1 year ago

Reference

[2013 NeurIPS] [NCE] Learning Word Embeddings Efficiently with Noise-Contrastive Estimation
