0. Paper
@inproceedings{athiwaratkun-wilson-2017-multimodal,
title = "Multimodal Word Distributions",
author = "Athiwaratkun, Ben and
Wilson, Andrew",
booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2017",
address = "Vancouver, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/P17-1151",
doi = "10.18653/v1/P17-1151",
pages = "1645--1656",
abstract = "Word embeddings provide point representations of words containing useful semantic information. We introduce multimodal word distributions formed from Gaussian mixtures, for multiple word meanings, entailment, and rich uncertainty information. To learn these distributions, we propose an energy-based max-margin objective. We show that the resulting approach captures uniquely expressive semantic information, and outperforms alternatives, such as word2vec skip-grams, and Gaussian embeddings, on benchmark datasets such as word similarity and entailment.",
}
1. What is it?
They propose a Gaussian mixture model for multi-prototype word embeddings (one Gaussian component per word sense).
2. What is amazing compared to previous works?
Previous methods have two problems:
point vectors cannot represent rich information such as uncertainty or entailment (e.g., hypernymy)
Gaussian embeddings (word2gauss) suffer from overly high variance when a word is polysemous, because a single Gaussian has to cover all of its senses
In this paper, they solve the above problems with a Gaussian mixture.
3. Where is the key to technologies and techniques
They define the distribution of a target word w as a Gaussian mixture, as follows:
p_w,i: the mixture probability of the i-th component (sense) of word w
mu_w,i: the mean vector of the i-th component (sense) of word w, initialized with word vectors
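In the paper's notation (Sigma_w,i denotes the covariance of the i-th component, and the mixture weights sum to one), the density is:

```latex
% Gaussian mixture density of word w over K components (paper's Eq. 1)
f_w(x) \;=\; \sum_{i=1}^{K} p_{w,i}\, \mathcal{N}\!\left(x;\, \mu_{w,i},\, \Sigma_{w,i}\right),
\qquad \sum_{i=1}^{K} p_{w,i} \;=\; 1
```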
Given a word w, a positive context word c, and a negative sample c', the energy-based max-margin objective is shown below:
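Reconstructed from the paper: the loss is a hinge (max-margin) ranking loss with margin m, and the energy E_theta is the expected likelihood kernel between the word's and the context's mixtures, which has a closed form:

```latex
% Max-margin ranking objective with margin m
L_\theta(w, c, c') \;=\; \max\!\big(0,\; m - \log E_\theta(w, c) + \log E_\theta(w, c')\big)

% Energy: expected likelihood kernel between the mixtures f (word) and g (context)
E(f, g) \;=\; \int f(x)\, g(x)\, dx
\;=\; \sum_{i=1}^{K}\sum_{j=1}^{K} p_i\, q_j\, e^{\xi_{i,j}},
\qquad
\xi_{i,j} \;=\; \log \mathcal{N}\!\big(0;\; \mu_{f,i} - \mu_{g,j},\; \Sigma_{f,i} + \Sigma_{g,j}\big)
```

A minimal NumPy sketch of this computation, assuming diagonal covariances; the function names and the (p, mu, var) tuple layout are my own illustration, not the authors' code:

```python
import numpy as np

def log_gaussian_at_zero(mean, var):
    """log N(0; mean, diag(var)) for a diagonal-covariance Gaussian."""
    d = mean.shape[-1]
    return -0.5 * (d * np.log(2 * np.pi)
                   + np.sum(np.log(var))
                   + np.sum(mean ** 2 / var))

def log_energy(word, context):
    """log E(f, g): log expected likelihood kernel of two diagonal Gaussian mixtures.

    Each argument is a tuple (p, mu, var) with shapes (K,), (K, d), (K, d).
    """
    (p, mu, var), (q, nu, tau) = word, context
    log_terms = np.array([[np.log(p[i]) + np.log(q[j])
                           + log_gaussian_at_zero(mu[i] - nu[j], var[i] + tau[j])
                           for j in range(len(q))]
                          for i in range(len(p))])
    m = log_terms.max()  # log-sum-exp for numerical stability
    return m + np.log(np.exp(log_terms - m).sum())

def max_margin_loss(word, pos_ctx, neg_ctx, margin=1.0):
    """Hinge loss: push log E(w, c) above log E(w, c') by at least `margin`."""
    return max(0.0, margin - log_energy(word, pos_ctx) + log_energy(word, neg_ctx))
```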
4. How did they evaluate it?
4.1 Nearest neighbors
From Table 1, their Gaussian mixture model (top) expresses each sense of a polysemous word more distinctly than the single Gaussian embedding (bottom).
4.2 Word similarity task
From Table 3, their model achieves higher performance on word similarity benchmarks than previous models such as word2vec skip-gram and Gaussian embeddings.
4.3 Word similarity in context (#202)
From Table 4, their model achieves results comparable to the previous WordNet-based method (#204).
5. Is there a discussion?
5.1 Gaussian embeddings vs. Gaussian mixture
They show that their Gaussian mixture model outperforms the Gaussian embedding.
They claim that modeling a polysemous word with only one distribution forces the single Gaussian embedding to have overly high variance, since one Gaussian must cover all of the word's senses.
6. Which paper should read next?