bmschmidt / wordVectors

An R package for creating and exploring word2vec and other word embedding models

Add function to align different models #18

Open bmschmidt opened 7 years ago

bmschmidt commented 7 years ago

This Stanford paper (Hamilton, Leskovec, and Jurafsky, "Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change", ACL 2016) describes the most promising method I've seen so far for aligning multiple different models; it would be a useful addition here.

> In order to compare word vectors from different time-periods we must ensure that the vectors are aligned to the same coordinate axes. Explicit PPMI vectors are naturally aligned, as each column simply corresponds to a context word. Low-dimensional embeddings will not be naturally aligned due to the non-unique nature of the SVD and the stochastic nature of SGNS. In particular, both these methods may result in arbitrary orthogonal transformations, which do not affect pairwise cosine-similarities within-years but will preclude comparison of the same word across time. Previous work circumvented this problem by either avoiding low-dimensional embeddings (e.g., Gulordava and Baroni, 2011; Jatowt and Duh, 2014) or by performing heuristic local alignments per word (Kulkarni et al., 2014). We use orthogonal Procrustes to align the learned low-dimensional embeddings. Defining $W^{(t)} \in \mathbb{R}^{d \times |V|}$ as the matrix of word embeddings learned at year $t$, we align across time-periods while preserving cosine similarities by optimizing:
>
> $$R^{(t)} = \arg\min_{Q^\top Q = I} \left\| W^{(t)} Q - W^{(t+1)} \right\|_F \tag{4}$$
>
> with $R^{(t)} \in \mathbb{R}^{d \times d}$. The solution corresponds to the best rotational alignment and can be obtained efficiently using an application of SVD (Schönemann, 1966).
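For concreteness, here is a minimal R sketch of that Procrustes step, assuming two embedding matrices with the same vocabulary in the same row order (rows as words, columns as dimensions). The function and argument names are hypothetical, not part of wordVectors:

```r
# Minimal sketch of orthogonal Procrustes alignment (Schönemann, 1966).
# Assumes `base` and `other` are numeric matrices with identical
# rownames (words) in identical order; rows = words, columns = dims.
# Hypothetical helper, not an existing wordVectors function.
align_procrustes <- function(base, other) {
  # SVD of the cross-covariance matrix M = t(other) %*% base
  s <- svd(crossprod(other, base))
  # The optimal rotation is Q = U V'
  Q <- s$u %*% t(s$v)
  # Rotate `other` into the coordinate system of `base`; cosine
  # similarities among the rows of `other` are left unchanged.
  other %*% Q
}

# Usage (hypothetical): align a later model to an earlier one, then
# compare the same word's vector across the two aligned spaces.
# m1950_aligned <- align_procrustes(m1900, m1950)
```

Because the rotation is orthogonal, within-model similarities are preserved exactly; only the coordinate axes are brought into agreement, which is what makes cross-period comparison meaningful.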

benmarwick commented 4 years ago

I'm looking at your intriguing blog post of 23 Dec 2016 and examining the plot from it, and it seems like you somehow solved this problem of tracing the movement of a word over time while holding the context words in a constant location.

[figure: plot from the blog post tracing the movement of "empire" over time against fixed context words]

Would it be possible to know roughly how you accomplished this for that figure? Did you replace empire with, for example, empire_1979-2010 in the text, and then generate a single embedding for the entire corpus including documents from all periods? Or did you generate embeddings for each period and somehow align them, etc.? Thank you.

benmarwick commented 4 years ago

The simplest approach I've found is the 'Temporal Referencing' method described in:

Haim Dubossarsky, Simon Hengchen, Nina Tahmasebi and Dominik Schlechtweg. 2019. Time-Out: Temporal Referencing for Robust Modeling of Lexical Semantic Change. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Florence, Italy. Association for Computational Linguistics.

[figure: illustration of temporal referencing from Dubossarsky et al. 2019]

That's giving me quite good results, so I'd recommend it.
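For anyone landing here later, a minimal sketch of the preprocessing that temporal referencing requires, assuming a character vector of documents and a parallel vector of period labels; all names here are hypothetical:

```r
# Minimal sketch of temporal referencing (Dubossarsky et al., 2019).
# `docs` is a character vector of documents, `periods` a parallel
# vector of period labels, and `targets` the words whose change we
# want to trace. All names are hypothetical.
temporal_reference <- function(docs, periods, targets) {
  mapply(function(doc, period) {
    for (w in targets) {
      # Tag each occurrence of a target word with its period,
      # e.g. "empire" -> "empire_1850"; context words are untouched.
      doc <- gsub(paste0("\\b", w, "\\b"), paste0(w, "_", period), doc)
    }
    doc
  }, docs, periods, USE.NAMES = FALSE)
}

# Train a single word2vec model on the tagged corpus; the period
# variants (empire_1850, empire_1900, ...) then share one coordinate
# space and can be compared directly, with no alignment step.
```

This is essentially the single-embedding approach asked about above: one model over the whole corpus, with only the target words period-tagged, so the context vocabulary stays fixed by construction.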