The skip-gram model assumes that a word can be used to generate its surrounding words in a text sequence. Take the text sequence “the”, “man”, “loves”, “his”, “son” as an example. Let us choose “loves” as the center word and set the context window size to 2. As shown in Fig. 14.1.1, given the center word “loves”, the skip-gram model considers the conditional probability for generating the context words: “the”, “man”, “his”, and “son”, which are no more than 2 words away from the center word:
𝑃("the","man","his","son"∣"loves").
Assume that the context words are independently generated given the center word (i.e., conditional independence). In this case, the above conditional probability can be rewritten as

$$P(\textrm{"the"} \mid \textrm{"loves"}) \cdot P(\textrm{"man"} \mid \textrm{"loves"}) \cdot P(\textrm{"his"} \mid \textrm{"loves"}) \cdot P(\textrm{"son"} \mid \textrm{"loves"}).$$
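As a quick illustration (a minimal sketch, not part of the quoted text), the loop below enumerates the (center, context) pairs that a window size of 2 produces for the example sentence; the center word "loves" yields exactly the four context words above.

sentence = ["the", "man", "loves", "his", "son"]
window = 2
for i, center in enumerate(sentence):
    # All words at most `window` positions away from the center word.
    contexts = [sentence[j]
                for j in range(max(0, i - window), min(len(sentence), i + window + 1))
                if j != i]
    print(center, "->", contexts)
# e.g., "loves" -> ['the', 'man', 'his', 'son']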
In the skip-gram model, each word has two $d$-dimensional vector representations for calculating conditional probabilities. More concretely, for any word with index $i$ in the dictionary, denote by $\mathbf{v}_i \in \mathbb{R}^d$ and $\mathbf{u}_i \in \mathbb{R}^d$ its two vectors when used as a center word and a context word, respectively.
The conditional probability of generating any context word $w_o$ (with index $o$ in the dictionary) given the center word $w_c$ (with index $c$ in the dictionary) can be modeled by a softmax operation on vector dot products:

$$P(w_o \mid w_c) = \frac{\exp(\mathbf{u}_o^\top \mathbf{v}_c)}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)},$$

where $\mathcal{V}$ denotes the set of word indices in the vocabulary.
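As a minimal sketch of this computation (the matrices and sizes here are toy values, not from the original), the full conditional distribution over the vocabulary is the softmax of the dot products between the center word's vector and every context vector:

import torch

vocab_size, embed_size = 10, 4            # toy sizes, for illustration only
V = torch.randn(vocab_size, embed_size)   # center-word vectors v_i
U = torch.randn(vocab_size, embed_size)   # context-word vectors u_i

c = 3                                     # index of the center word w_c
scores = U @ V[c]                         # u_i^T v_c for every word i
probs = torch.softmax(scores, dim=0)      # P(w_i | w_c) for every word i
print(probs.sum())                        # ~1.0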
After training, for any word with index $i$ in the dictionary, we obtain both word vectors $\mathbf{v}_i$ (as the center word) and $\mathbf{u}_i$ (as the context word). In natural language processing applications, the center word vectors of the skip-gram model are typically used as the word representations.
import torch

def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
    # center: (batch_size, 1); contexts_and_negatives: (batch_size, max_len)
    v = embed_v(center)                      # (batch_size, 1, embed_size)
    u = embed_u(contexts_and_negatives)      # (batch_size, max_len, embed_size)
    # Batched dot products of the center vector with each context/negative vector.
    pred = torch.bmm(v, u.permute(0, 2, 1))  # (batch_size, 1, max_len)
    return pred
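As a quick shape check (the vocabulary size and batch shapes below are assumed for illustration), the function returns one dot-product score per (center, context-or-negative) pair in a batch:

import torch
from torch import nn

# Toy embedding layers: 20 words in the vocabulary, 4-dimensional vectors.
embed_v = nn.Embedding(num_embeddings=20, embedding_dim=4)
embed_u = nn.Embedding(num_embeddings=20, embedding_dim=4)

center = torch.ones((2, 1), dtype=torch.long)                  # (batch_size, 1)
contexts_and_negatives = torch.ones((2, 6), dtype=torch.long)  # (batch_size, max_len)
print(skip_gram(center, contexts_and_negatives, embed_v, embed_u).shape)
# torch.Size([2, 1, 6])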
http://d2l.ai/chapter_natural-language-processing-pretraining/word2vec.html#the-skip-gram-model
or
Why do we train word2vec with the dot product as the similarity measure, but use cosine similarity after the model is trained?
https://stackoverflow.com/questions/54411020/why-use-cosine-similarity-in-word2vec-when-its-trained-using-dot-product-similar
Cosine similarity and the dot product are both similarity measures, but the dot product is magnitude-sensitive while cosine similarity is not. Depending on a word's occurrence count, its vector may have a large or small dot product with another word's vector. We normally normalize the vectors to remove this effect, so that all vectors have unit magnitude. If your downstream task needs occurrence counts as a feature, then the dot product may be the way to go; if you do not care about counts, you can simply compute the cosine similarity, which normalizes the vectors for you.
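A minimal sketch of the difference (the vectors here are made up for illustration): scaling a vector changes its dot product with another vector, but not their cosine similarity, because the cosine divides by the norms.

import torch
import torch.nn.functional as F

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([2.0, 0.0, 1.0])

print(torch.dot(a, b))                       # tensor(5.)
print(torch.dot(2 * a, b))                   # tensor(10.) -- grows with magnitude
print(F.cosine_similarity(a, b, dim=0))      # ~0.5976
print(F.cosine_similarity(2 * a, b, dim=0))  # same value -- scale-invariant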