[Open] bkj opened this issue 7 years ago
Hi @bkj,
You are correct: since the EOS token is added to all examples, it can be interpreted as a bias term. From my experiments, I would say that the presence or absence of this token does not have an important influence on the performance of the model. Practically, it is also useful to include because it guarantees there are no "empty sentences" (sentences without words, or with only out-of-vocabulary words).
What do you mean by separated? Before comparing input and output vectors, you should probably normalize them (so that they have an L2 norm equal to 1). Also, what loss are you using to train the model?
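For reference, the normalization step mentioned above can be sketched in a few lines of numpy (`vecs`, `l2_normalize`, and the toy matrix are illustrative names, not part of fastText):

```python
import numpy as np

def l2_normalize(vecs, eps=1e-12):
    """Scale each row to unit L2 norm, so that dot products between
    rows equal cosine similarities."""
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, eps)

vecs = np.array([[3.0, 4.0], [0.0, 2.0]])
nvecs = l2_normalize(vecs)
# every row of nvecs now has norm 1, e.g. nvecs[0] == [0.6, 0.8]
```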
By "separated" I mean that the mean word-word distance and mean label-label distance are smaller than the mean word-label distance. I can upload an illustrative plot in a few hours. I would've expected the distributions to be the same -- with the point cloud of word embeddings and label embeddings totally superimposed over one another.
I'm using cosine similarity (dot product of L2-normalized vectors). Does that seem reasonable to you?
I'm using the negative sampling (ns) loss, since that seemed the most reasonable for such a large number of classes. From prior experiments, I'd guess that softmax would be too slow, but I haven't really played w/ hs at all.
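For context, the negative-sampling objective scores the true label against a handful of randomly drawn labels instead of normalizing over all >1M classes. A rough numpy sketch of the per-example loss (names are illustrative, and how negatives are drawn is simplified relative to fastText's internals):

```python
import numpy as np

def ns_loss(hidden, label_vecs, pos_idx, neg_idx):
    """Negative-sampling loss: binary logistic loss pushing the true
    label's score up and each sampled negative's score down."""
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    pos_score = sigmoid(hidden @ label_vecs[pos_idx])       # want close to 1
    neg_scores = sigmoid(-(label_vecs[neg_idx] @ hidden))   # want close to 1
    return -np.log(pos_score) - np.log(neg_scores).sum()

rng = np.random.default_rng(0)
hidden = rng.normal(size=10)                  # averaged input word vectors
label_vecs = rng.normal(size=(1000, 10))      # output (label) embeddings
loss = ns_loss(hidden, label_vecs, pos_idx=3,
               neg_idx=rng.integers(0, 1000, size=5))
```

The cost per example scales with the number of negatives rather than the number of classes, which is why it stays tractable here while a full softmax would not.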
Here's a plot illustrating what I'm talking about.
Generated by
import numpy as np
import matplotlib.pyplot as plt

# sample 5000 rows each from the normalized word and label (user) matrices
word_sel = np.random.choice(nword_vec.shape[0], 5000)
user_sel = np.random.choice(nuser_vec.shape[0], 5000)

# pairwise cosine similarities (vectors are already L2-normalized)
word_sims = nword_vec[word_sel].dot(nword_vec[word_sel].T)
user_sims = nuser_vec[user_sel].dot(nuser_vec[user_sel].T)
cross_sims = nuser_vec[user_sel].dot(nword_vec[word_sel].T)

_ = plt.hist(word_sims.ravel(), 100, alpha=0.2, label='word-word')
_ = plt.hist(user_sims.ravel(), 100, alpha=0.2, label='user-user')
_ = plt.hist(cross_sims.ravel(), 100, alpha=0.2, label='word-user')
_ = plt.legend(loc='upper left')
_ = plt.xlabel('cosine sim')
_ = plt.ylabel('count')
show_plot()
where nuser_vec and nword_vec are L2-normalized label and word vectors, respectively. The fact that the mean cosine similarity between words and labels is < 0 indicates some difference in the locations of the point clouds. I guess this isn't particularly surprising if there's an offset -- I can modify the code to ignore EOS, re-run, and see what happens. Will report back when that's done, but in the meantime let me know if you think I'm misinterpreting anything here.
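One quick way to run the same comparison while ignoring EOS is to drop that row from the word matrix before normalizing. A minimal sketch, assuming `words` is the vocabulary list aligned with the rows of `word_vec`, and that the EOS token appears in the vocab as its literal string (e.g. "</s>" in fastText):

```python
import numpy as np

# illustrative vocab and embeddings; in practice these come from the model
words = ['</s>', 'sports', 'news', 'music']
word_vec = np.random.default_rng(0).normal(size=(4, 5))

# keep every row except the EOS token's
keep = [i for i, w in enumerate(words) if w != '</s>']
word_vec_no_eos = word_vec[keep]
# then L2-normalize word_vec_no_eos and recompute the similarity histograms
```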
@bkj How did you normalize the label and word vectors (numpy, etc.)? Did you try the test without the EOS char as well?
Hi All --
I was wondering whether someone could weigh in on whether the EOS character is necessary? It seems like it acts like a bias term -- a vector that is added to the representation of every input example. Does this empirically yield better performance?
The motivation for the question is this -- I'm using the supervised model w/ negative sampling to train a classifier w/ > 1M output classes. I want to use a word to query the output embeddings -- eg, enter the word "sports" and find the labels that are most similar to the word "sports". But ATM it seems like the input vectors (eg word vectors) and output vectors (eg label embeddings) are separated. I was wondering whether removing the <s> "bias term" would fix the issue. Any thoughts?
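For the word-to-label query itself, here is a minimal sketch, assuming `nword_vec` / `nlabel_vec` are the L2-normalized word and label matrices and `words` / `labels` are the aligned name lists (all names are illustrative):

```python
import numpy as np

def nearest_labels(query_word, words, nword_vec, labels, nlabel_vec, k=5):
    """Return the k labels whose embeddings have the highest cosine
    similarity to the query word's embedding (inputs pre-normalized)."""
    q = nword_vec[words.index(query_word)]
    sims = nlabel_vec @ q              # cosine sims, since rows are unit-norm
    top = np.argsort(-sims)[:k]
    return [(labels[i], float(sims[i])) for i in top]

# toy example with 2-d unit vectors
words = ['sports']
nword_vec = np.array([[1.0, 0.0]])
labels = ['__label__sports', '__label__music']
nlabel_vec = np.array([[1.0, 0.0], [0.0, 1.0]])
top = nearest_labels('sports', words, nword_vec, labels, nlabel_vec, k=2)
```

If the word and label point clouds really are offset from each other, the ranking this produces can still be usable even when the absolute similarities are all low or negative.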