[Open] bkj opened this issue 7 years ago
Hi @bkj,
You are correct: since the EOS token is added to all examples, it can be interpreted as a bias term. From my experiments, I would say that the presence or absence of this token does not have an important influence on the performance of the model. Practically, it is also useful to include because it guarantees there are no "empty sentences" (sentences without words, or with only out-of-vocabulary words).
What do you mean by separated? Before comparing input and output vectors, you should probably normalize them (so that they have an L2 norm equal to 1). Also, what loss are you using to train the model?
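For reference, the normalization step mentioned above can be sketched in a few lines of numpy (`vecs`, `l2_normalize`, and the toy matrix are illustrative names, not part of fastText):

```python
import numpy as np

def l2_normalize(vecs, eps=1e-12):
    """Scale each row to unit L2 norm, so that dot products between
    rows equal cosine similarities."""
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, eps)

vecs = np.array([[3.0, 4.0], [0.0, 2.0]])
nvecs = l2_normalize(vecs)
# every row of nvecs now has norm 1, e.g. nvecs[0] == [0.6, 0.8]
```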
By "separated" I mean that the mean word-word distance and mean label-label distance are smaller than the mean word-label distance. I can upload an illustrative plot in a few hours. I would've expected the distributions to be the same -- with the point cloud of word embeddings and label embeddings totally superimposed over one another.
I'm using cosine similarity (dot product of L2-normalized vectors). Does that seem reasonable to you?
I'm using the negative sampling (ns) loss, since that seemed the most reasonable for such a large number of classes. From prior experiments, I'd guess that softmax would be too slow, but I haven't really played w/ hs at all.
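For context, the negative-sampling objective scores the true label against a handful of randomly drawn labels instead of normalizing over all >1M classes. A rough numpy sketch of the per-example loss (names are illustrative, and how negatives are drawn is simplified relative to fastText's internals):

```python
import numpy as np

def ns_loss(hidden, label_vecs, pos_idx, neg_idx):
    """Negative-sampling loss: binary logistic loss pushing the true
    label's score up and each sampled negative's score down."""
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    pos_score = sigmoid(hidden @ label_vecs[pos_idx])       # want close to 1
    neg_scores = sigmoid(-(label_vecs[neg_idx] @ hidden))   # want close to 1
    return -np.log(pos_score) - np.log(neg_scores).sum()

rng = np.random.default_rng(0)
hidden = rng.normal(size=10)                  # averaged input word vectors
label_vecs = rng.normal(size=(1000, 10))      # output (label) embeddings
loss = ns_loss(hidden, label_vecs, pos_idx=3,
               neg_idx=rng.integers(0, 1000, size=5))
```

The cost per example scales with the number of negatives rather than the number of classes, which is why it stays tractable here while a full softmax would not.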
Here's a plot illustrating what I'm talking about.
Generated by
import numpy as np
import matplotlib.pyplot as plt

# sample 5000 rows each from the normalized word and label (user) matrices
word_sel = np.random.choice(nword_vec.shape[0], 5000)
user_sel = np.random.choice(nuser_vec.shape[0], 5000)

# pairwise cosine similarities (vectors are already L2-normalized)
word_sims = nword_vec[word_sel].dot(nword_vec[word_sel].T)
user_sims = nuser_vec[user_sel].dot(nuser_vec[user_sel].T)
cross_sims = nuser_vec[user_sel].dot(nword_vec[word_sel].T)

_ = plt.hist(word_sims.ravel(), 100, alpha=0.2, label='word-word')
_ = plt.hist(user_sims.ravel(), 100, alpha=0.2, label='user-user')
_ = plt.hist(cross_sims.ravel(), 100, alpha=0.2, label='word-user')
_ = plt.legend(loc='upper left')
_ = plt.xlabel('cosine sim')
_ = plt.ylabel('count')
show_plot()
where nuser_vec and nword_vec are L2-normalized label and word vectors, respectively. The fact that the mean cosine similarity between words and labels is < 0 indicates some difference in the locations of the point clouds. I guess this isn't particularly surprising if there's an offset -- I can modify the code to ignore EOS, re-run, and see what happens. Will report back when that's done, but in the meantime let me know if you think I'm misinterpreting anything here.
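One quick way to run the same comparison while ignoring EOS is to drop that row from the word matrix before normalizing. A minimal sketch, assuming `words` is the vocabulary list aligned with the rows of `word_vec`, and that the EOS token appears in the vocab as its literal string (e.g. "</s>" in fastText):

```python
import numpy as np

# illustrative vocab and embeddings; in practice these come from the model
words = ['</s>', 'sports', 'news', 'music']
word_vec = np.random.default_rng(0).normal(size=(4, 5))

# keep every row except the EOS token's
keep = [i for i, w in enumerate(words) if w != '</s>']
word_vec_no_eos = word_vec[keep]
# then L2-normalize word_vec_no_eos and recompute the similarity histograms
```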
@bkj How did you normalize the label and word vectors (numpy, etc.)? Did you try the test without the EOS char as well?
Hi All --
I was wondering whether someone could weigh in on whether the EOS character is necessary? It seems like it acts like a bias term -- a vector that is added to the representation of every input example. Does this empirically yield better performance?
The motivation for the question is this -- I'm using the supervised model w/ negative sampling to train a classifier w/ > 1M output classes. I want to use a word to query the output embeddings -- eg, enter the word "sports" and find the labels that are most similar to the word "sports". But ATM it seems like the input vectors (eg word vectors) and output vectors (eg label embeddings) are separated. I was wondering whether removing the <s> "bias term" would fix the issue. Any thoughts?
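For the word-to-label query itself, here is a minimal sketch, assuming `nword_vec` / `nlabel_vec` are the L2-normalized word and label matrices and `words` / `labels` are the aligned name lists (all names are illustrative):

```python
import numpy as np

def nearest_labels(query_word, words, nword_vec, labels, nlabel_vec, k=5):
    """Return the k labels whose embeddings have the highest cosine
    similarity to the query word's embedding (inputs pre-normalized)."""
    q = nword_vec[words.index(query_word)]
    sims = nlabel_vec @ q              # cosine sims, since rows are unit-norm
    top = np.argsort(-sims)[:k]
    return [(labels[i], float(sims[i])) for i in top]

# toy example with 2-d unit vectors
words = ['sports']
nword_vec = np.array([[1.0, 0.0]])
labels = ['__label__sports', '__label__music']
nlabel_vec = np.array([[1.0, 0.0], [0.0, 1.0]])
top = nearest_labels('sports', words, nword_vec, labels, nlabel_vec, k=2)
```

If the word and label point clouds really are offset from each other, the ranking this produces can still be usable even when the absolute similarities are all low or negative.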