epfml / sent2vec

General purpose unsupervised sentence representations

Add CLI to print both unigram embedding matrices #70

Closed jbcdnr closed 5 years ago

jbcdnr commented 5 years ago

Good sanity check indeed: the matrices are the same, but the C++ code prints the <PLACEHOLDER> token and its vector first, while the Python method returns it as the last row. I think index 0 is more consistent with other tokenizers. What do you think?
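
A check like this can be made independent of row order by aligning the two dumps on their tokens. The following is only a minimal sketch on toy data: `same_embeddings` and the sample vectors are hypothetical, not part of sent2vec, and it assumes each token appears exactly once in each dump.

```python
import math

def same_embeddings(tokens_a, rows_a, tokens_b, rows_b, tol=1e-6):
    """Return True if both dumps map every token to the same vector.

    Hypothetical helper for illustration; assumes tokens are unique.
    """
    if len(tokens_a) != len(rows_a) or len(tokens_b) != len(rows_b):
        raise ValueError("each token needs exactly one row")
    # Index rows by token so that row order no longer matters.
    by_token_a = dict(zip(tokens_a, rows_a))
    by_token_b = dict(zip(tokens_b, rows_b))
    if by_token_a.keys() != by_token_b.keys():
        return False
    return all(
        len(by_token_a[t]) == len(by_token_b[t])
        and all(
            math.isclose(x, y, abs_tol=tol)
            for x, y in zip(by_token_a[t], by_token_b[t])
        )
        for t in by_token_a
    )

# Toy data: identical vectors, placeholder first (C++ style) vs. last (Python style).
cpp_tokens = ["<PLACEHOLDER>", "the", "cat"]
cpp_rows = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]
py_tokens = ["the", "cat", "<PLACEHOLDER>"]
py_rows = [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]

print(same_embeddings(cpp_tokens, cpp_rows, py_tokens, py_rows))  # True
```

Comparing by token confirms that the values agree, but it deliberately hides the ordering difference, so the row order still has to be checked separately.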

mpagli commented 5 years ago

The Python code sorts the tokens by frequency before fetching the vectors. Because the <PLACEHOLDER> token is artificial, its frequency is set to 0, so it ends up at the end of the matrix. We could bring the two methods in sync by forcing the <PLACEHOLDER> token to be first:

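def get_unigram_embeddings(self):
    # Sort tokens by descending frequency.
    vocab = list(self.get_vocabulary().items())
    vocab.sort(key=lambda x: x[1], reverse=True)
    vocab = [w for w, c in vocab]
    # <PLACEHOLDER> has frequency 0, so it sorts last. Move it to index 0
    # by rotating it in; a swap would send the most frequent token to the end.
    if vocab and vocab[-1] == '<PLACEHOLDER>':
        vocab.insert(0, vocab.pop())
    return self.embed_sentences(vocab), vocab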
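
To make the intended ordering concrete, here is a small standalone sketch on a toy vocabulary. `order_vocab` and the counts are hypothetical and do not touch the sent2vec model. It also assumes the C++ output is otherwise frequency-sorted, which this thread does not confirm.

```python
PLACEHOLDER = "<PLACEHOLDER>"

def order_vocab(counts):
    """Sort tokens by descending count, then move the placeholder to index 0.

    Hypothetical helper for illustration only.
    """
    tokens = [w for w, _ in sorted(counts.items(), key=lambda x: x[1], reverse=True)]
    if PLACEHOLDER in tokens:
        # Remove and re-insert so the remaining tokens keep their relative order.
        tokens.remove(PLACEHOLDER)
        tokens.insert(0, PLACEHOLDER)
    return tokens

counts = {"cat": 3, "the": 10, PLACEHOLDER: 0, "sat": 1}
print(order_vocab(counts))  # ['<PLACEHOLDER>', 'the', 'cat', 'sat']
```

Moving the placeholder rather than swapping it keeps every other token in its frequency rank, so only the row of the placeholder changes position.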

jbcdnr commented 5 years ago

The Python code sorts the tokens by frequency before fetching the vectors

Is there a reason for this? I would prefer to get the same order as in the FastText model. My last commit modifies the Python binding in this direction.