Understanding the output of training

facebookresearch / MUSE

A library for Multilingual Unsupervised or Supervised word Embeddings

Understanding the output of training #121

Closed euler16 closed 5 years ago

euler16 commented 5 years ago

After training, in dumped/debug/xohu3xpdfn I get the following files (trained for English-Hindi):

best_mapping.pth
vectors-en.txt
vectors-hi.txt
params.pkl
train.log

Do the .txt files contain the mapped vectors? If not, how can I obtain the mapping?

glample commented 5 years ago

Yes, the .txt are the mapped vectors. If English was the source during training, English embeddings will be identical to the input ones, and the Hindi ones will be mapped to the English embedding space.
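For anyone who wants to inspect the exported vectors-*.txt files directly, here is a minimal sketch of a reader. It assumes the files use the word2vec-style text layout (an optional "count dim" header line, then one word followed by its space-separated values per line); that layout is an assumption rather than something confirmed in this thread, and read_embeddings is a hypothetical helper, not part of MUSE.

```python
import io


def read_embeddings(handle, max_vocab=None):
    """Parse word2vec-style text embeddings from an open text handle.

    Assumed layout (not confirmed in this thread): an optional
    "count dim" header, then "word v1 v2 ... vN" on each line.
    Returns (dict mapping word -> list of floats, dimension).
    """
    vectors = {}
    dim = None
    for line_no, line in enumerate(handle):
        parts = line.rstrip().split(" ")
        if line_no == 0 and len(parts) == 2:
            # Header line: vocabulary size and embedding dimension.
            dim = int(parts[1])
            continue
        word, values = parts[0], [float(v) for v in parts[1:]]
        if dim is None:
            dim = len(values)
        if len(values) != dim:
            raise ValueError(f"line {line_no + 1}: expected {dim} values")
        # Keep the first occurrence if a word appears more than once.
        vectors.setdefault(word, values)
        if max_vocab is not None and len(vectors) >= max_vocab:
            break
    return vectors, dim


# Tiny in-memory example standing in for a file such as vectors-en.txt.
sample = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 1 0 -1\n")
embeddings, dimension = read_embeddings(sample)
print(dimension, sorted(embeddings))
```

With a real file, the same function can be called as read_embeddings(open("vectors-en.txt", encoding="utf-8"), max_vocab=200000).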

euler16 commented 5 years ago

And best_mapping.pth is simply the matrix that converted the original Hindi vectors to the mapped ones, right? Secondly, if I want to evaluate the mapped vectors, do I simply run:

python evaluate.py --src_lang en --tgt_lang hi --src_emb vectors-en.txt --tgt_emb vectors-hi.txt --max_vocab 200000

euler16 commented 5 years ago

The reason I wanted to clarify this is that evaluate.py uses the trainer, so I was under the impression that it trains from scratch and then evaluates.

glample commented 5 years ago

Correct, the best_mapping.pth just contains the matrix that projects the Hindi vectors to the English ones. And indeed, evaluate.py will not do any more training but just evaluate.

virendra-pathak commented 5 years ago

> Yes, the .txt are the mapped vectors. If English was the source during training, English embeddings will be identical to the input ones, and the Hindi ones will be mapped to the English embedding space.

Hi @glample, sorry for this basic question. After going through the code, my understanding is that if English was the source embedding (src_emb), then English will be mapped to the Hindi embedding space, right?

As in trainer.py at line 263:

    logger.info("Map source embeddings to the target space ...")
    for i, k in enumerate(range(0, len(src_emb), bs)):
        x = Variable(src_emb[k:k + bs], volatile=True)
        src_emb[k:k + bs] = self.mapping(x.cuda() if params.cuda else x).data.cpu()

Or am I missing something?

Thanks.

glample commented 5 years ago

Oh yeah my bad. I inverted English / Hindi in the previous post. The target embeddings are not moved, only the source are mapped to the target embeddings. So if --src_lang en --tgt_lang hi the Hindi embeddings will be identical (unless they are normalized), and the English ones will be mapped to them. Usually we set English as the target, as English embeddings are trained on more data and are of higher quality.
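To make the direction concrete, here is a small sketch in plain Python. It assumes the mapping is a bias-free linear layer, so each mapped source vector is W times x; the matrix and vectors below are made-up toy values, and apply_mapping is a hypothetical helper rather than a MUSE function.

```python
def apply_mapping(W, vectors):
    """Return W @ x for every vector x, with W given as a list of rows."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in W] for vec in vectors]


# Toy 2-d example: W is a 90-degree rotation (an orthogonal matrix).
W = [[0.0, -1.0],
     [1.0, 0.0]]

src_emb = [[1.0, 0.0], [0.0, 1.0]]  # e.g. English when --src_lang en
tgt_emb = [[0.0, 1.0], [-1.0, 0.0]]  # e.g. Hindi when --tgt_lang hi

# Only the source side goes through the mapping; the target side is
# left as it is (apart from any normalization applied beforehand).
mapped_src = apply_mapping(W, src_emb)

print(mapped_src)  # source vectors expressed in the target space
print(tgt_emb)     # target vectors, unchanged
```

In this toy case the mapped source vectors land exactly on the target vectors, which is the situation the alignment tries to approximate for translation pairs.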

virendra-pathak commented 5 years ago

@glample Thanks for the clarification.

scofield7419 commented 4 years ago

> Oh yeah my bad. I inverted English / Hindi in the previous post. The target embeddings are not moved, only the source are mapped to the target embeddings. So if --src_lang en --tgt_lang hi the Hindi embeddings will be identical (unless they are normalized), and the English ones will be mapped to them. Usually we set English as the target, as English embeddings are trained on more data and are of higher quality.

That matters a lot! In fact, this should be stated explicitly on the README home page.