Closed euler16 closed 5 years ago
Yes, the .txt are the mapped vectors. If English was the source during training, English embeddings will be identical to the input ones, and the Hindi ones will be mapped to the English embedding space.
And best_mapping.pth is simply the matrix that converts the original Hindi vectors to the mapped ones, right? Secondly, if I want to evaluate the mapped vectors, do I simply run:
python evaluate.py --src_lang en --tgt_lang hi --src_emb vectors-en.txt --tgt_emb vectors-hi.txt --max_vocab 200000
I wanted to clarify this because evaluate.py uses the trainer, so I was under the impression that it trains from scratch and then evaluates.
Correct, the best_mapping.pth just contains the matrix that projects the Hindi vectors to the English ones. And indeed, evaluate.py will not do any more training but just evaluate.
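To make the "it is just a matrix" point concrete, here is a minimal sketch of what projecting vectors with a learned matrix amounts to. It is not the repository's code: `apply_mapping` is a hypothetical helper, and the 2-D rotation `W` is a toy stand-in for whatever matrix training produced. No training happens here, only a matrix-vector product per word vector.

```python
import math

def apply_mapping(W, vectors):
    """Return W @ x for every vector x; W is given as a list of rows."""
    return [[sum(w * v for w, v in zip(row, x)) for row in W] for x in vectors]

# Hypothetical stand-in for the learned matrix: a 90-degree rotation in 2-D.
theta = math.pi / 2
W = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

# Toy embeddings to be projected (real ones would have e.g. 300 dimensions).
vectors_to_map = [[1.0, 0.0], [0.0, 2.0]]
mapped = apply_mapping(W, vectors_to_map)
```

With this toy rotation, the first vector ends up (numerically) on the second axis and the second one on the negative first axis. The vectors are moved, but their lengths are unchanged.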
Hi @glample, sorry for the basic question. After going through the code, my understanding is that if English is the source embedding (src_emb), then English will be mapped to the Hindi embedding space, right?
As in trainer.py at line 263:
    logger.info("Map source embeddings to the target space ...")
    for i, k in enumerate(range(0, len(src_emb), bs)):
        x = Variable(src_emb[k:k + bs], volatile=True)
        src_emb[k:k + bs] = self.mapping(x.cuda() if params.cuda else x).data.cpu()
Or am I missing something?
Thanks.
Oh yeah, my bad. I inverted English / Hindi in my previous post. The target embeddings are not moved; only the source embeddings are mapped to the target space. So with --src_lang en --tgt_lang hi, the Hindi embeddings will be identical to the input ones (unless they are normalized), and the English ones will be mapped onto them. Usually we set English as the target, as English embeddings are trained on more data and are of higher quality.
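The direction of the mapping can be illustrated with a small sketch that mirrors the batched loop quoted from trainer.py above. This is a simplified illustration in plain Python, not the repository's implementation: `map_source_in_batches` and `scale_by_two` are hypothetical names, and the scaling function stands in for the learned linear mapping.

```python
def map_source_in_batches(src_emb, tgt_emb, mapping, bs=2):
    """Overwrite src_emb batch by batch with mapping(batch); tgt_emb is never written."""
    for k in range(0, len(src_emb), bs):
        src_emb[k:k + bs] = mapping(src_emb[k:k + bs])
    return src_emb, tgt_emb

def scale_by_two(batch):
    # Hypothetical stand-in for the learned mapping (assumption, not the real matrix).
    return [[2.0 * v for v in vec] for vec in batch]

src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # plays the role of the --src_lang vectors
tgt = [[0.5, 0.5], [0.25, 0.75]]            # plays the role of the --tgt_lang vectors
tgt_before = [vec[:] for vec in tgt]        # copy kept to compare afterwards

map_source_in_batches(src, tgt, scale_by_two, bs=2)
```

After the call, every source vector has been replaced by its mapped version, while the target vectors are exactly what they were before. That is the behaviour described above for --src_lang en --tgt_lang hi, leaving aside any normalization.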
@glample Thanks for the clarification.
That matters a lot! In fact, you should state this explicitly on the README home page.
After training, in dumped/debug/xohu3xpdfn I get the following files (trained for English-Hindi).
Do the .txt files contain the mapped vectors? If not, how can I obtain the mapping?