McGill-NLP / llm2vec

Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders'
https://mcgill-nlp.github.io/llm2vec/
MIT License

MNTP Question #73

Closed bdytx5 closed 1 month ago

bdytx5 commented 1 month ago

Hi, great work on this!

Just had a question about the MNTP objective. In the paper, you mention: "when predicting a masked token at position i, we compute the loss based on the logits obtained from the token representation at the previous position i − 1, not the masked position itself."

I was a bit confused about this, and about why it is done this way. Could you provide a more detailed explanation and the intuition behind it?

Thanks, Brett

vaibhavad commented 1 month ago

Hi @bdytx5,

thanks for your interest in our work. We did this to align our training objective with the pre-training setup of decoder-only LLMs. Decoder-only LMs are trained to predict the token at position i using the representation of the token at position i−1. By making our training objective follow the same pattern, the intuition is that we make maximal use of the model's inherent capabilities.
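For concreteness, here is a minimal sketch of how such a shifted loss can be computed, assuming `logits` of shape `(batch, seq_len, vocab)` from the decoder and `labels` of shape `(batch, seq_len)` with unmasked positions set to `-100`. The names and shapes are illustrative, not the repository's exact code:

```python
import torch
import torch.nn.functional as F

def mntp_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Predict the masked token at position i from the representation at i-1:
    # drop the last logit and the first label so they line up causally,
    # exactly as in standard next-token (causal LM) training.
    shifted_logits = logits[:, :-1, :]   # predictions from positions 0..L-2
    shifted_labels = labels[:, 1:]       # targets at positions 1..L-1
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        shifted_labels.reshape(-1),
        ignore_index=-100,               # only masked positions contribute
    )
```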

Let me know if you have any further questions.

bdytx5 commented 1 month ago

ok, thanks!