ContextualAI / gritlm

Generative Representational Instruction Tuning
https://arxiv.org/abs/2402.09906
MIT License

About training data of embedding #38

Closed zillion-zhao closed 3 weeks ago

zillion-zhao commented 1 month ago

Hello. I see that in the toy_data for embedding, each query has several negative samples, but when I download MEDI2 (parquet), each query has only one negative sample. Which setting is used for training? Can MEDI2.parquet be used directly for training, or do I need to take additional steps?
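For reference, one quick way to check how many negatives each MEDI2 row carries is to load the parquet with pandas; a minimal sketch, assuming the file path and the column names ("query", "pos", "neg") match the downloaded data:

```python
import pandas as pd

# Hypothetical path and column names -- adjust to the actual MEDI2 parquet schema.
df = pd.read_parquet("medi2.parquet")

# Count how many negatives each example carries; expect mostly 1 per query
# if each query ships with a single hard negative.
neg_counts = df["neg"].apply(len)
print(neg_counts.value_counts())
```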

Muennighoff commented 1 month ago

We use only one hard negative during training. This is controlled via --train_group_size: with --train_group_size 2, exactly one hard negative is chosen per query. You can use MEDI2.parquet directly for training.
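To illustrate the relationship between train_group_size and the number of hard negatives, here is a simplified sketch (not the repo's actual collator code, and the field names "pos"/"neg" are assumed to mirror the toy embedding data): each training group holds one positive passage plus train_group_size - 1 hard negatives.

```python
import random
from typing import Dict, List


def build_group(example: Dict[str, List[str]], train_group_size: int = 2) -> List[str]:
    """Illustrative only: one positive plus (train_group_size - 1) hard negatives."""
    group = [random.choice(example["pos"])]          # one positive passage
    n_neg = train_group_size - 1                      # --train_group_size 2 -> 1 hard negative
    group += random.sample(example["neg"], k=min(n_neg, len(example["neg"])))
    return group


# Example: a MEDI2-style row with a single hard negative.
row = {"query": ["..."], "pos": ["relevant passage"], "neg": ["hard negative passage"]}
print(build_group(row, train_group_size=2))  # [positive, one negative]
```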

zillion-zhao commented 1 month ago

OK, I just noticed this option. Thank you for your kind reply.