ContextualAI / gritlm

Generative Representational Instruction Tuning
https://arxiv.org/abs/2402.09906
MIT License
566 stars · 40 forks

meaning of "neg" in embedding dataset #64

Open zhj2022 opened 2 hours ago

zhj2022 commented 2 hours ago

According to the paper, GritLM uses in-batch negatives as the negative samples for contrastive learning. But in the toy embedding dataset, each JSON record contains the key "neg", which cannot be removed. So I don't understand why we need additional negative samples when we already have in-batch negatives, or how these extra samples are constructed. (From the toy dataset, they don't look like hard negatives of the query sentence.)
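
For concreteness, a record in the toy embedding data has roughly this shape (the sentences below are made-up placeholders; only the `query`/`pos`/`neg` keys reflect the format the question describes):

```python
# Hypothetical illustration of one JSONL record in the toy embedding data.
# The key names "query"/"pos"/"neg" follow the dataset format; the text
# itself is invented for this example.
import json

record = {
    "query": "What is the capital of France?",
    "pos": ["Paris is the capital and largest city of France."],
    # A hard negative is topically related to the query but not a correct answer.
    "neg": ["Berlin is the capital of Germany."],
}
print(json.dumps(record))
```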

Muennighoff commented 2 hours ago

These are hard negatives; it uses both in-batch negatives & hard negatives. You can make it not use the hard negatives by setting the train group size to 1, I think.
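
For intuition, here is a minimal sketch (not the repo's actual code) of how a contrastive InfoNCE loss can combine in-batch negatives with per-query hard negatives. The `group_size` argument here plays the role of the train group size mentioned above, under the assumption that each query's passage group is laid out as 1 positive followed by `group_size - 1` hard negatives:

```python
# Minimal sketch, assuming the (1 positive + group_size - 1 hard negatives)
# layout; this illustrates the general technique, not GritLM's exact code.
import torch
import torch.nn.functional as F

def contrastive_loss(q, docs, group_size, temperature=0.05):
    """q: (B, D) query embeddings.
    docs: (B * group_size, D) passage embeddings, per query laid out as
          [pos, hard_neg_1, ..., hard_neg_{group_size - 1}].
    With group_size == 1 only the positives remain, so the loss reduces
    to pure in-batch negatives."""
    q = F.normalize(q, dim=-1)
    docs = F.normalize(docs, dim=-1)
    # Each query is scored against every passage in the batch: its own
    # positive, its own hard negatives, and all other queries' passages
    # (the in-batch negatives).
    scores = q @ docs.T / temperature                     # (B, B * group_size)
    # The positive for query i sits at column i * group_size.
    target = torch.arange(q.size(0), device=q.device) * group_size
    return F.cross_entropy(scores, target)

# Toy usage: batch of 4 queries, group_size 2 (1 positive + 1 hard negative).
B, D, G = 4, 8, 2
loss = contrastive_loss(torch.randn(B, D), torch.randn(B * G, D), group_size=G)
print(loss.item())
```

Setting `group_size` to 1 drops each query's hard-negative columns but keeps the other queries' positives as in-batch negatives, which is presumably what the train-group-size suggestion achieves.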