ContextualAI / gritlm

Generative Representational Instruction Tuning
https://arxiv.org/abs/2402.09906
MIT License
475 stars 33 forks source link

For document clustering, should we leave instruction blank? #6

Open griff4692 opened 4 months ago

griff4692 commented 4 months ago

Thanks -- I am using Grit for document embeddings that will be used to score doc-to-doc similarity.

Should I add an instruction or leave it blank?

Thank you, Griffin

Muennighoff commented 4 months ago

Thanks -- I am using Grit for document embeddings that will be used to score doc-to-doc similarity.

Should I add an instruction or leave it blank?

Thank you, Griffin

So is it like STS rather than Retrieval? I would probably add them in that case, but it may make sense to try both.

griff4692 commented 4 months ago

Thanks -- I am using Grit for document embeddings that will be used to score doc-to-doc similarity. Should I add an instruction or leave it blank? Thank you, Griffin

So is it like STS rather than Retrieval? I would probably add them in that case, but it may make sense to try both.

Thanks for the reply! Yes - in order to cluster documents for in-context pre-training (https://arxiv.org/abs/2310.10638).

Was going to try "Identify the main topics from a medical document." but wasn't sure how instructions for embeddings are meant to be worded for gritlm.

Muennighoff commented 4 months ago

Thanks -- I am using Grit for document embeddings that will be used to score doc-to-doc similarity. Should I add an instruction or leave it blank? Thank you, Griffin

So is it like STS rather than Retrieval? I would probably add them in that case, but it may make sense to try both.

Thanks for the reply! Yes - in order to cluster documents for in-context pre-training (https://arxiv.org/abs/2310.10638).

Was going to try "Identify the main topics from a medical document." but wasn't sure how instructions for embeddings are meant to be worded for gritlm.

Yeah I think for clustering you'll get slightly better performance if you include an instruction. The one you proposed sounds good to me!