UKPLab / useb

Heterogeneous, Task- and Domain-Specific Benchmark for Unsupervised Sentence Embeddings used in the TSDAE paper: https://arxiv.org/abs/2104.06979.
Apache License 2.0

Why not use BEIR? #1

Closed · Muennighoff closed this issue 2 years ago

Muennighoff commented 2 years ago

Hey guys, awesome work. Simple question: why did you not just use BEIR (& possibly extend it with training datasets)?

kwang2049 commented 2 years ago

Hi @Muennighoff,

Thanks for your question. This is a very interesting one!

This was because the focus was different at that time: this paper and its evaluation code are dedicated to sentence-level similarity tasks, which are quite different from the query-document setting of the IR tasks in BEIR.
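
To make the contrast concrete, here is a minimal sketch of the two settings (illustration only, not the paper's evaluation code; the model name and texts are placeholders):

```python
# Illustrative contrast between the two settings; model name and texts are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distilbert-base-nli-mean-tokens")

# Sentence-level similarity (the useb/TSDAE setting): score a symmetric pair of short sentences.
pair_score = util.cos_sim(
    model.encode("How do I reset my router?"),
    model.encode("What is the way to restart a router?"),
)

# IR setting (BEIR): a short query is scored against much longer documents.
query_emb = model.encode("effects of caffeine on sleep")
doc_embs = model.encode([
    "Caffeine is a central nervous system stimulant. Several studies report that ...",
    "An unrelated long passage about gardening and soil composition ...",
])
doc_scores = util.cos_sim(query_emb, doc_embs)

print(pair_score, doc_scores)
```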

Actually, one can refer to my latest work, "GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval". We also tried TSDAE on BEIR and found that it works well for domain adaptation (cf. Table 1) but poorly in the purely unsupervised setting (cf. Table 9 in the Appendix).
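
For reference, TSDAE-style unsupervised training on a target corpus can be set up roughly as below; this sketch uses the sentence-transformers denoising auto-encoder components with a placeholder model and corpus, and is not the exact configuration from the TSDAE/GPL papers.

```python
# Rough sketch of TSDAE-style unsupervised training on a target corpus.
# Placeholder model and corpus; not the exact TSDAE/GPL paper configuration.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models
from sentence_transformers.datasets import DenoisingAutoEncoderDataset
from sentence_transformers.losses import DenoisingAutoEncoderLoss

model_name = "distilbert-base-uncased"
word_emb = models.Transformer(model_name)
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode="cls")
model = SentenceTransformer(modules=[word_emb, pooling])

# Sentences from the target domain, e.g. extracted from the corpus you want to adapt to.
train_sentences = ["first sentence from the target corpus", "second sentence ..."]
train_dataset = DenoisingAutoEncoderDataset(train_sentences)  # adds deletion noise on the fly
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# Encoder-decoder reconstruction loss with tied weights, as in the TSDAE recipe.
loss = DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)

model.fit(
    train_objectives=[(train_loader, loss)],
    epochs=1,
    weight_decay=0,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
)
```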

Muennighoff commented 2 years ago

I see, thanks for your fast reply

Muennighoff commented 2 years ago

Also if you don't mind me asking here:

Is the MLM column in Table 9 using pre-trained DistilBERT with MEAN pooling to encode, without any further training? Or is it MLM done on MS MARCO?

Just for feedback: it's a bit confusing to have "MS MARCO" as a model in the table, since in most other papers in this niche it refers to a dataset.

kwang2049 commented 2 years ago

  • I guess it depends on the dataset: Quora & ArguAna from BEIR, for example, contain duplicates across queries and documents and hence have similar average lengths for both, so they may be closer to sentence-level similarity (a quick way to check the length statistics is sketched below).
  • Your BioASQ score of 70.7 for BM25 in GPL Table 1 seems a bit off compared to the scores in the BEIR paper.
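
As a side note, the length statistic mentioned in the first point can be checked quickly; the sketch below assumes BEIR's GenericDataLoader and the download URL pattern from the BEIR examples, with Quora only as an example dataset.

```python
# Sketch: compare average query vs. document length on a BEIR dataset.
# Assumes the beir package; "quora" is only an example dataset name.
from beir import util
from beir.datasets.data_loader import GenericDataLoader

dataset = "quora"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")

corpus, queries, _qrels = GenericDataLoader(data_path).load(split="test")

avg_query_len = sum(len(q.split()) for q in queries.values()) / len(queries)
avg_doc_len = sum(len(d["text"].split()) for d in corpus.values()) / len(corpus)
print(f"avg query length: {avg_query_len:.1f} words; avg doc length: {avg_doc_len:.1f} words")
```
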
kwang2049 commented 2 years ago

  1. The setting is: do MLM training on the target corpus (e.g. FiQA) and then directly run the evaluation (yes, with mean pooling) on the target dataset, without any other training (a rough sketch of this setting is given after this list). The settings are the same for the other unsupervised/pre-training methods.
  2. Thanks for your comments! Yes, we were also quite unsure about how to name the model. We thought about using "MarginMSE", but that is also ambiguous, since the new method GPL also involves MarginMSE. We will consider other options.
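
For concreteness, a rough sketch of that first setting (MLM on the target corpus, then encoding with mean pooling and no further training) is below; the model name, corpus, and hyperparameters are placeholders, not the exact configuration from the paper.

```python
# Rough sketch: MLM on the target corpus, then encode with mean pooling, no further training.
# Model name, corpus, and hyperparameters are placeholders, not the paper's exact setup.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from sentence_transformers import SentenceTransformer, models

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
mlm_model = AutoModelForMaskedLM.from_pretrained(model_name)

# Target-domain sentences (e.g. the FiQA corpus); here just a tiny placeholder list.
target_corpus = ["first sentence from the target domain", "second sentence ..."]
encodings = [tokenizer(t, truncation=True, max_length=256) for t in target_corpus]

trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="mlm-target", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=encodings,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
trainer.save_model("mlm-target")
tokenizer.save_pretrained("mlm-target")

# Evaluate directly: wrap the MLM-trained encoder with mean pooling, no further training.
word_emb = models.Transformer("mlm-target")
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode="mean")
encoder = SentenceTransformer(modules=[word_emb, pooling])
embeddings = encoder.encode(["a query from the target dataset"])
```
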
Muennighoff commented 2 years ago

Makes sense, thanks a lot!