UKPLab / useb

Heterogeneous, Task- and Domain-Specific Benchmark for Unsupervised Sentence Embeddings used in the TSDAE paper: https://arxiv.org/abs/2104.06979.
Apache License 2.0

Why not use BEIR? #1

Closed · Muennighoff closed this issue 2 years ago

Muennighoff commented 2 years ago

Hey guys, awesome work. Simple question: why did you not just use BEIR (& possibly extend it with training datasets)?

kwang2049 commented 2 years ago

Hi @Muennighoff,

Thanks for your question. This is a very interesting one!

This was because the focus was different at that time: this paper and its evaluation code are dedicated to sentence-level similarity tasks, which are quite different from the query-document setting of the IR tasks in BEIR.
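
To make the contrast concrete, here is a minimal sketch of the two settings (illustration only, not the paper's evaluation code; the model name and texts are placeholders):

```python
# Illustrative contrast between the two settings; model name and texts are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distilbert-base-nli-mean-tokens")

# Sentence-level similarity (the useb/TSDAE setting): score a symmetric pair of short sentences.
pair_score = util.cos_sim(
    model.encode("How do I reset my router?"),
    model.encode("What is the way to restart a router?"),
)

# IR setting (BEIR): a short query is scored against much longer documents.
query_emb = model.encode("effects of caffeine on sleep")
doc_embs = model.encode([
    "Caffeine is a central nervous system stimulant. Several studies report that ...",
    "An unrelated long passage about gardening and soil composition ...",
])
doc_scores = util.cos_sim(query_emb, doc_embs)

print(pair_score, doc_scores)
```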

Actually, one can refer to my latest work, "GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval". We also tried TSDAE on BEIR and found that it works well for domain adaptation (cf. Table 1) but poorly in the purely unsupervised setting (cf. Table 9 in the Appendix).
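
For reference, TSDAE-style unsupervised training on a target corpus can be set up roughly as below; this sketch uses the sentence-transformers denoising auto-encoder components with a placeholder model and corpus, and is not the exact configuration from the TSDAE/GPL papers.

```python
# Rough sketch of TSDAE-style unsupervised training on a target corpus.
# Placeholder model and corpus; not the exact TSDAE/GPL paper configuration.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models
from sentence_transformers.datasets import DenoisingAutoEncoderDataset
from sentence_transformers.losses import DenoisingAutoEncoderLoss

model_name = "distilbert-base-uncased"
word_emb = models.Transformer(model_name)
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode="cls")
model = SentenceTransformer(modules=[word_emb, pooling])

# Sentences from the target domain, e.g. extracted from the corpus you want to adapt to.
train_sentences = ["first sentence from the target corpus", "second sentence ..."]
train_dataset = DenoisingAutoEncoderDataset(train_sentences)  # adds deletion noise on the fly
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# Encoder-decoder reconstruction loss with tied weights, as in the TSDAE recipe.
loss = DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)

model.fit(
    train_objectives=[(train_loader, loss)],
    epochs=1,
    weight_decay=0,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
)
```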

Muennighoff commented 2 years ago

I see, thanks for your fast reply

Muennighoff commented 2 years ago

Also if you don't mind me asking here:

Is the MLM column in Table 9 using pre-trained DistilBERT with MEAN pooling to encode, without any further training? Or is it MLM done on MS MARCO?

Just for feedback: it's a bit confusing to have "MS MARCO" as a model in the table, since in most other papers in this niche it refers to a dataset.

kwang2049 commented 2 years ago

  • I guess it depends on the dataset: Quora & ArguAna from BEIR, for example, contain duplicates across queries and documents and hence have similar average lengths for both, so they may be closer to sentence-level similarity (a quick way to check the length statistics is sketched below).
  • Your BioASQ score of 70.7 for BM25 in GPL Table 1 seems a bit off compared to the scores in the BEIR paper.
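
As a side note, the length statistic mentioned in the first point can be checked quickly; the sketch below assumes BEIR's GenericDataLoader and the download URL pattern from the BEIR examples, with Quora only as an example dataset.

```python
# Sketch: compare average query vs. document length on a BEIR dataset.
# Assumes the beir package; "quora" is only an example dataset name.
from beir import util
from beir.datasets.data_loader import GenericDataLoader

dataset = "quora"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")

corpus, queries, _qrels = GenericDataLoader(data_path).load(split="test")

avg_query_len = sum(len(q.split()) for q in queries.values()) / len(queries)
avg_doc_len = sum(len(d["text"].split()) for d in corpus.values()) / len(corpus)
print(f"avg query length: {avg_query_len:.1f} words; avg doc length: {avg_doc_len:.1f} words")
```
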
kwang2049 commented 2 years ago

  1. The setting is: do MLM training on the target corpus (e.g. FiQA) and then directly run the evaluation (yes, with mean pooling) on the target dataset, without any other training (a rough sketch of this setting is given after this list). The settings are the same for the other unsupervised/pre-training methods.
  2. Thanks for your comments! Yes, we were also quite unsure about how to name the model. We thought about using "MarginMSE", but that is also ambiguous, since the new method GPL also involves MarginMSE. We will consider other options.
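
For concreteness, a rough sketch of that first setting (MLM on the target corpus, then encoding with mean pooling and no further training) is below; the model name, corpus, and hyperparameters are placeholders, not the exact configuration from the paper.

```python
# Rough sketch: MLM on the target corpus, then encode with mean pooling, no further training.
# Model name, corpus, and hyperparameters are placeholders, not the paper's exact setup.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from sentence_transformers import SentenceTransformer, models

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
mlm_model = AutoModelForMaskedLM.from_pretrained(model_name)

# Target-domain sentences (e.g. the FiQA corpus); here just a tiny placeholder list.
target_corpus = ["first sentence from the target domain", "second sentence ..."]
encodings = [tokenizer(t, truncation=True, max_length=256) for t in target_corpus]

trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="mlm-target", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=encodings,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
trainer.save_model("mlm-target")
tokenizer.save_pretrained("mlm-target")

# Evaluate directly: wrap the MLM-trained encoder with mean pooling, no further training.
word_emb = models.Transformer("mlm-target")
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode="mean")
encoder = SentenceTransformer(modules=[word_emb, pooling])
embeddings = encoder.encode(["a query from the target dataset"])
```
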
Muennighoff commented 2 years ago

Makes sense, thanks a lot!