beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
http://beir.ai

Include enriched sparse lexical retrieval methods #10

Open joshdevins opened 3 years ago

joshdevins commented 3 years ago

First, a thank you. The paper and repo have been fantastic resources to help conversations around out-of-domain retrieval!

Second, a feature request. I think it would be very interesting to see some of the document/index enrichment approaches added to the benchmark and paper discussion, as extensions to sparse lexical retrieval. You mention both doc2query and DeepCT/HDCT in the paper but don't provide benchmark data for them. Since they are trained on MS MARCO, it would be interesting to see whether they perform well out-of-domain, and how they compare to BM25+CE and ColBERT, which both perform very well out-of-domain.

thakur-nandan commented 3 years ago

Hi @joshdevins,

I also find this feature interesting, and it is already planned for addition to the BEIR repository.

I have started integrating with Pyserini, and we currently have Anserini-BM25 and RM3 expansion in BEIR. doc2query would be next and should be easy to add to the repo.
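For reference, running a lexical baseline against a BEIR dataset currently looks roughly like this. This is only a minimal sketch using the Elasticsearch-backed `BM25Search`; the dataset, index name, and host are placeholders, and the exact API may change:

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.lexical import BM25Search as BM25

# Download and load a BEIR dataset (scifact used here as a placeholder).
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# BM25 over an Elasticsearch index; hostname and index name are placeholders.
model = BM25(index_name="scifact", hostname="localhost:9200", initialize=True)
retriever = EvaluateRetrieval(model)

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)
```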

Regarding DeepCT, I would need to have a look at the original repository and check how easily it can be integrated with the BEIR repo. Hopefully, it should not be difficult to integrate.

I shall update you once both methods have been added to the BEIR repository.

Kind Regards, Nandan

joshdevins commented 3 years ago

Ok, that sounds great @NThakur20. I'm definitely more interested in doc2query, as it performs much better in-domain than DeepCT, so even that additional data point in the benchmark would be really useful.

joshdevins commented 3 years ago

Hey @NThakur20, a colleague pointed me to a new paper that might also be interesting and that roughly fits into the sparse lexical retrieval category. It looks like they already have model checkpoints, but indexing and retrieval use custom indices (as far as I can tell).

COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List https://github.com/luyug/COIL

thakur-nandan commented 3 years ago

Hi @joshdevins,

Thanks for mentioning the resource. The paper was presented at NAACL, and I had a long chat with its authors about it. We are in talks to integrate COIL into the BEIR repository.

Kind Regards, Nandan

thakur-nandan commented 3 years ago

Hi @joshdevins,

I've added code to evaluate docT5query with the BEIR benchmark. You can find sample code to run and evaluate it using Pyserini-BM25 here: (link).
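For anyone following along, the core idea of docT5query is to generate a handful of queries per document and append them to the document text before BM25 indexing. Below is a minimal sketch of that expansion step, assuming the public castorini/doc2query-t5-base-msmarco checkpoint and Hugging Face transformers; the sample document and generation settings are illustrative, not the exact settings used in the BEIR example:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "castorini/doc2query-t5-base-msmarco"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

doc = "The Manhattan Project was a research and development undertaking during World War II."
inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=512)

# Sample a few plausible queries for this document.
outputs = model.generate(
    **inputs,
    max_length=64,
    do_sample=True,
    top_k=10,
    num_return_sequences=3,
)
expansions = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# The expanded document is what gets indexed with Anserini/Pyserini BM25.
expanded_doc = doc + " " + " ".join(expansions)
print(expanded_doc)
```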

Kind Regards, Nandan Thakur

joshdevins commented 3 years ago

I'm gonna have a look at this soon. Can I add results here for inclusion in the Google Sheets "leaderboard"?

thakur-nandan commented 3 years ago

Hi @joshdevins,

Yes. I haven't had time to run docT5query on all the BEIR datasets; as you mentioned, it takes time. Feel free to share your docT5query results on the BEIR datasets here, and I would be happy to add them to the Google Sheets leaderboard.

Also, I finally have a working example up for DeepCT. The original DeepCT repository is quite old and only works with TensorFlow 1.x, so I had to adapt it to recent TensorFlow versions. You can find a sneak peek here: https://github.com/UKPLab/beir/blob/development/examples/retrieval/evaluation/sparse/evaluate_deepct.py
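For context, DeepCT predicts a per-term importance weight with BERT, and those weights are then quantized into integer term frequencies so the document can be indexed with an ordinary BM25 engine. A hypothetical sketch of that quantization step (the function name, weights, and scale factor are illustrative, not the actual BEIR or DeepCT code):

```python
# Hypothetical sketch: turn predicted DeepCT term weights into a BM25-indexable
# pseudo-document by repeating each term in proportion to its predicted importance.
def deepct_to_pseudo_doc(term_weights: dict, scale: int = 100) -> str:
    tokens = []
    for term, weight in term_weights.items():
        tf = max(1, round(weight * scale))  # quantize the weight into an integer term frequency
        tokens.extend([term] * tf)
    return " ".join(tokens)

# Illustrative weights only; real weights come from the DeepCT model.
print(deepct_to_pseudo_doc({"coronavirus": 0.92, "vaccine": 0.45, "the": 0.01}))
```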

I will merge it into the main branch soon! I enjoyed your debate at Berlin Buzzwords 2021.

Kind Regards, Nandan

joshdevins commented 3 years ago

I'm running most of the docT5query examples now, but I won't have access to a couple of datasets since they require accepting dataset use agreements. I'll post results for what I can run here.

thakur-nandan commented 3 years ago

Sounds good, thanks @joshdevins! I look forward to the results 😊

Kind Regards, Nandan

joshdevins commented 3 years ago

Results for doc2query-T5 are as follows. As mentioned above, the datasets we don't have access to due to usage restrictions have been excluded. The baseline here is the Anserini BM25 score.

| dataset | baseline | score | +/- |
| --- | --- | --- | --- |
| msmarco | 0.228 | 0.5064 | 🔼 |
| fever | 0.753 | 0.6926 | 🔽 |
| climate-fever | 0.213 | 0.1772 | 🔽 |
| hotpotqa | 0.603 | 0.5441 | 🔽 |
| dbpedia-entity | 0.313 | 0.3012 | 🔽 |
| nq | 0.329 | 0.3412 | 🔼 |
| webis-touche2020 | 0.614 | 0.5246 | 🔽 |
| trec-covid | 0.656 | 0.6609 | 🔼 |
| quora | 0.742 | 0.7821 | 🔼 |
| cqadupstack | 0.316 | 0.2937 | 🔽 |
| fiqa | 0.236 | 0.2433 | 🔼 |
| scidocs | 0.158 | 0.1558 | 🔽 |
| scifact | 0.665 | 0.6599 | 🔽 |

I don't understand the msmarco result here though. Something seems to be off on that one. I have the pyserini.jsonl in case you want to have a look — I don't see anything obviously wrong with the index.
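If it helps, a quick sanity check on the expanded corpus could look something like this. This is only a sketch, assuming pyserini.jsonl has one JSON object per line with "id" and "contents" fields, as Pyserini's JsonCollection expects:

```python
import json

# Count documents and average expanded length in the Pyserini corpus file.
n_docs, total_tokens = 0, 0
with open("pyserini.jsonl") as f:
    for line in f:
        doc = json.loads(line)
        n_docs += 1
        total_tokens += len(doc["contents"].split())

print(f"{n_docs} documents, avg length {total_tokens / n_docs:.1f} whitespace tokens")
```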

Some thoughts: