Open · joshdevins opened this issue 3 years ago
Hi @joshdevins,
I also find this feature interesting, and it is already planned to be added to the BEIR repository.
I have started integrating with Pyserini, and we currently have Anserini-BM25 and RM3 expansion added in BEIR. Doc2query would be the next addition and should be easy to add to the repo.
Regarding DeepCT, I would need to have a look at the original repository and check how easily it can be integrated with the BEIR repo. Hopefully, it should not be difficult to integrate.
I shall update you once both methods have been added to the BEIR repository.
Kind Regards, Nandan
Ok, that sounds great @NThakur20. I'm definitely more interested in doc2query, as it performs much better in-domain than DeepCT, so even that additional data point in the benchmark would be really useful.
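(For anyone following along: doc2query expands each document by appending synthetic queries generated with a T5 model before indexing, so BM25 still does the retrieval. A rough sketch using the public castorini checkpoint; the generation settings below are illustrative, not the exact ones used for any of these experiments.)

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Public doc2query-T5 checkpoint trained on MS MARCO passages.
tokenizer = T5Tokenizer.from_pretrained("castorini/doc2query-t5-base-msmarco")
model = T5ForConditionalGeneration.from_pretrained("castorini/doc2query-t5-base-msmarco")

def expand(doc_text, num_queries=3):
    """Append sampled synthetic queries to the document text before indexing."""
    inputs = tokenizer(doc_text, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(
        inputs.input_ids,
        max_length=64,
        do_sample=True,       # top-k sampling, as described for docTTTTTquery
        top_k=10,
        num_return_sequences=num_queries,
    )
    queries = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
    return doc_text + " " + " ".join(queries)

print(expand("The Manhattan Project produced the first nuclear weapons during World War II."))
```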
Hey @NThakur20, a colleague pointed me to a new paper that might also be interesting that kind of fits into the sparse lexical retrieval category. Looks like they have model checkpoints already but the indexing and retrieval is using custom indices (as far as I can tell).
COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List https://github.com/luyug/COIL
Hi @joshdevins,
Thanks for mentioning the resource. The paper you mentioned was presented at NAACL, and I had a long chat with the authors about it. We are in talks to integrate COIL into the BEIR repository.
Kind Regards, Nandan
Hi @joshdevins,
I've added code to evaluate docT5query with the BEIR benchmark. You can find sample code here to run and evaluate it using Pyserini-BM25 - (link).
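In outline, it follows the usual BEIR pattern: download and load a dataset, retrieve, and score with `EvaluateRetrieval`. A minimal sketch, using BEIR's Elasticsearch-backed `BM25Search` as a stand-in for the Pyserini-BM25 path in the linked example:

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.lexical import BM25Search as BM25

# Download and load a BEIR dataset (scifact is small and quick to iterate on).
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# For docT5query, the corpus would first be expanded with generated queries and
# re-indexed; here we just run plain BM25 (requires Elasticsearch on localhost:9200).
retriever = EvaluateRetrieval(BM25(index_name="scifact", hostname="localhost:9200"))
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
```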
Kind Regards, Nandan Thakur
I'm gonna have a look at this soon. Can I add results here for inclusion in the Google Sheets "leaderboard"?
Hi @joshdevins,
Yes, I haven't been able to find time to run docT5query on all BEIR datasets; as you mentioned, it takes time. Feel free to share your docT5query results on the BEIR datasets here, and I would be happy to add them to the Google Sheets leaderboard.
Also, I finally have a working example up for DeepCT. The original DeepCT repository is quite old and only works with TensorFlow 1.x, so I had to modify it for the latest TensorFlow versions. You can find a sneak peek here: https://github.com/UKPLab/beir/blob/development/examples/retrieval/evaluation/sparse/evaluate_deepct.py
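The idea, roughly: DeepCT uses BERT to predict an importance weight for each term in a passage, and those weights are scaled and rounded into integer pseudo term frequencies so a standard inverted index and BM25 can consume them. A sketch of that quantisation step (hypothetical helper names, not the code from the linked example):

```python
# Hypothetical helpers (not the code in evaluate_deepct.py): turn DeepCT's
# predicted per-term weights into integer pseudo term frequencies, then rewrite
# the document by repeating each term so a normal BM25 index picks up the weights.
def deepct_term_frequencies(term_weights, scale=100):
    """Scale and round predicted weights (roughly in [0, 1]) to integer term frequencies."""
    tfs = {term: round(weight * scale) for term, weight in term_weights.items()}
    return {term: tf for term, tf in tfs.items() if tf > 0}

def rewrite_document(term_weights):
    """Repeat each term tf times; the rewritten text is what gets indexed."""
    tfs = deepct_term_frequencies(term_weights)
    return " ".join(term for term, tf in tfs.items() for _ in range(tf))

# Example: terms predicted as important get repeated, low-weight terms drop out.
print(rewrite_document({"solar": 0.08, "panels": 0.05, "the": 0.001}))
```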
I will merge it into the main branch soon! I enjoyed your debate at Berlin Buzzwords 2021.
Kind Regards, Nandan
I'm running most of the docT5query examples now, but I won't have access to a couple datasets since they require accepting dataset use agreements. I'll post results for what I can run here.
Sounds good, Thanks @joshdevins! I look forward to the results 😊
Kind Regards, Nandan
Results for doc2query-T5 are as follows. As mentioned above, some datasets have been excluded because we don't have access to them due to usage restrictions. `baseline` is defined here as the Anserini BM25 score.
| dataset | baseline | score | +/- |
|---|---|---|---|
| msmarco | 0.228 | 0.5064 | 🔼 |
| fever | 0.753 | 0.6926 | 🔽 |
| climate-fever | 0.213 | 0.1772 | 🔽 |
| hotpotqa | 0.603 | 0.5441 | 🔽 |
| dbpedia-entity | 0.313 | 0.3012 | 🔽 |
| nq | 0.329 | 0.3412 | 🔼 |
| webis-touche2020 | 0.614 | 0.5246 | 🔽 |
| trec-covid | 0.656 | 0.6609 | 🔼 |
| quora | 0.742 | 0.7821 | 🔼 |
| cqadupstack | 0.316 | 0.2937 | 🔽 |
| fiqa | 0.236 | 0.2433 | 🔼 |
| scidocs | 0.158 | 0.1558 | 🔽 |
| scifact | 0.665 | 0.6599 | 🔽 |
I don't understand the `msmarco` result here though. Something seems to be off on that one. I have the `pyserini.jsonl` in case you want to have a look; I don't see anything obviously wrong with the index.
Some thoughts:

- `title` fields in the original `corpus.json`, although this shouldn't really affect the score like that.
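(For reference, each line of the `pyserini.jsonl` is a Pyserini `JsonCollection` document with an `id` and a `contents` field. Below is roughly how a BEIR corpus entry maps onto that format; this is a sketch of the general conversion, not necessarily the exact title/text handling in my file.)

```python
import json

# Rough sketch: map a BEIR-style corpus ({doc_id: {"title": ..., "text": ...}},
# as loaded by GenericDataLoader) onto Pyserini's JsonCollection format.
# Concatenating title and text into "contents" is an assumption here.
corpus = {
    "d1": {"title": "Example title", "text": "Example passage text."},
}

with open("pyserini.jsonl", "w") as f:
    for doc_id, doc in corpus.items():
        contents = (doc.get("title", "") + " " + doc.get("text", "")).strip()
        f.write(json.dumps({"id": doc_id, "contents": contents}) + "\n")
```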
First, a thank you: the paper and repo have been fantastic resources for conversations around out-of-domain retrieval!
Second, a feature request: I think it would be very interesting to see some of the document/index enrichment approaches added to the benchmark and paper discussion, as extensions to sparse lexical retrieval. You mention both doc2query and DeepCT/HDCT in the paper but don't provide benchmark data for them. Since they are trained on MS MARCO, it would be interesting to see whether they perform well out-of-domain and in comparison to both BM25+CE and ColBERT, which perform very well out-of-domain.