calpt opened this issue 9 months ago
Interesting, this is a really great analysis! We also noticed this and have been working on an update to the benchmark (LoCoV1). We haven't put it out yet, but we will soon (and will add this as a great baseline).
CC @jonsaadfalcon
Thank you for sharing @calpt! If you have an evaluation script for BM25 available, I'd love to take a look and try it out on our new evaluation datasets.
+1, would love to see the script @calpt! The scores are a good bit higher than when we ran BM25 internally, so we'd love to see if we did something wrong!
Sure, I basically just took your loco_eval.py script, removed everything but the data loading, plugged in the BM25 implementation & eval from BEIR (roughly like this: https://gist.github.com/calpt/56d0d47724a061c4a7bd4a9a8fd990d2), and spun up a local ES Docker container (https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html).
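For reference, the BEIR part boils down to something like the sketch below; the gist has the actual code, and `load_loco_task` here is just a hypothetical stand-in for the data loading kept from loco_eval.py.

```python
# Minimal sketch of the BM25-via-BEIR evaluation described above.
# Assumes an Elasticsearch instance is reachable on localhost (e.g. from the
# official Docker image). `load_loco_task` is a hypothetical stand-in for the
# data loading kept from loco_eval.py.
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.lexical import BM25Search

# BEIR expects:
#   corpus:  {doc_id: {"title": "", "text": document_text}}
#   queries: {query_id: query_text}
#   qrels:   {query_id: {doc_id: relevance}}
corpus, queries, qrels = load_loco_task("qmsum")  # hypothetical loader

bm25 = BM25Search(index_name="loco-qmsum", hostname="localhost", initialize=True)
retriever = EvaluateRetrieval(bm25)

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg["NDCG@10"])
```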
Looking forward to LoCo v1!
Great, we’ll take a look! CC @jonsaadfalcon
Hey @DanFu09, would love to know if you have an update on this!
Our team at Cohere will likely report on an adjusted version of QMSum (what @calpt described above).
Hi Elliott, thanks for the interest!
We have an updated LoCoV1 described in the arXiv paper (https://arxiv.org/abs/2402.07440v2) and will have it on HF with updated checkpoints soon (we ran into ICLR rebuttals before we got a chance to clean it up for upload).
If you DM/email me and Jon we can try to share access to the private HF dataset?
Hello @DanFu09! I found this benchmark quite exciting and was wondering if you got the chance to upload the newer version to HuggingFace.
@iNeil77 here you go, Jon's tweet and blog has links: https://x.com/JonSaadFalcon/status/1792623213698232808
Hey, thanks for sharing this very interesting work!
I was interested in the recent LoCo benchmark for long-context retrieval, and to put the scores in the blog post into context, I found it useful to first get results for a very simple lexical baseline. As this was not yet done in the blog post, I ran BM25 (via Elasticsearch) on all benchmark tasks based on your eval script. Full results, in comparison to the best-performing M2-BERT-32768 (80M), are below (NDCG@10 for all).
BM25
BM25 seems to be very competitive on LoCo, coming close to the best model tested in the post's evaluation and outperforming all other tested embedding models. Thus, lexical overlap between queries and correct documents seems to be very high on the benchmark tasks.
QMSum Analysis
Looking a bit closer at the results, we can see that for 4 of 5 tasks NDCG@10 is well above 90, meaning BM25 retrieves the correct documents nearly perfectly. The only exception is QMSum, so I looked into its data a bit closer:
Originally, QMSum is a summarization dataset consisting of three components: a corpus of 232 long meeting transcripts, a set of 272 questions, and 272 query-based summaries of the transcripts. In the tau/scrolls format, a query and its transcript are joined together in the "input" field, whereas the summary is given in the "output" field. This gives 272 input-output pairs. LoCo now simply uses "output" as the query and "input" as the document, giving 272 queries and 272 documents.
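Concretely, the construction looks roughly like the sketch below. The split choice (SCROLLS validation, which has 272 examples) and the sequential IDs are my assumptions; the actual LoCo loading code may differ in these details.

```python
# Rough sketch of how the LoCo QMSum queries/documents relate to tau/scrolls.
# Assumptions: the SCROLLS validation split (272 examples) and simple
# sequential IDs; the actual LoCo loading code may differ in these details.
from datasets import load_dataset

qmsum = load_dataset("tau/scrolls", "qmsum", split="validation")

queries = {}    # query_id -> "output" (the query-based summary)
documents = {}  # doc_id   -> "input"  (question + full meeting transcript)
qrels = {}      # query_id -> {doc_id: 1}, i.e. a one-to-one pairing

for i, example in enumerate(qmsum):
    qid, did = f"query_{i}", f"doc_{i}"
    queries[qid] = example["output"]
    documents[did] = example["input"]
    qrels[qid] = {did: 1}
```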
This means that in the LoCo document corpus for QMSum, multiple documents are based on the same long meeting transcript, each paired with a different question. E.g., the first 4 documents are:
The truncated part is identical in all four, meaning that the overwhelming majority of each document (9,748 words on average) is identical apart from the question stated in the first few words. Only those first few words are therefore relevant for distinguishing between the documents within such a group.
As an ablation, I removed the question at the start of each document, "merged" the resulting identical documents into one, and then ran BM25 again. This improves NDCG@10 to 78.7.
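The ablation is roughly the sketch below (continuing from the loading sketch above). It assumes the question and the transcript inside "input" are separated by the first blank line; that split heuristic is my assumption and may need adjusting.

```python
# Sketch of the ablation: drop the question prefix from each QMSum document
# and collapse documents that become identical after that, remapping qrels.
# Assumption: question and transcript in "input" are separated by the first
# blank line; the exact separator may differ.

def strip_question(document: str) -> str:
    """Return the transcript part of a QMSum "input", dropping the question prefix."""
    _question, sep, transcript = document.partition("\n\n")
    return transcript if sep else document

def merge_identical_documents(documents: dict, qrels: dict):
    """Merge documents that are identical once their question prefix is removed,
    and remap qrels accordingly (documents/qrels as in the loading sketch above)."""
    merged_docs, remap, seen = {}, {}, {}
    for old_id, text in documents.items():
        transcript = strip_question(text)
        if transcript not in seen:
            seen[transcript] = f"merged_doc_{len(seen)}"
            merged_docs[seen[transcript]] = transcript
        remap[old_id] = seen[transcript]
    merged_qrels = {
        qid: {remap[did]: rel for did, rel in rels.items()}
        for qid, rels in qrels.items()
    }
    return merged_docs, merged_qrels
```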
Just wanted to share these quick insights into the LoCo benchmark, maybe this is useful to someone!