calpt opened this issue 9 months ago
Interesting, this is a really great analysis! We also noticed this and have been working on an update to the benchmark (LoCoV1). We haven't put it out yet, but we will soon (and will add this as a great baseline).
CC @jonsaadfalcon
Thank you for sharing @calpt! If you have an evaluation script for BM25 available, I'd love to take a look and try it out on our new evaluation datasets.
+1, would love to see the script @calpt! The scores are a good bit higher than when we ran BM25 internally, so we'd love to see if we did something wrong!
Sure, I basically just took your loco_eval.py script, removed everything but the data loading, plugged in the BM25 implementation & eval from BEIR (roughly like this: https://gist.github.com/calpt/56d0d47724a061c4a7bd4a9a8fd990d2), and spun up a local ES Docker container (https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html).
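For reference, the BEIR part boils down to something like the sketch below; the gist has the actual code, and `load_loco_task` here is just a hypothetical stand-in for the data loading kept from loco_eval.py.

```python
# Minimal sketch of the BM25-via-BEIR evaluation described above.
# Assumes an Elasticsearch instance is reachable on localhost (e.g. from the
# official Docker image). `load_loco_task` is a hypothetical stand-in for the
# data loading kept from loco_eval.py.
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.lexical import BM25Search

# BEIR expects:
#   corpus:  {doc_id: {"title": "", "text": document_text}}
#   queries: {query_id: query_text}
#   qrels:   {query_id: {doc_id: relevance}}
corpus, queries, qrels = load_loco_task("qmsum")  # hypothetical loader

bm25 = BM25Search(index_name="loco-qmsum", hostname="localhost", initialize=True)
retriever = EvaluateRetrieval(bm25)

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg["NDCG@10"])
```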
Looking forward to LoCo v1!
Great, we’ll take a look! CC @jonsaadfalcon
Hey @DanFu09, would love to know if you have an update on this!
Our team at Cohere will likely report on an adjusted version of QMSum (what @calpt described above).
Hi Elliott, thanks for the interest!
We have an updated LoCoV1 described in the arXiv paper (https://arxiv.org/abs/2402.07440v2) and will have it on HF with updated checkpoints soon (we ran into ICLR rebuttals before we got a chance to clean it up for upload).
If you DM/email me and Jon we can try to share access to the private HF dataset?
Hello @DanFu09! I found this benchmark quite exciting and was wondering if you got the chance to upload the newer version to HuggingFace.
@iNeil77 here you go, Jon's tweet and blog has links: https://x.com/JonSaadFalcon/status/1792623213698232808
Hey, thanks for sharing this very interesting work!
I was interested in the recent LoCo benchmark for long-context retrieval, and to put the scores in the blog post into context, I found it useful to first get results for a very simple lexical baseline. As this was not yet done in the blog post, I ran BM25 (via Elasticsearch) on all benchmark tasks based on your eval script. Full results, in comparison to the best-performing M2-BERT-32768 (80M), are below (NDCG@10 for all).
BM25
BM25 seems to be very competitive on LoCo, coming close to the best model tested in the post's evaluation and outperforming all other tested embedding models. Thus, lexical overlap between queries and correct documents seems to be very high on the benchmark tasks.
QMSum Analysis
Looking a bit closer at the results, we can see that for 4 of 5 tasks NDCG@10 is well above 90, meaning BM25 retrieves the correct documents nearly perfectly. The only exception is QMSum, so I looked into its data a bit closer:
Originally, QMSum is a summarization dataset consisting of three components: a corpus of 232 long meeting transcripts, a set of 272 questions, and 272 query-based summaries of the transcripts. In the tau/scrolls format, a query and its transcript are joined together in the "input" field, whereas the summary is given in the "output" field. This gives 272 input-output pairs. LoCo now simply uses "output" as the query and "input" as the document, giving 272 queries and 272 documents.
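Concretely, the construction looks roughly like the sketch below. The split choice (SCROLLS validation, which has 272 examples) and the sequential IDs are my assumptions; the actual LoCo loading code may differ in these details.

```python
# Rough sketch of how the LoCo QMSum queries/documents relate to tau/scrolls.
# Assumptions: the SCROLLS validation split (272 examples) and simple
# sequential IDs; the actual LoCo loading code may differ in these details.
from datasets import load_dataset

qmsum = load_dataset("tau/scrolls", "qmsum", split="validation")

queries = {}    # query_id -> "output" (the query-based summary)
documents = {}  # doc_id   -> "input"  (question + full meeting transcript)
qrels = {}      # query_id -> {doc_id: 1}, i.e. a one-to-one pairing

for i, example in enumerate(qmsum):
    qid, did = f"query_{i}", f"doc_{i}"
    queries[qid] = example["output"]
    documents[did] = example["input"]
    qrels[qid] = {did: 1}
```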
This means that in the LoCo document corpus for QMSum, multiple documents are based on the same long meeting transcript, each paired with a different question. E.g., the first 4 documents are:
The truncated part is identical in all four, meaning that the overwhelming majority of each document (9,748 words on average) is identical apart from the question stated in the first few words. Only those first few words are therefore relevant for distinguishing between the documents within such a group.
As an ablation, I removed the question at the start of each document, "merged" the resulting identical documents into one, and then ran BM25 again. This improves NDCG@10 to 78.7.
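The ablation is roughly the sketch below (continuing from the loading sketch above). It assumes the question and the transcript inside "input" are separated by the first blank line; that split heuristic is my assumption and may need adjusting.

```python
# Sketch of the ablation: drop the question prefix from each QMSum document
# and collapse documents that become identical after that, remapping qrels.
# Assumption: question and transcript in "input" are separated by the first
# blank line; the exact separator may differ.

def strip_question(document: str) -> str:
    """Return the transcript part of a QMSum "input", dropping the question prefix."""
    _question, sep, transcript = document.partition("\n\n")
    return transcript if sep else document

def merge_identical_documents(documents: dict, qrels: dict):
    """Merge documents that are identical once their question prefix is removed,
    and remap qrels accordingly (documents/qrels as in the loading sketch above)."""
    merged_docs, remap, seen = {}, {}, {}
    for old_id, text in documents.items():
        transcript = strip_question(text)
        if transcript not in seen:
            seen[transcript] = f"merged_doc_{len(seen)}"
            merged_docs[seen[transcript]] = transcript
        remap[old_id] = seen[transcript]
    merged_qrels = {
        qid: {remap[did]: rel for did, rel in rels.items()}
        for qid, rels in qrels.items()
    }
    return merged_docs, merged_qrels
```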
Just wanted to share these quick insights into the LoCo benchmark, maybe this is useful to someone!