Paper Writing: An overview issue

KennethEnevoldsen commented 5 months ago

This issue is an overview issue for paper writing. For full discussion of what needs to be done check out #784. The intention for this issue is to make it easier for contributors to find places to write on as well as for us to guide them in the right direction and keep an overview.

How to discuss these segments:

Keep discussion about specific segments to the overleaf (but feel free to ping us on github)
if you want to discuss a section related to one of the point open a issue and link it here

Writing Sections:

[ ] [on hold] Abstract: In need of a polish, but waiting for results
[x] #1004
[x] #1005
[ ] Methods
- [x] #1006
- [ ] [writing on hold] Constructing the benchmark: requires #837
- [x] #1007
- probably expanded by #836 and #835
[ ] [missing] Results (see #784 for overview)
- [ ] [missing] Models considered (see #879)
[ ] [missing] Discussion
[ ] [missing] Limitations
[ ] [need to be expanded] ethical consideration
[ ] [missing] acknowledgment
[ ] Appendix
- [x] #1009
- [x] #1008
- [ ] [work needed] B3 New dataset: should be moved into task construction, stilling missing notable datasets
- [x] #897
- [ ] WikipediaRetrieval (@rasdani can I ask you to take this one) (#718)
- [ ] C Speeding up the benchmark (full details)
- [ ] [missing] E Full results table: might be worth not having it, but refer to a (anonymized) repository
- [x] F: New metrics (#895)

Other concerns

[ ] Ensure that MMTEB is consistently named across
[ ] Overview figure: @x-tabdeveloping will upload the latest version to github + overleaf
[ ] Do we need something on cross-lingual tasks?

gentaiscool commented 5 months ago

Hi @KennethEnevoldsen, thanks for the effort in organizing the paper overview. I'd like to assist in completing the related work section by incorporating recent papers to enhance its relevance. I agree that we need paraphrasing the initial segment and adding more distinct aspects to set our work apart from existing research. Additionally, I am aware of several large-scale collaborative projects that could be referenced in our paper to make the related work section more comprehensive. And, I was wondering to know on how we determine contribution points for paper writing. I am happy in general to help writing in any sections if needed.

KennethEnevoldsen commented 5 months ago

Sounds wonderful I would be very happy if had the time to go over those sections. Feel free to ping me once you have done so.

Generally, we add points based on relative effort. Since most contributors have added datasets before, they have approximately encoded a points-to-effort ratio. We have the writer suggest points, and then, of course, we can discuss if it makes sense afterward.

This is of course, not a perfect system (but it is always hard to quantify contributions)

gentaiscool commented 4 months ago

Thank you, @KennethEnevoldsen, for the explanation. I will review the entire paper and focus on the sections where I can contribute, particularly those that don't require waiting for experimental results.

isaac-chung commented 4 months ago

Not sure if we had discussed this before: would any of the language family groupings e.g. in https://github.com/embeddings-benchmark/mteb/issues/366 have a place in the paper? or would that require https://github.com/embeddings-benchmark/mteb/issues/837 to be completed first?

MariyaTikhonova commented 4 months ago

Hi @KennethEnevoldsen, thanks for the effort in organizing the paper overview.

My colleagues and I, we'd like to help you with the paper writing, if our help is appreciated.

1) We'd like to assist in completing the limitations and ethical consideration, if it is still actual.

2) Besides, we could add basic information about the Russian-language datasets we contributed to MTEB, if needed. We could also provide model evaluation we carried out not long ago.

3) On the final stages we could also contribute to the general paper correction (small typos, uniform model naming, etc.)

KennethEnevoldsen commented 4 months ago

@MariyaTikhonova

1) Sounds great

2) Can you go over section B. If you have created dataset for the benchmarks then please add that to B3. You might create a new appendix on Benchmark Creation and describe the curation rationale for the Russian benchmark. For now results are not needed, but might be added in the future.

3) Sounds lovely as well. I would go for 1 and 2 to start with.

gowitheflow-1998 commented 4 months ago

hi @KennethEnevoldsen, let me know if you need me to add information of RAR-b tasks to the paper and anything I can help with the paper writing in general!

KennethEnevoldsen commented 4 months ago

@gowitheflow-1998 can I ask you to add a section in appendix B4?

gowitheflow-1998 commented 4 months ago

@KennethEnevoldsen Sure. Will do today!

mariyahendriksen commented 3 months ago

hi everyone, I am done with the introduction part of the paper. I will start going over the remaining parts sequentially. Please let me know if there is any section/aspect I should pay additional attention to!

mariyahendriksen commented 1 month ago

Hi all,

(cc @KennethEnevoldsen, @isaac-chung, @imenelydiaker)

Now that the paper has been submitted, should we consider posting it on arXiv? ICLR’s double-blind submission policy, similar to other major ML conferences, allows for preprints to be shared on arXiv.

Publishing the paper on arXiv could help with wider dissemination and potentially save us more than four months, which is especially important given how fast-paced the ML field is. Additionally, if reviewers suggest changes during the rebuttal phase, we can always update the arXiv version.

Let me know your thoughts! I’d be happy to assist with the process if we decide to move forward.

isaac-chung commented 1 month ago

I'm onboard with what Mariya suggested. For those who are curious it's under the "dual submission policy" https://iclr.cc/Conferences/2025/CallForPapers . In the double blind reviewing section: "Having papers on arxiv is allowed per the dual submission policy outlined below."

KennethEnevoldsen commented 1 month ago

I completely agree, the hope is to have the leaderboard up and running before we publish the arxiv paper to have the highest possible impact on release. Let me know what you think about re. this?

imenelydiaker commented 1 month ago

I completely agree, the hope is to have the leaderboard up and running before we publish the arxiv paper to have the highest possible impact on release. Let me know what you think about re. this?

I think you can push it to arxiv before the leaderboard is up. I'm not sure we'll integrate screenshots of the leaderboard in the paper anyway, right ? Once the LB is ready, we can push twitter threads and linkedin posts about the paper.

mariyahendriksen commented 1 month ago

I completely agree, the hope is to have the leaderboard up and running before we publish the arxiv paper to have the highest possible impact on release. Let me know what you think about re. this?

I think you can push it to arxiv before the leaderboard is up. I'm not sure we'll integrate screenshots of the leaderboard in the apper anyway, right ? Once the LB is ready, we can push twitter threads and linkedin posts about the paper.

Makes sense to me as well.

Posting the paper on arXiv could take up to a week, given the high submission volume. I’m happy to handle the process of getting the paper arXiv-ready and, once we have everyone’s approval, I can submit it. I recently went through the same process for another paper under review, so it’s still fresh in my mind. That said, if someone else prefers to manage this, I’m equally happy to pass it on!

Let me know what you think!

KennethEnevoldsen commented 1 month ago

Thanks @mariyahendriksen. I think most of the stuff that needs to be done is on my end (eg final author list) - I agree that it would be it would be nice to have it available online as soon as possible.

@Muennighoff wdyt? Should we also include some additional models?

Muennighoff commented 1 month ago

Great points; I think having the leaderboard ready first and also adding a few more models and then doing one social media push upon release would maximize impact. (I think there's a very low risk of getting "scooped" here in case people are worried about that)

@KennethEnevoldsen which models from the ones we discussed should I still run? I think some APIs i.e. voyage, openai etc would be great - I will ask them for credits.

KennethEnevoldsen commented 1 month ago

I def. think the commercial API's voyage, cohere, OpenAI.

I was also thinking about moving this up to the main paper: Screenshot 2024-10-01 at 22 59 26

Potentially with some edits (e.g. add individual points)

Muennighoff commented 1 month ago

Okay will look into running them!

I think the plot is great though maybe it would benefit from

Ordering the legend models in the same way as the lines (not fully the case on the right one I think e.g. light blue is in a different spot)
It looks a bit like it is only the 4 languages depicted, maybe indicating the total number of languages plotted in the caption or adding it to the plot (e.g. a line marking top 10 or top 100 ; I guess adding individual points would also solve this)
Maybe making the vertical language lines dashed instead? (Since it is a moving average whatever appears before the vertical lines still impacts the borda afaik now, but the solid lines make it look a bit like a hard reset I think)

Muennighoff commented 3 weeks ago

If someone has bandwidth to estimate the amount of credits from OpenAI we'd need, that'd be super useful. I think they're willing to sponsor, we just need to provide an estimate!

KennethEnevoldsen commented 2 weeks ago

@Muennighoff something like this might work:

benchmarks = mteb.get_benchmarks()

total_characters = 0

for benchmark in benchmarks:
    n_characters = 0
    for task in benchmark.tasks:
        try:
            desc_stats = task.metadata.descriptive_stats

            for split in desc_stats["n_samples"]:
                n_samples = desc_stats["n_samples"][split]
                avg_char_leng = desc_stats["avg_character_length"][split]

                if task.metadata.type == "Retrieval":
                    n_characters = (
                        avg_char_leng["average_document_length"]
                        * avg_char_leng["num_documents"]
                        + avg_char_leng["average_query_length"]
                        * avg_char_leng["num_queries"]
                    )
                else:
                    n_characters += n_samples * avg_char_leng
        except Exception as e:
            print(f"Missing/incomplete descriptive stats for {task.metadata.name}: {e}")

    print(f"{benchmark.name}: {n_characters:,} characters")

    total_characters += n_characters

print(f"Total characters: {total_characters:,}")

Sadly we have a lot of incomplete descriptive_stats so currently the numbers are probably quite far off

Muennighoff commented 2 weeks ago

Great I got 3701778834.0939293 characters from that! Should correspond to ~925444708.5234823 tokens (divided by 4) so around 1B tokens (though maybe more like 10B as some are missing) - Maybe useful to put the final character/token count or other inference stats in the paper 🤔

Muennighoff commented 1 week ago

I added text-embedding-3-small results here: https://github.com/embeddings-benchmark/results/pull/40 I will run text-embedding-3-large now, but would be interesting to already check if the results make sense and how it ranks vs the other models on MMTEB

KennethEnevoldsen commented 1 week ago

Will look at getting it merged in then we can look at it on the new leaderboard

KennethEnevoldsen commented 5 days ago

Closing this in favor of #1405

embeddings-benchmark / mteb

Paper Writing: An overview issue #896

Writing Sections:

Other concerns