embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

More code search benchmarks #614

Open jordane95 opened 4 months ago

jordane95 commented 4 months ago

Hi,

I find that many new code retrieval benchmarks are not included in this repo; only CodeSearchNet, which dates back to 2019, was added recently. Recent work on code retrieval often uses more challenging and extensive evaluation benchmarks with queries mined from real-world user problems. Is there a plan to add the newer benchmarks currently used by the code search community?

Ref:

  1. https://arxiv.org/abs/2403.16702
  2. https://arxiv.org/abs/2201.10866

KennethEnevoldsen commented 4 months ago

@jordane95 We are currently working on extending the benchmark to a broad set of languages (see mmteb), though we have had domain-specific submissions (including code). We highly encourage these, but they are not the primary focus at the moment. If you have the time, we would love to review related PRs and answer any implementation questions you might have. Otherwise, I have marked this issue as "good first issue" to encourage new contributors to pick it up.