ASSERT-KTH / repairbench


elo for repair #3

Open andre15silva opened 3 days ago

andre15silva commented 3 days ago

I believe so (not an expert on Elo systems, though).

We would need:

  1. A couple of magic numbers (the starting Elo, plus scaling parameters for how much each match influences the ratings and for computing the expected outcome)
  2. Define a winning criterion (e.g., given 10 non-deterministic answers per model, the one that generates more correct patches wins)
  3. Simulate matches until convergence
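
For reference, the magic numbers in step 1 map onto the standard Elo formulas as follows. The symbols are my own labels: $R_A$ and $R_B$ are the current ratings, $S_A$ is the match result for model A (1 win, 0.5 draw, 0 loss), $K$ scales the influence of each match, and 400 is the conventional scale for the expected outcome. Both $K$ and the scale are choices we would have to make, not values fixed in this thread.

$$
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad R_A' = R_A + K\,(S_A - E_A)
$$

Model B gets the mirrored update, $R_B' = R_B - K\,(S_A - E_A)$, so the total rating is conserved.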

When a new LLM is added, we run the simulation again (potentially on top of the existing ratings).
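
A rough sketch of what this could look like, just to make the idea concrete. Everything here is an assumption on my side: the function names, the `results` layout (`{model: {bug_id: number_of_correct_patches}}`), the defaults (start 1000, K = 16, scale 400), and the fixed number of rounds standing in for a real convergence check. Passing previous ratings via `ratings` is one way to run "on top of the existing results".

```python
import random


def expected_score(r_a, r_b, scale=400.0):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))


def match_score(correct_a, correct_b):
    """Score for A on one bug: 1 win, 0.5 draw, 0 loss (more correct patches wins)."""
    if correct_a > correct_b:
        return 1.0
    if correct_a < correct_b:
        return 0.0
    return 0.5


def simulate(results, ratings=None, start=1000.0, k=16.0, rounds=2000, seed=0):
    """Play random pairwise matches and return the updated ratings.

    results: hypothetical layout {model: {bug_id: n_correct_patches}}
    ratings: optional previous ratings to warm-start from
    """
    rng = random.Random(seed)
    ratings = dict(ratings or {})
    for model in results:
        ratings.setdefault(model, start)  # new models enter at the starting Elo
    models = sorted(results)
    for _ in range(rounds):
        a, b = rng.sample(models, 2)
        shared = sorted(set(results[a]) & set(results[b]))
        if not shared:
            continue  # no common bug to compare on
        bug = rng.choice(shared)
        s_a = match_score(results[a][bug], results[b][bug])
        delta = k * (s_a - expected_score(ratings[a], ratings[b]))
        ratings[a] += delta
        ratings[b] -= delta
    return ratings


if __name__ == "__main__":
    # Made-up numbers, only to show the call shape.
    results = {"model_x": {"bug1": 9, "bug2": 7}, "model_y": {"bug1": 2, "bug2": 1}}
    ratings = simulate(results)
    results["model_z"] = {"bug1": 5, "bug2": 5}  # a new LLM is added
    print(simulate(results, ratings=ratings))
```

Note that Elo depends on match order, so the LMSYS post linked below may be worth checking for how they handle that before settling on a design.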

andre15silva commented 3 days ago

https://lmsys.org/blog/2023-05-03-arena/