evalplus / repoqa

RepoQA: Evaluating Long-Context Code Understanding
https://evalplus.github.io/repoqa.html
Apache License 2.0

claude-3-opus-20240229 results #36

Closed — scf4 closed this issue 4 months ago

scf4 commented 4 months ago

opus-results.zip

(Screenshot attached: 2024-04-28 at 20:43:11)
ganler commented 4 months ago

Thank you @scf4! We just updated the leaderboard accordingly! Claude 3 aced the retrieval task!

ganler commented 4 months ago

Check it out at https://evalplus.github.io/repoqa.html

scf4 commented 4 months ago

Thanks, although our results are different (enough to overlap with Sonnet). Are you doing multiple runs per model?

ganler commented 4 months ago

Hi, I was doing greedy decoding (temp = 0). It's possible that the Claude API still introduces some randomness on its end.
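
For reference, a minimal sketch of what a greedy-decoding request looks like with the Anthropic Python SDK (the prompt and max_tokens below are placeholders, not the actual RepoQA harness):

```python
# Minimal sketch: greedy decoding (temperature = 0) against the Anthropic Messages API.
# The prompt below is a placeholder, not the actual RepoQA retrieval prompt.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    temperature=0,  # greedy decoding; still not guaranteed to be fully deterministic
    messages=[{"role": "user", "content": "Find the function described below in the given repository context."}],
)
print(response.content[0].text)
```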

scf4 commented 4 months ago

Hey, yeah, the OAI/Anthropic APIs have never been deterministic, even with a temperature of 0!
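
One quick way to see this is to send the identical temperature-0 request a few times and compare the outputs (a sketch only; the prompt and function name are placeholders):

```python
# Minimal sketch: repeat an identical temperature-0 request and count distinct outputs.
import anthropic

client = anthropic.Anthropic()
PROMPT = "Summarize what the function `parse_config` does."  # placeholder prompt

def ask() -> str:
    resp = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=256,
        temperature=0,
        messages=[{"role": "user", "content": PROMPT}],
    )
    return resp.content[0].text

outputs = [ask() for _ in range(3)]
print("distinct outputs:", len(set(outputs)))  # more than 1 indicates non-determinism
```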

scf4 commented 4 months ago

@ganler Given the margins here could you put a note on the rankings page about this?

I'd suggest averaging scores across multiple runs, but both OAI and Anthropic can return identical responses when calls are made in quick succession (not sure if that's caching or something else).
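
For what it's worth, a sketch of what averaging over runs could look like (assuming a hypothetical `run_evaluation` callable that performs one full scoring pass and returns a single score; this is not a function in the repo):

```python
# Minimal sketch: aggregate a model's score over several independent evaluation runs.
# `run_evaluation` is a hypothetical stand-in for one full RepoQA scoring pass.
from statistics import mean, stdev

def average_over_runs(run_evaluation, n_runs: int = 3) -> dict:
    scores = [run_evaluation() for _ in range(n_runs)]
    return {
        "runs": scores,
        "mean": mean(scores),
        "stdev": stdev(scores) if n_runs > 1 else 0.0,
    }

# Example with a dummy run that returns a fixed score:
# print(average_over_runs(lambda: 0.91, n_runs=3))
```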