evalplus / repoqa

RepoQA: Evaluating Long-Context Code Understanding
https://evalplus.github.io/repoqa.html
Apache License 2.0

claude-3-opus-20240229 results #36

Closed — scf4 closed this issue 4 months ago

scf4 commented 4 months ago

opus-results.zip

(Screenshot attached: 2024-04-28 at 20:43:11)
ganler commented 4 months ago

Thank you @scf4! We just updated the leaderboard accordingly! Claude 3 aced the retrieval task!

ganler commented 4 months ago

Check it out at https://evalplus.github.io/repoqa.html

scf4 commented 4 months ago

Thanks, although our results are different (enough to overlap with Sonnet). Are you doing multiple runs per model?

ganler commented 4 months ago

Hi, I was doing greedy decoding (temp = 0). It's possible that the Claude API still introduces some randomness on its end.
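
For reference, a minimal sketch of what a greedy-decoding request looks like with the Anthropic Python SDK (the prompt and max_tokens below are placeholders, not the actual RepoQA harness):

```python
# Minimal sketch: greedy decoding (temperature = 0) against the Anthropic Messages API.
# The prompt below is a placeholder, not the actual RepoQA retrieval prompt.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    temperature=0,  # greedy decoding; still not guaranteed to be fully deterministic
    messages=[{"role": "user", "content": "Find the function described below in the given repository context."}],
)
print(response.content[0].text)
```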

scf4 commented 4 months ago

Hey, yeah, the OAI/Anthropic APIs have never been deterministic, even with a temperature of 0!
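
One quick way to see this is to send the identical temperature-0 request a few times and compare the outputs (a sketch only; the prompt and function name are placeholders):

```python
# Minimal sketch: repeat an identical temperature-0 request and count distinct outputs.
import anthropic

client = anthropic.Anthropic()
PROMPT = "Summarize what the function `parse_config` does."  # placeholder prompt

def ask() -> str:
    resp = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=256,
        temperature=0,
        messages=[{"role": "user", "content": PROMPT}],
    )
    return resp.content[0].text

outputs = [ask() for _ in range(3)]
print("distinct outputs:", len(set(outputs)))  # more than 1 indicates non-determinism
```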

scf4 commented 4 months ago

@ganler Given the margins here could you put a note on the rankings page about this?

I'd suggest averaging scores across multiple runs, but both OAI and Anthropic can return identical responses when calls are made in quick succession (not sure if that's caching or something else).
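
For what it's worth, a sketch of what averaging over runs could look like (assuming a hypothetical `run_evaluation` callable that performs one full scoring pass and returns a single score; this is not a function in the repo):

```python
# Minimal sketch: aggregate a model's score over several independent evaluation runs.
# `run_evaluation` is a hypothetical stand-in for one full RepoQA scoring pass.
from statistics import mean, stdev

def average_over_runs(run_evaluation, n_runs: int = 3) -> dict:
    scores = [run_evaluation() for _ in range(n_runs)]
    return {
        "runs": scores,
        "mean": mean(scores),
        "stdev": stdev(scores) if n_runs > 1 else 0.0,
    }

# Example with a dummy run that returns a fixed score:
# print(average_over_runs(lambda: 0.91, n_runs=3))
```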