scf4 closed this 4 months ago
Thank you @scf4! We just updated the leaderboard at the same time! Claude3 aced the retrieval task!
Check it out at https://evalplus.github.io/repoqa.html
Thanks, although our results are different (enough to overlap with Sonnet). Are you doing multiple runs per model?
Hi, I was doing greedy decoding (temp = 0). It's possible that Claude's API introduces some randomness of its own.
Hey, yeah, the OAI/Anthropic APIs have never been deterministic, even at a temperature of 0!
@ganler Given the margins here could you put a note on the rankings page about this?
I'd suggest averaging scores over multiple runs, but both OAI and Anthropic can return identical responses in succession (not sure if it's caching or something else).
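A minimal sketch of what averaging over runs could look like. Note that `run_model` here is a hypothetical callable standing in for one full eval run; it is not part of RepoQA's actual harness, and the scores are made-up placeholder numbers:

```python
import statistics

def average_over_runs(run_model, n_runs: int = 5):
    """Call a (hypothetical) eval function n_runs times; return mean and spread."""
    scores = [run_model() for _ in range(n_runs)]
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores) if n_runs > 1 else 0.0
    return mean, spread

# Stand-in "model" that returns slightly jittered scores (fabricated for illustration):
from itertools import cycle
fake_scores = cycle([89.0, 91.0, 90.0, 90.5, 89.5])
mean, spread = average_over_runs(lambda: next(fake_scores), n_runs=5)
print(round(mean, 2), round(spread, 2))
```

Reporting mean plus or minus spread would also make it clear when two models' scores overlap, as with Opus vs. Sonnet here.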
opus-results.zip