Open Belzedar94 opened 2 months ago
Thank you so much for your patience, and I apologize for the delayed response. We’ve been working hard to enhance our dataset, and the full evaluation set now includes 210 problems to reduce variance.
The results for this comprehensive version have been updated on our leaderboard. The pass@1 scores for o1-preview and o1-mini, obtained with greedy decoding, are 68.1% and 66.1%, respectively.
Thank you again for bearing with us!
Awesome results and awesome work guys!! Would love to see the performance of the new Sonnet and Haiku 3.5 models if that's possible.
Thanks a lot and keep rocking 🦾
We recently updated the results for the Claude 3.5 Sonnet and Haiku 1022 models. Due to budget constraints, we are unable to conduct pass@N experiments. The pass@1 scores for the 3.5 Sonnet 1022 and Haiku 1022 models are 48.1% and 40.5%, respectively.
While these scores are slightly below those of GPT-4o (51.4%) and GPT-4o mini (42.4%), Claude remains the strongest coding model outside of OpenAI's lineup.
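For anyone unfamiliar with the metrics being discussed: pass@1 with greedy decoding is simply the fraction of problems solved with a single deterministic sample, while pass@N requires sampling multiple completions per problem (hence the budget concern). A minimal sketch, using the standard unbiased pass@k estimator from the Codex paper (the count of 143 solved problems below is purely illustrative, chosen to match the 68.1% figure on a 210-problem set):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n -- total completions sampled per problem
    c -- completions that pass the tests
    k -- evaluation budget
    """
    if n - c < k:
        return 1.0  # not enough failures to fill k draws, so at least one passes
    return 1.0 - comb(n - c, k) / comb(n, k)

# With greedy decoding there is exactly one sample per problem (n = 1),
# so pass@1 reduces to the plain fraction of problems solved:
problems_solved = 143   # hypothetical count out of 210
total_problems = 210
print(round(problems_solved / total_problems, 3))
```

With n = 1 the estimator degenerates to 0 or 1 per problem, which is why greedy pass@1 needs no repeated sampling and stays cheap compared to pass@N runs.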
Model introduction
New models by OpenAI with improved reasoning capabilities.
Model URL (Optional)
No response
Additional information (Optional)
No response
Decontamination
No info
Author
No
Data
No
Security
Integrity