SparksofAGI / MHPP

https://sparksofagi.github.io/MHPP/

🤗 [REQUEST] - OpenAI new O1 models (O1-preview and O1-mini) #5

Open Belzedar94 opened 2 months ago

Belzedar94 commented 2 months ago

Model introduction

New models by OpenAI with improved reasoning capabilities.

Model URL (Optional)

No response

Additional information (Optional)

No response

Decontamination

No info

Author

No

Data

No

Security

Integrity

1e0ndavid commented 3 weeks ago

Thank you so much for your patience, and I apologize for the delayed response. We’ve been working hard to enhance our dataset, and the full evaluation set now includes 210 problems to reduce variance.

The results for this full version have been updated on our leaderboard. Under a greedy decoding strategy, the pass@1 scores for o1-preview and o1-mini are 68.1% and 66.1%, respectively.

Thank you again for bearing with us!

Belzedar94 commented 3 weeks ago

Awesome results and awesome work guys!! Would love to see the performance of the new Sonnet and Haiku 3.5 models if that's possible.

Thanks a lot and keep rocking 🦾

Cartus commented 2 weeks ago

We recently updated the results for the Claude 3.5 Sonnet and Haiku 1022 models. Due to budget constraints, we are unable to conduct pass@N experiments. The pass@1 scores for the 3.5 Sonnet 1022 and Haiku 1022 models are 48.1% and 40.5%, respectively.

While these scores are slightly below those of GPT-4o (51.4%) and GPT-4o mini (42.4%), Claude remains the best coding model available apart from OpenAI's.
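
For readers unfamiliar with the pass@1 and pass@N metrics mentioned in this thread, here is a minimal sketch of how they are commonly computed, using the standard unbiased pass@k estimator popularized alongside HumanEval. This is not the MHPP evaluation code: the function names `pass_at_k` and `benchmark_pass_at_k` and the toy results are hypothetical and only illustrate the idea. With greedy decoding there is one sample per problem, so pass@1 reduces to the fraction of problems solved.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical toy data: greedy decoding gives one sample per problem,
# so each entry is (1, 1) for solved or (1, 0) for unsolved.
toy_greedy_results = [(1, 1), (1, 0), (1, 1), (1, 1)]
toy_score = benchmark_pass_at_k(toy_greedy_results, k=1)
```

Estimating pass@N for N > 1 requires at least N sampled completions per problem, which is why it costs noticeably more API budget than a single greedy run.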