Request to evaluate the new O1 models by OpenAI (O1-preview and O1-mini)

Thank you for your suggestion! We have not yet evaluated o1-preview on the full benchmark, for three reasons: access to the model is tightly restricted, its internal reasoning tokens make it more expensive to run, and it does not currently support multimodal input. However, we have tested it on a subset (https://x.com/Z_Huang_02/status/1834634575345270898), and those results still support some qualitative conclusions.

At the same time, because o1-preview runs an additional internal reasoning process before answering, whether it is fair to compare it directly with other models is still debatable.

GAIR-NLP / OlympicArena

Request to evaluate the new O1 models by OpenAI (O1-preview and O1-mini) #4