I am confused about the evaluation process. In this file, ChatGPT-turbo-0327 appears to be the default setting. However, in this script, a different version is specified.
Since the evaluation results differ significantly between versions (by about 10%), I would like to know which version was ultimately used. I tried to reproduce the results by running inference alone, and my scores matched your reported results only when using ChatGPT-turbo-0327, not ChatGPT-turbo-0613.
Could you please clarify which version was used for the final evaluation?
Thank you for your excellent work!