dvlab-research / MR-GSM8K

Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs
MIT License

the new version #2

Open Shijie-Xia opened 7 months ago

Shijie-Xia commented 7 months ago

Hi! When will the dataset become stable? My paper, submitted to an upcoming AI conference, used the dataset. However, today I learned that it has been renamed and cleaned. Could you also update the results of the GPT-4 evaluation? I'm too lazy to test it myself. LOL

Randolph-zeng commented 7 months ago

Hi, thanks for your interest in the dataset! The new version mainly introduces a customized metric called MR-Score, which unifies the three metrics across the three sub-tasks (solution correctness, error step, and error reason). I have also cleaned up the evaluation results a little. The update should be reflected on arXiv soon. However, there have not been many changes to the dataset itself beyond the renaming (which was done for consistency with future expansion to more difficult datasets). Therefore, please rest assured that there won't be any major updates to the dataset (such as renaming or large-scale cleaning) in the near future.
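To illustrate the idea of unifying the three sub-task metrics into one number, here is a minimal sketch of a weighted combination. The weights below are hypothetical placeholders for illustration only, not the actual coefficients used in MR-Score; see the paper for the real definition.

```python
def mr_score(correctness, error_step_acc, error_reason_acc,
             weights=(0.2, 0.3, 0.5)):
    """Combine three sub-task metrics (each in [0, 1]) into one score.

    NOTE: the default weights are illustrative placeholders, not the
    coefficients from the MR-GSM8K paper.
    """
    w1, w2, w3 = weights
    return w1 * correctness + w2 * error_step_acc + w3 * error_reason_acc

# Example: a model strong on solution correctness but weaker at
# locating and explaining errors gets a middling combined score.
print(mr_score(0.9, 0.5, 0.4))  # roughly 0.53 under these placeholder weights
```

A single combined score like this makes leaderboard-style comparison simpler than reporting three separate numbers, at the cost of baking in a choice of weights.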

I will leave this issue open and update the README about this at the same time, in case there is any potential confusion or concern. Thanks!

Shijie-Xia commented 7 months ago

Thank you for your patience! I am confident that the MR-GSM8K will play a crucial role in advancing AGI. I have a question regarding the results mentioned in the README file: does GPT3.5 refer to gpt-3.5-turbo-1106, and GPT4 refer to gpt-4-1106-preview? I want to ensure accurate version descriptions when using your data. See the OpenAI website.

Randolph-zeng commented 7 months ago

Hi Shijie, thanks for your kind words! Regarding your question, the model versions are outlined in Section 4.1 of the paper. The APIs I used are GPT-3.5-turbo-0613, Claude 2.0, and GPT-4-0613, since the experiment was conducted before November and turbo-1106 was not available at that time. However, the auto eval does use the latest turbo-1106 version.

By the way, the paper has already been updated on arXiv in case you'd like to check it out; the auto eval is discussed more extensively in Appendix B : ) https://arxiv.org/pdf/2312.17080.pdf

Shijie-Xia commented 5 months ago

Thank you for your reminder! I want to cite your paper and use the BibTeX format you provided, but I noticed that it doesn't show the arXiv identifier in the reference. Perhaps this is because you uploaded two versions and changed the name? Regardless, I used the reference provided in your GitHub repository.

Randolph-zeng commented 5 months ago

Hi Shijie: Thank you for your kind feedback! It is really kind of you to raise this issue with us. I double-checked my BibTeX and consulted GPT-4 about the renaming issue; it seems that as long as the eprint field is correctly set (e.g. arXiv:2312.17080), you should be able to reference the paper just fine. However, it may depend on the citation style you are using. I just updated the BibTeX in the README, and it seems to work fine under the ACL template. Would you mind taking a second look to see if the latest update works for you? Thanks a lot again for your kind support, and have a nice day!
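For reference, the general shape of an arXiv entry with the eprint field set looks like the sketch below. The entry key and elided fields are placeholders, not the actual BibTeX from the repository README; the eprint number is the one linked above.

```bibtex
% Minimal sketch of an arXiv-style @misc entry.
% Key and author list are placeholders; copy the real entry from the README.
@misc{mrgsm8k,
  title         = {Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs},
  author        = {...},
  year          = {2023},
  eprint        = {2312.17080},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
```

Whether the arXiv identifier actually appears in the rendered reference depends on the bibliography style: some styles print the eprint/archivePrefix fields, while others silently drop them.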

Shijie-Xia commented 5 months ago

Wow, thanks for your timely feedback. The new BibTeX works with my template! I will update it.

Shijie-Xia commented 5 months ago

Oh! I've noticed one thing. Did you miss an author in the new version? There are four authors for the new version but five for the previous one. I've added the missing author directly because I must submit the preprint version of my paper before 2:00 AM Beijing time for it to be published today!

Shijie-Xia commented 5 months ago
(Screenshot 2024-04-09 004802)

This is the version in my paper. I have checked all the references twice to ensure there are no mistakes. I hope this is correct for you.

Randolph-zeng commented 5 months ago

Yes, this looks perfect to me! Wishing you the best for your paper, and a smooth review at whichever conference you submit to. Good luck, and thanks for your feedback : )