SparksofAGI / MHPP

https://sparksofagi.github.io/MHPP/

🤗 [REQUEST] - nemotron-4-340b-instruct #4

Closed. wasiahmad closed this issue 2 weeks ago

wasiahmad commented 1 month ago

Model introduction

A model created by NVIDIA that is excellent at a range of tasks. Its math and coding capabilities do not seem top-notch compared to the leading models on many benchmarks, but I am still curious how it performs on this diverse test.

Model URL (Optional)

https://build.nvidia.com/nvidia/nemotron-4-340b-instruct

Additional information (Optional)

No response

Decontamination

The technical report (https://arxiv.org/pdf/2406.11704) provides no information on decontamination.

Author

No

Data

No

Security

Integrity

1e0ndavid commented 1 month ago

Hi, I just evaluated the pass@1 results for Nemotron-4-340B Instruct. Here are the detailed results:

| Category | pass@1 |
| --- | --- |
| Total | 0.3 |
| Distraction | 0.35 |
| Redefinition | 0.45 |
| Shortcut | 0.2 |
| Commonsense | 0.25 |
| Cornercase | 0.25 |
| Complex | 0.1 |
| Codesense | 0.5 |
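For reference, the exact MHPP scoring script is not shown in this thread, but pass@1 is conventionally computed with the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021); a minimal sketch, assuming that convention applies here:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled per problem
    c: completions that pass all unit tests
    k: the k in pass@k (k=1 for the numbers above)
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # 1 - P(all k drawn samples fail)
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

With greedy decoding (one sample per problem), this reduces to the fraction of problems whose single completion passes.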

For context: according to the HumanEval results in Table 5 of NVIDIA's technical report, its coding ability is similar to that of Claude 3 Sonnet and slightly better than that of Mistral Large.

Please also check our leaderboard for a more intuitive comparison; I just updated it.

wasiahmad commented 1 month ago

Since you evaluated an instruct model, I just wanted to confirm: are these numbers from a 0-shot setup? If not, how did you prompt the model for generation?

1e0ndavid commented 1 month ago

Yes, similar to HumanEval, we always use a 0-shot setup. For the prompt, please check the MHPP.jsonl file in the data directory. We use the 'prompt' field within it to query models.

"prompt": "Write a Python function according to the function name and the problem description in the docstring below. \n\ndef table_tennis_results(marks: str) -> int:\n \"\"\"Adham Sharara was elected as the sixth President of the International Table Tennis Federation(ITTF) in 1999.\n Under his leadership,......"