Closed wasiahmad closed 2 weeks ago
Hi, I just evaluated the pass@1 results for Nemotron-4-340B Instruct. Here are the detailed results: {'Total': 0.3, 'Distraction': 0.35, 'Redefinition': 0.45, 'Shortcut': 0.2, 'Commonsense': 0.25, 'Cornercase': 0.25, 'Complex': 0.1, 'Codesense': 0.5}
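For context, pass@1 numbers like these are typically computed with the standard unbiased pass@k estimator popularized by the HumanEval setup. A minimal sketch (the function name is illustrative, not taken from the MHPP codebase):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n generated samples, c of which pass the tests."""
    if n - c < k:
        # Fewer failing samples than k: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single sample per task (n=1, k=1), pass@1 reduces to the fraction of tasks solved.
print(pass_at_k(10, 3, 1))  # 0.3 when 3 of 10 samples pass
```

Per-category scores (Distraction, Redefinition, etc.) are then averages of this quantity over the tasks in each category.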
Basically, according to the HumanEval results in Table 5 of NVIDIA's technical report, its coding ability is similar to that of Claude 3 Sonnet and slightly better than that of Mistral Large.
Please also check our leaderboard for a more intuitive comparison; I just updated it.
Since you evaluated an instruct model, I just wanted to confirm: are the numbers from a 0-shot setup? If not, how did you prompt the model for generation?
Yes, similar to HumanEval, we always use a 0-shot setup. For the prompt, please check the MHPP.jsonl file in the data directory. We use the 'prompt' field within it to query models.
"prompt": "Write a Python function according to the function name and the problem description in the docstring below. \n\ndef table_tennis_results(marks: str) -> int:\n \"\"\"Adham Sharara was elected as the sixth President of the International Table Tennis Federation(ITTF) in 1999.\n Under his leadership,......"
Model introduction
These models were created by NVIDIA and are excellent at a range of tasks. Their math and coding capabilities do not seem top-notch compared to leading models on many benchmarks, but I am still curious how they do on this diverse test.
Model URL (Optional)
https://build.nvidia.com/nvidia/nemotron-4-340b-instruct
Additional information (Optional)
No response
Decontamination
In the technical report (https://arxiv.org/pdf/2406.11704), no decontamination information is provided.
Author
No
Data
No
Security
Integrity