Why is the JSON structured output of InternVL2-40B worse than that of InternVL2-26B?

Checklist

[ ] 1. I have searched related issues but cannot get the expected help.
[ ] 2. The bug has not been fixed in the latest version.
[ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

The output I get from the InternVL2-40B model frequently contains Chinese (full-width) punctuation characters, such as curly quotes, in place of ASCII quotes. For example: { "description": "The video features a woman performing a series of sit-ups on a black yoga mat in a minimalist room. She is wearing a white tank top and blue leggings, with her hair neatly tied back. The room has a white wall, a potted plant on the left, and a wooden bench with a green yoga mat and a basket on the right. The woman starts by lying on her back with her arms extended, then gradually lifts her upper body off the mat, engaging her core muscles. The video includes a countdown timer in the upper right corner, starting from 10 and decreasing by one number with each repetition.", 'camera_motion': 'static', ‘content_category’: ‘human actions’, ’VFX’: "" } With the 26B model, by contrast, the output is almost always complete, well-formed JSON. Why is there such a difference?
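To make the failure easier to reproduce and count, here is a minimal sketch of how such outputs could be detected and, as a best-effort workaround, parsed. It is not part of the InternVL codebase: the helper names (`find_non_ascii`, `is_valid_json`, `parse_model_output`) are hypothetical, and the fallback assumes the only problems are typographic or single quotes around keys and values, with no apostrophes inside the strings themselves.

```python
import ast
import json

# Typographic quotes mapped to ASCII equivalents (assumed to be the only
# offending characters; extend the table if other full-width symbols appear).
QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def find_non_ascii(text):
    """Return (index, character) pairs for every non-ASCII character."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


def is_valid_json(text):
    """Return True if the text parses as strict JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def parse_model_output(text):
    """Parse strict JSON first; otherwise normalize quotes and try a Python literal."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    normalized = text.translate(QUOTE_MAP)
    try:
        result = ast.literal_eval(normalized)
    except (ValueError, SyntaxError) as exc:
        raise ValueError("output is neither JSON nor a dict literal") from exc
    if not isinstance(result, dict):
        raise ValueError("output did not parse to a dict")
    return result


if __name__ == "__main__":
    # Shortened stand-in for the model output shown above.
    sample = (
        '{ "description": "A woman does sit-ups.", '
        "'camera_motion': 'static', "
        "\u2018content_category\u2019: \u2018human actions\u2019, "
        '\u2019VFX\u2019: "" }'
    )
    print(is_valid_json(sample))
    print(find_non_ascii(sample))
    print(parse_model_output(sample))
```

Running `is_valid_json` over a batch of responses from each model would give a concrete failure rate to compare between 40B and 26B. The `ast.literal_eval` fallback is only a workaround for downstream parsing and does not explain why the two models behave differently.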

Reproduction

The generation parameters, taken from the official script, are as follows: generation_config = dict(max_new_tokens=1024, do_sample=True, temperature=0.75, min_length=15, no_repeat_ngram_size=3, top_p=0.7)

Environment

Both models were run in the same environment.

Error traceback

No response

OpenGVLab / InternVL