OpenGVLab / InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o performance.
https://internvl.readthedocs.io/en/latest/
MIT License
5.54k stars, 432 forks

Why is the JSON structured output of InternVL2-40B worse than that of InternVL2-26B? #591

Open gehong-coder opened 2 days ago

gehong-coder commented 2 days ago

Checklist

Describe the bug

When I use the InternVL2-40B model, the output keeps containing Chinese-style characters (note the curly quotation marks in place of ASCII quotes), as in the following:

{ "description": "The video features a woman performing a series of sit-ups on a black yoga mat in a minimalist room. She is wearing a white tank top and blue leggings, with her hair neatly tied back. The room has a white wall, a potted plant on the left, and a wooden bench with a green yoga mat and a basket on the right. The woman starts by lying on her back with her arms extended, then gradually lifts her upper body off the mat, engaging her core muscles. The video includes a countdown timer in the upper right corner, starting from 10 and decreasing by one number with each repetition.", 'camera_motion': 'static', ‘content_category’: ‘human actions’, ’VFX’: "" }

With the 26B model, the output is almost always complete, well-formed JSON. Why is there such a difference?
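The sample above mixes ASCII double quotes, ASCII single quotes, and curly quotes, so `json.loads` rejects it. As a stopgap while the root cause is unknown, a minimal sketch of a lenient parser might look like the following. The names `QUOTE_MAP` and `parse_model_json` are hypothetical and not part of InternVL; the approach assumes the reply is a flat object of string values and that no single-quoted value contains an apostrophe.

```python
import ast
import json

# Map typographic quotes to their ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def parse_model_json(raw: str) -> dict:
    """Parse a model reply, trying strict JSON first and a lenient fallback second."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: normalize curly quotes, then parse as a Python literal,
        # which accepts both single- and double-quoted strings.
        normalized = raw.translate(QUOTE_MAP).strip()
        parsed = ast.literal_eval(normalized)
    if not isinstance(parsed, dict):
        raise ValueError("model reply is not a JSON object")
    return parsed
```

This only repairs the output after the fact. It does not handle JSON literals such as `true` or `null` in the fallback path, and it does not explain why the 40B model emits the mixed quotes in the first place.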

Reproduction

I used the official script with the following generation parameters: generation_config = dict( max_new_tokens=1024, do_sample=True, temperature=0.75, min_length=15, no_repeat_ngram_size=3, top_p=0.7 )
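One thing that may be worth checking, offered as an unverified hypothesis rather than a diagnosis: in Hugging Face `generate`, `no_repeat_ngram_size=3` prevents any 3-token sequence from appearing twice in the output, and a JSON object naturally repeats punctuation sequences such as `": "` for every key. If the tokenizer splits those into three or more tokens, the model could be pushed toward alternative quote characters. The sketch below counts repeated n-grams in a token list; `repeated_ngrams` and the hand-split `toy_tokens` are hypothetical, and real token boundaries will differ between the 26B and 40B tokenizers.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple


def repeated_ngrams(tokens: Sequence[Hashable], n: int = 3) -> Dict[Tuple, int]:
    """Return every n-gram that occurs more than once, mapped to its count."""
    counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return {gram: count for gram, count in counts.items() if count > 1}


# Toy, hand-split tokens for a two-key template (an assumption for
# illustration only; pass real token ids from the model's tokenizer instead).
toy_tokens = [
    "{", '"', "description", '"', ":", '"', '"', ",",
    '"', "camera_motion", '"', ":", '"', '"', "}",
]

for gram, count in repeated_ngrams(toy_tokens, n=3).items():
    print(gram, count)
```

Running the same check on the token ids of a well-formed reply from each model, or rerunning generation with `no_repeat_ngram_size` left unset, would show whether this constraint is involved.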

Environment

Both models were run in the same environment.

Error traceback

No response

gehong-coder commented 2 days ago

My prompt is: Describe the video, including a comprehensive description (The description should enable the AI to accurately recreate the video), camera motion (e.g., pan left, zoom in, tilt up, tracking shots), content category (e.g., nature scenery, human actions, animals actions), and any visual effects (VFX), if the video contains effects, describe them briefly in one sentence, else keep the VFX field as an empty string. Output the response in the following JSON format: { "description": "", "camera_motion": "", "content_category": "", "VFX": "" }
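To compare the two models on more than a handful of samples, it may help to check each reply against the template in the prompt above. The following is a minimal sketch, assuming the four keys from that template and string values for all of them; `EXPECTED_KEYS` and `validate_response` are hypothetical names.

```python
import json
from typing import List

# Keys taken from the JSON template in the prompt.
EXPECTED_KEYS = ("description", "camera_motion", "content_category", "VFX")


def validate_response(raw: str) -> List[str]:
    """Return a list of problems; an empty list means the reply matches the template."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg} at position {exc.pos}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for key in EXPECTED_KEYS:
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], str):
            problems.append(f"value for {key} is not a string")
    extra = sorted(set(data) - set(EXPECTED_KEYS))
    if extra:
        problems.append(f"unexpected keys: {extra}")
    return problems
```

Counting non-empty results over the same set of videos for both models would turn "usually well-formed" into a measurable failure rate.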