dvlab-research / LLaMA-VID

Official Implementation for LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Apache License 2.0

Regarding the stage 2 json file #22

Closed liziming5353 closed 6 months ago

liziming5353 commented 6 months ago

For stage 2 fine-tuning, do you use llava_v1_5_mix665k_with_video_chatgpt_maxtime_5min.json by default? If so, why are the videos limited to 5 minutes?

yanwei-li commented 6 months ago

Yes, it is the default file for the Vicuna 13B model (the 7B model uses the version without the time limit). We cap the video length to avoid out-of-memory problems during training when a video is too long.
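For reference, here is a minimal sketch of how such a duration-capped json could be produced, assuming the mixed annotation file is a list of dicts where video samples carry a `video` key holding a path relative to the video folder; the field names and the OpenCV duration probe are illustrative assumptions, not the repository's actual preprocessing:

```python
# Hypothetical sketch: build a duration-capped copy of the mixed fine-tuning json.
# Assumes each entry is a dict; video samples are assumed to have a "video" key
# with a path relative to the video folder. Field names are illustrative only.
import json
import cv2  # used here only to probe video duration

MAX_SECONDS = 5 * 60

def video_seconds(path):
    """Return the duration of a video in seconds (0 if unreadable)."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    cap.release()
    return frames / fps if fps > 0 else 0

def filter_by_duration(in_json, out_json, video_root):
    with open(in_json) as f:
        data = json.load(f)

    kept = []
    for sample in data:
        if "video" not in sample:
            kept.append(sample)  # image/text-only samples pass through unchanged
        elif video_seconds(f"{video_root}/{sample['video']}") <= MAX_SECONDS:
            kept.append(sample)

    with open(out_json, "w") as f:
        json.dump(kept, f)

# Example call (paths are placeholders):
# filter_by_duration("llava_v1_5_mix665k_with_video_chatgpt.json",
#                    "llava_v1_5_mix665k_with_video_chatgpt_maxtime_5min.json",
#                    "data/videos")
```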

liziming5353 commented 6 months ago

Do you have an estimate of how much memory is needed? If I have enough memory, can I drop the 5-minute limit?

yanwei-li commented 6 months ago

Hi, in my experiments Vicuna 7B requires less than 40 GB, and Vicuna 13B requires about 80 GB.
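As a rough way to check headroom on your own hardware, a generic PyTorch sketch (not part of the LLaMA-VID codebase) for reading the peak allocated GPU memory around a training step might look like this:

```python
# Generic sketch for checking peak GPU memory with PyTorch; run one training
# step on a long video sample between the reset and the readout.
import torch

torch.cuda.reset_peak_memory_stats()

# ... run one forward/backward step here ...

peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"Peak allocated GPU memory: {peak_gb:.1f} GB")
```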

liziming5353 commented 6 months ago

Thanks! Also, may I ask what loss values you get in stages 1, 2, and 3? I want to check whether my training is correct. Currently I get a loss of about 2 in stage 1 and about 1 in stage 2.

yanwei-li commented 6 months ago

Hi, the loss for stages 1, 2, and 3 is about 1.9-2.1, 0.8-1.0, and 1.3-1.5, respectively.

liziming5353 commented 6 months ago

Got it!