Vision-CAIR / MiniGPT4-video

Official code for Goldfish model for long video understanding and MiniGPT4-video for short video understanding
https://vision-cair.github.io/Goldfish_website/
BSD 3-Clause "New" or "Revised" License

difference on model parameters between training and inference #15

Open Minyoung1005 opened 7 months ago

Minyoung1005 commented 7 months ago

Hi, thanks for your cool work!

I've trained the Llama 2 MiniGPT4-video model on ~500 short videos using the stage 3 fine-tuning scripts, and the training loss converged to almost 0 (< 1e-4).

Training seems stable, and the default temperature appears to be 1.0 for both training and inference. However, whenever I evaluate on the training dataset, I get very low accuracy. I expected close to 100% accuracy, since the model should be overfitted to the training data.

For instance, the answers in my training dataset are only "1" or "0" (as strings), but during inference the model often outputs a long natural-language response instead.

Do you have any idea how to solve this issue? Is there any difference in the model parameters between training and inference that I should take care of?
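As a rough sanity check on that expectation, here is a minimal sketch that converts a loss value into the probability the model assigns to the target tokens. It assumes the reported loss is the mean per-token cross-entropy in nats (the usual convention in PyTorch training loops, but not confirmed for this repo); `target_token_probability` is a hypothetical helper name.

```python
import math


def target_token_probability(mean_loss: float) -> float:
    """Geometric-mean probability of the target tokens.

    Assumption: mean_loss is the average per-token cross-entropy in nats.
    """
    return math.exp(-mean_loss)


# With loss = 1e-4, the target tokens get almost all of the probability mass,
# so even sampling at temperature 1.0 should nearly always pick them,
# provided inference uses the same prompt format and weights as training.
p = target_token_probability(1e-4)
print(f"probability per target token: {p:.6f}")
```

If that assumption holds, temperature alone is unlikely to explain the low accuracy, which points towards a difference in the prompt or the loaded checkpoint.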

Thanks

KerolosAtef commented 7 months ago

Hello @Minyoung1005, thank you for your interest in our work. I have a question about the long output responses: are they correct but descriptive, or are they wrong and hallucinated?

Minyoung1005 commented 7 months ago

Hello @KerolosAtef! The long output responses are mostly descriptive, often stating that the answer is 1 or 0. FYI, I also get lots of empty-string outputs. Here are some examples of undesirable outputs:

""

"The task is solved. The path of the robot changes color from white to green as it moves up and down, repeating the sequence 4 times. Therefore, the last position of the trajectory is red, which indicates that the final image shows a red dot in this location. So, the answer is (1)."

"The task is not solved."

"The final image of the robot trajectory is a line that changes color from white to green as time goes on. The red dot shows the last position of the trajectory, which means it moves horizontally to the right making circles"

"\u0409"

"Based on the provided image and video, I can confidently say that the task was solved by the robot. The robot's path is shown as a line that changes color from white to green as time goes on, indicating successful completion of the requested sequence of rotations (9 degrees left followed by reversal back to -90 degrees). Additionally, the red dot at the end of the trajectory confirms that the last position of the robots movement matches one of rotation fulfillment\nTherefore ,I would express my answeras:1"

KerolosAtef commented 6 months ago

Hello @Minyoung1005, for training, could you try phrasing your answers in natural language, such as "In this video the correct answer is 1", "Based on the video content the answer will be 1", and so on? Also phrase the questions in the same way (use several different questions during training, not only one).
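A minimal sketch of that rewrite step, assuming each training sample is a dict with `video` and `label` keys and that the output uses `q` and `a` fields. Those field names, the question phrasings, and the third answer template are illustrative assumptions, not the repo's confirmed annotation format.

```python
import random

# The first two templates come from the suggestion above; the third is an added variant.
ANSWER_TEMPLATES = [
    "In this video the correct answer is {label}",
    "Based on the video content the answer will be {label}",
    "So, the answer is {label}",
]

# Hypothetical question phrasings; replace them with ones that fit your task.
QUESTION_TEMPLATES = [
    "Did the robot solve the task in this video?",
    "Based on the video, was the task solved?",
    "Watch the trajectory and decide whether the task was completed.",
]


def to_natural_language(samples, seed=0):
    """Wrap bare "0"/"1" labels in varied natural-language questions and answers."""
    rng = random.Random(seed)  # seeded so the generated dataset is reproducible
    converted = []
    for sample in samples:
        label = str(sample["label"])
        if label not in {"0", "1"}:
            raise ValueError(f"unexpected label: {label!r}")
        converted.append({
            "video": sample["video"],
            "q": rng.choice(QUESTION_TEMPLATES),
            "a": rng.choice(ANSWER_TEMPLATES).format(label=label),
        })
    return converted


if __name__ == "__main__":
    demo = [{"video": "clip_0001.mp4", "label": "1"}]
    print(to_natural_language(demo))
```

Every answer template ends with the label, so the 0/1 decision stays easy to read back out of the generated text.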

For inference, you can force the output format in the prompt itself, for example: {your question}, please make the output only one word: "1" for a correct answer and "0" for a wrong one.
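A minimal sketch of the inference side, assuming the format instruction is simply appended to the question and that the reply is post-processed with a regular expression. `build_prompt`, `parse_binary_answer`, and `accuracy` are hypothetical helpers, not part of the MiniGPT4-video code, and the exact wording of the suffix is only one possible phrasing.

```python
import re

# Format instruction appended to every question (wording is an assumption).
FORMAT_SUFFIX = ', please make the output only one word: "1" for a correct answer and "0" for a wrong one.'

# Matches phrases such as "answer is (1)", "answer will be 0" or "answeras:1".
_ANSWER_RE = re.compile(r"answer\s*(?:is|as|will be)?\s*[:(\s]*([01])\b", re.IGNORECASE)


def build_prompt(question: str) -> str:
    """Append the output-format instruction to the user question."""
    return question.strip() + FORMAT_SUFFIX


def parse_binary_answer(text: str):
    """Return "0" or "1" if the reply states an answer, otherwise None."""
    stripped = text.strip()
    if stripped in {"0", "1"}:
        return stripped
    matches = _ANSWER_RE.findall(stripped)
    # Use the last stated answer, since the conclusion usually comes at the end.
    return matches[-1] if matches else None


def accuracy(replies, labels) -> float:
    """Fraction of replies whose parsed answer equals the label; unparseable replies count as wrong."""
    correct = sum(parse_binary_answer(r) == str(l) for r, l in zip(replies, labels))
    return correct / len(labels) if labels else 0.0
```

Replies that never state an answer, such as an empty string or "The task is not solved.", parse to `None` and count as errors, which keeps the accuracy estimate conservative.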