dvlab-research / LLaMA-VID

Official Implementation for LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Apache License 2.0

Incomplete evaluation on MSVD-QA dataset. #52

Open XenonLamb opened 5 months ago

XenonLamb commented 5 months ago

Hi! I'm trying to reproduce the video evaluation results for llama-vid-7b-full-224-video-fps-1, but after running the provided scripts with the official checkpoint on MSVD-QA, not all of the files are predicted (see the attached screenshot, 20240109-175304). What could be the cause of it?

XenonLamb commented 5 months ago

To provide some context, here is the result file I obtained after running the evaluation script: results (2).json
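For reference, one quick way to see which MSVD-QA items are missing from a partial results file is to diff the question IDs against the ground-truth annotations. This is only a sketch; the file paths and the `id` field name are assumptions about the dataset and output layout, not the repo's exact schema.

```python
import json

# Hypothetical paths; adjust to your local MSVD-QA layout and output file.
GT_PATH = "msvd_qa/test_qa.json"  # ground-truth question/answer list
PRED_PATH = "results.json"        # partial predictions from the eval run

with open(GT_PATH) as f:
    gt = json.load(f)
with open(PRED_PATH) as f:
    preds = json.load(f)

# Assumes both files carry a per-question "id" field.
gt_ids = {item["id"] for item in gt}
pred_ids = {item["id"] for item in preds}

missing = gt_ids - pred_ids
print(f"{len(missing)} of {len(gt_ids)} questions have no prediction")
```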

yanwei-li commented 5 months ago

Hi, this happens when GPT doesn't return feedback. It may be caused by network issues or GPT response issues. First check the network and make sure GPT responds correctly, then run the evaluation script (L27-L34 here) again; it will resume on the incomplete files.
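The resume behavior described above boils down to skipping items that already have a stored GPT judgment and only re-querying the rest, so failed calls are simply retried on the next run. Below is a minimal sketch of that pattern; the file names, prompt, output layout, and the legacy (pre-1.0) `openai.ChatCompletion` call shape are assumptions, not the repo's exact code.

```python
import json
import os
import openai

OUTPUT_DIR = "gpt_eval"  # hypothetical: one JSON verdict file per question id

def evaluate_item(question, answer, prediction):
    # Legacy openai<1.0 API shape; the official script may differ.
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\nCorrect: {answer}\n"
                f"Predicted: {prediction}\nJudge yes/no and give a score 0-5."
            ),
        }],
    )
    return resp["choices"][0]["message"]["content"]

def run_eval(items):
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    for item in items:
        out_path = os.path.join(OUTPUT_DIR, f"{item['id']}.json")
        if os.path.exists(out_path):
            continue  # already judged on a previous run; skip on resume
        try:
            verdict = evaluate_item(item["question"], item["answer"], item["prediction"])
        except Exception as e:  # network/GPT failure: leave for the next run
            print(f"{item['id']} failed: {e}")
            continue
        with open(out_path, "w") as f:
            json.dump({"id": item["id"], "verdict": verdict}, f)
```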

XenonLamb commented 5 months ago

Thank you! May I ask which api_base you used for evaluation? I found that GPT's behavior seems to differ for gpt-3.5-turbo on my API base, which caused about a 7% difference in accuracy.

yanwei-li commented 5 months ago

Hi, we use a purchased API base. We tested several times and did not observe such a large gap. Are the other packages, like transformers, kept the same?
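When comparing accuracy across endpoints, it helps to record the exact API base, model, and package versions alongside the results so runs are comparable. A hedged sketch follows; the legacy `openai.api_base` attribute applies to openai<1.0 (newer clients pass `base_url` to the client constructor instead), and the key/URL values are placeholders.

```python
import openai
from importlib.metadata import version

# Pin the endpoint so runs are comparable across machines (openai<1.0 style).
openai.api_key = "sk-..."                      # your key (placeholder)
openai.api_base = "https://api.openai.com/v1"  # swap in your proxy/base if needed

# Log the package versions that the thread suggests checking.
print("openai:", version("openai"))
print("transformers:", version("transformers"))
```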

XenonLamb commented 4 months ago

> Hi, we use a purchased API base. We tested several times and did not observe such a large gap. Are the other packages, like transformers, kept the same?

Yes, the other packages are the same. The accuracies of (a) the results.json you provided in another issue, (b) the results predicted from the provided checkpoint, and (c) the results predicted from the re-implemented model are all very close.