Closed YiwuZhong closed 10 hours ago
I'm sorry to hear that you're experiencing difficulties. I'm not sure why you're unable to reproduce the result by running finetune_video_image.slurm. Could you please double-check the training data? Also, ensure that you are adhering closely to our settings.
Regarding the pretrained checkpoint, we have not yet released the model trained on the full videochat2 set, i.e., VideoLLaMB\beta as shown in Table 3. The model we have made available is VideoLLaMB\alpha, which is the default setting for all our other experiments. As reported in our paper, this model achieved a result of 49.33 on MVBench.
Thanks for your quick reply.
Yes, I've checked that the image and video data are intact, by going through json annotations and local files. I was wondering what should be the final model LlavaLlamaForCausalLM
vs. LlavaLlamaForCausalLMRMT
, and which should be the final script finetune_video_image.slurm
vs. finetune_video_image_l.slurm
. Also, have you tried using the released default code to reproduce the paper result, such as 52.5 on MVBench?
In fact, I simply copied the code into a new repository to ensure clarity. By the way, LlavaLlamaForCausalLMRMT is a legacy class that we have not utilized in the current version of our work.
Thank you for the clarification. Also, I wanted to make sure that script to reproduce paper results is finetune_video_image.slurm
, instead of finetune_video_image_l.slurm
.
I suspect there's a typographical error here; you are encouraged to use finetune_video_image_l.slurm. However, there is no significant difference between these two scripts.
Just update the script:
finetune_video_image.slurm
will call LlavaLlamaForCausalLM
finetune_video_image_loss.slurm
will call LlavaLlamaForCausalLMRMT
there is no significant difference between these two settings
I'm still unsure why you're unable to reproduce the result. However, there are a couple of hyperparameters you could adjust: in llava/model/multimodal_projector/rmt_r_transformer_projector.py
at line 350, try changing the number of scene segments to 1 for MVBench, as the videos there are relatively short.
@YiwuZhong I just fixed a bug in the memory cache and carefully re-evaluated the results. Could you please try again? If the results are still abnormally low, please don't hesitate to let me know.
@patrick-tssn Thanks for your information. One more hyperparameter: What is the batch size (the total batch size, instead of per GPU batch size) and paired learning rate? The script is different from the paper, due to the number of GPUs.
finetune_video_image.slurm
: the total batchsize is 32, and the paired learning rate is 2e-5
finetune_video_image_loss.slurm
: the total batchsize is 16, and the paired learning rate is 2e-5
@patrick-tssn The updated code and script are not able to improve the results.
Specifically, setting the number of scene segments to 1 only improves performance by 0.8 on MVBench. However, the model trained by finetune_video_image.slurm
is still lower than the paper result by around 4 points. It'd be great if you could try training a model using the current version of code. Thanks!
One more technical question: The video encoder comes from LanguageBind. Have you tried using image encoder to encode each frame individually, instead of the video encoder?
It's quite strange; in my experiments, 49.33 serves as the lower bound on MVBench. Are you utilizing the dataset from https://huggingface.co/datasets/ColorfulAI/VideoLLaMB-IT/tree/main
I have only four A800; rerunning the code could take some time.
I've tested the clip-vit but haven't observed any notable enhancements so far on EgoSchema.
Yes, I used the filtered annotations in that link. In total, there are 625K image samples, 367K video samples, and 41K language samples.
@YiwuZhong Could you please provide the config.json file from your checkpoint?
{ "X": [ "VIDEO", "IMAGE" ], "_name_or_path": "./checkpoints/llava-v1.5-7b", "architectures": [ "LlavaLlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 1, "eos_token_id": 2, "freeze_mm_mlp_adapter": false, "freeze_mm_vision_resampler": false, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "image_aspect_ratio": "pad", "initializer_range": 0.02, "intermediate_size": 11008, "max_length": null, "max_position_embeddings": 4096, "mlp_bias": false, "mm_attention_probs_dropout_prob": 0.1, "mm_hidden_act": "gelu", "mm_hidden_dropout_prob": 0.1, "mm_hidden_size": 1024, "mm_image_tower": "./checkpoints/LanguageBind_Image", "mm_intermediate_size": 4096, "mm_layer_norm_eps": 1e-12, "mm_num_attention_heads": 8, "mm_patch_merge_type": "flat", "mm_projector_lr": null, "mm_projector_type": "rmt_r_transformer1x", "mm_resampler_type": null, "mm_use_im_patch_token": false, "mm_use_im_start_end": false, "mm_use_x_patch_token": false, "mm_use_x_start_end": false, "mm_video_tower": "./checkpoints/LanguageBind_Video_merge", "mm_vision_select_feature": "patch", "mm_vision_select_layer": -2, "mm_vision_tower": "openai/clip-vit-large-patch14-336", "model_type": "llava_llama", "num_attention_heads": 32, "num_frames": 16, "num_hidden_layers": 32, "num_key_value_heads": 32, "pad_token_id": 0, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 10000.0, "tie_word_embeddings": false, "tokenizer_model_max_length": 2048, "tokenizer_padding_side": "right", "torch_dtype": "bfloat16", "transformers_version": "4.45.0.dev0", "tune_mm_mlp_adapter": false, "tune_mm_vision_resampler": false, "unfreeze_mm_vision_tower": false, "use_cache": true, "use_mm_proj": true, "vocab_size": 32000 }
I am quite confused now, as the configurations are almost the same. I have tried different checkpoints in previous runs and have hardly seen a score lower than 49.2 (only when using num_frames = 8 and k = 3, I got 48). I am rerunning the code, and it may take 25-30 hours due to the inefficiency of our disk. In the meantime, could you please change NUM_FRAMES
to 16 in mvbench.sh
to align with the training settings and re-evaluate?
@YiwuZhong Hi, Yiwu. I apologize for the inconvenience, but could you please increase the NUM_FRAMES
in mvbench.sh
to 16 (you can also decrease the k) and re-evaluate to check if everything is functioning normally? The process should take only 15-20 minutes on 4 GPUs. Due to my limited computing resources and the heavy load of current projects, I would greatly appreciate your assistance with this. Thank you for your understanding.
@patrick-tssn I've tried evaluating the models with #frames and #seg adjusted as you mentioned. But they only provided < 1.0 improvement. I guess the gap comes from the training phase, instead of the inference stage. Please let me know if you can reproduce the results once the training is done. Thanks!
@YiwuZhong Hi Yiwu, sorry for the late reply. I have rerun the repository on 4 A800 GPUs and re-evaluated with NUM_FRAMES=16
(the only change in the current repository) three times to prevent possible instability from the sampling strategy in the generation function. The results were 49.05, 49.45 49.05, which are consistent with the level reported in our paper, i.e., 49.33.
I have uploaded the environment and training logs (TensorBoard log) to Google Drive.
This is my CUDA version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
This is my OS version:
NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.5 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
I currently have no idea why this is happening.
Thanks for running the experiments. Do you mean the current repository is to reproduce the alpha model in Table 3, instead of the beta model?
@YiwuZhong ,Yes, as I stated in this issue, I haven't released the VideoLLaMB\beta. Regarding the difference, the released JSON from PLLaVA does not include the captions and QA related to WebVid videos. You can see their uploaded magicjson and these issue47, issue52. To stay aligned with their settings, we use their released magicjson. The VideoLLaMB\beta is trained on the JSON from the original VideoChat2 videos to fairly compare with VideoChat2-vicuna-7b.
@YiwuZhong Hi, Yiwu, Is there anything else I can do for you? I will do my best to assist you.
Thanks for your help and clarification above.
For now, I don't have other questions. I'll also try looking into what causes the small gap. Thanks again for sharing this good work!
Closed temporarily. Feel free to reopen if you encounter any issues.
@patrick-tssn Thanks for releasing the code!
I tried to train the model using the default code
finetune_video_image.slurm
. The result on MVBench turns out to be 47.3. I also tested the pre-trained modelvideollamb-llava-1.5-7b
which achieves 49.1. These results are obviously lower than the paper result 52.5. One difference is that the default code usesLlavaLlamaForCausalLM
, while pre-trained model usesLlavaLlamaForCausalLMRMT
.Do authors have any idea about this performance gap? Thanks.