bigai-nlco / VideoLLaMB

Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges
https://videollamb.github.io/

Cannot Reproduce Paper Results #4

Closed · YiwuZhong closed this issue 10 hours ago

YiwuZhong commented 3 days ago

@patrick-tssn Thanks for releasing the code!

I tried to train the model using the default script finetune_video_image.slurm. The result on MVBench turns out to be 47.3. I also tested the released pre-trained model videollamb-llava-1.5-7b, which achieves 49.1. Both results are clearly lower than the paper result of 52.5. One difference is that the default script uses LlavaLlamaForCausalLM, while the pre-trained model uses LlavaLlamaForCausalLMRMT.

Do authors have any idea about this performance gap? Thanks.

patrick-tssn commented 3 days ago

I'm sorry to hear that you're experiencing difficulties. I'm not sure why you're unable to reproduce the result by running finetune_video_image.slurm. Could you please double-check the training data? Also, ensure that you are adhering closely to our settings.

Regarding the pretrained checkpoint, we have not yet released the model trained on the full VideoChat2 set, i.e., VideoLLaMB-β as shown in Table 3. The model we have made available is VideoLLaMB-α, which is the default setting for all our other experiments. As reported in our paper, this model achieved 49.33 on MVBench.

YiwuZhong commented 3 days ago

Thanks for your quick reply.

Yes, I've checked that the image and video data are intact by going through the JSON annotations and local files. I was wondering which should be the final model (LlavaLlamaForCausalLM vs. LlavaLlamaForCausalLMRMT) and which should be the final script (finetune_video_image.slurm vs. finetune_video_image_l.slurm). Also, have you tried using the released default code to reproduce the paper results, such as 52.5 on MVBench?

patrick-tssn commented 3 days ago

In fact, I simply copied the code into a new repository to ensure clarity. By the way, LlavaLlamaForCausalLMRMT is a legacy class that we have not utilized in the current version of our work.

YiwuZhong commented 3 days ago

Thank you for the clarification. Also, I wanted to make sure that the script to reproduce the paper results is finetune_video_image.slurm, rather than finetune_video_image_l.slurm.

patrick-tssn commented 3 days ago

I suspect there's a typographical error here; you are encouraged to use finetune_video_image_l.slurm. However, there is no significant difference between these two scripts.

patrick-tssn commented 3 days ago

Just updated the scripts:

finetune_video_image.slurm will call LlavaLlamaForCausalLM.
finetune_video_image_loss.slurm will call LlavaLlamaForCausalLMRMT.

There is no significant difference between these two settings.

I'm still unsure why you're unable to reproduce the result. However, there are a couple of hyperparameters you could adjust: in llava/model/multimodal_projector/rmt_r_transformer_projector.py at line 350, try changing the number of scene segments to 1 for MVBench, as the videos there are relatively short.
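
For readers unfamiliar with that knob, here is a minimal sketch of what the segment count controls conceptually. The names and the uniform chunking are illustrative assumptions, not the repository's actual scene-segmentation code:

```python
# Illustrative sketch only: uniform chunking stands in for VideoLLaMB's real
# scene segmentation. With num_segments=1 the whole clip is one scene, which
# is what the suggestion above amounts to for short MVBench videos.
import torch

def split_into_scene_segments(frame_features: torch.Tensor, num_segments: int):
    """Split frame features of shape (T, D) into roughly equal segments."""
    chunks = torch.chunk(frame_features, num_segments, dim=0)
    return [c for c in chunks if c.numel() > 0]

# Example: 16 frames of 1024-dim features treated as a single segment.
feats = torch.randn(16, 1024)
segments = split_into_scene_segments(feats, num_segments=1)
assert len(segments) == 1 and segments[0].shape == (16, 1024)
```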

patrick-tssn commented 3 days ago

@YiwuZhong I just fixed a bug in the memory cache and carefully re-evaluated the results. Could you please try again? If the results are still abnormally low, please don't hesitate to let me know.

YiwuZhong commented 3 days ago

@patrick-tssn Thanks for the information. One more hyperparameter question: what is the total batch size (rather than the per-GPU batch size) and the paired learning rate? The script differs from the paper, due to the number of GPUs.

patrick-tssn commented 3 days ago

finetune_video_image.slurm: the total batch size is 32 and the paired learning rate is 2e-5.
finetune_video_image_loss.slurm: the total batch size is 16 and the paired learning rate is 2e-5.
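
For context on how such totals usually come about, here is a generic sanity-check sketch. The per-device and accumulation splits below are hypothetical, not the exact values in these slurm scripts; only the totals and the 2e-5 learning rate come from the reply above:

```python
# Generic rule of thumb for DeepSpeed/HuggingFace-style training:
# effective batch = per-device batch * gradient accumulation steps * num GPUs.
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    return per_device * grad_accum * num_gpus

# Hypothetical splits that reach the quoted totals on 4 GPUs.
assert effective_batch_size(8, 1, 4) == 32  # finetune_video_image.slurm total
assert effective_batch_size(4, 1, 4) == 16  # finetune_video_image_loss.slurm total
```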

YiwuZhong commented 2 days ago

@patrick-tssn The updated code and script do not improve the results.

Specifically, setting the number of scene segments to 1 only improves performance by 0.8 on MVBench, and the model trained by finetune_video_image.slurm is still around 4 points below the paper result. It'd be great if you could try training a model using the current version of the code. Thanks!

One more technical question: the video encoder comes from LanguageBind. Have you tried using an image encoder to encode each frame individually, instead of the video encoder?

patrick-tssn commented 2 days ago

It's quite strange; in my experiments, 49.33 serves as the lower bound on MVBench. Are you using the following dataset? https://huggingface.co/datasets/ColorfulAI/VideoLLaMB-IT/tree/main

I only have four A800 GPUs; rerunning the code could take some time.

I've tested CLIP-ViT but haven't observed any notable improvement on EgoSchema so far.

YiwuZhong commented 2 days ago

Yes, I used the filtered annotations in that link. In total, there are 625K image samples, 367K video samples, and 41K language samples.

patrick-tssn commented 2 days ago

@YiwuZhong Could you please provide the config.json file from your checkpoint?

YiwuZhong commented 2 days ago

{ "X": [ "VIDEO", "IMAGE" ], "_name_or_path": "./checkpoints/llava-v1.5-7b", "architectures": [ "LlavaLlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 1, "eos_token_id": 2, "freeze_mm_mlp_adapter": false, "freeze_mm_vision_resampler": false, "head_dim": 128, "hidden_act": "silu", "hidden_size": 4096, "image_aspect_ratio": "pad", "initializer_range": 0.02, "intermediate_size": 11008, "max_length": null, "max_position_embeddings": 4096, "mlp_bias": false, "mm_attention_probs_dropout_prob": 0.1, "mm_hidden_act": "gelu", "mm_hidden_dropout_prob": 0.1, "mm_hidden_size": 1024, "mm_image_tower": "./checkpoints/LanguageBind_Image", "mm_intermediate_size": 4096, "mm_layer_norm_eps": 1e-12, "mm_num_attention_heads": 8, "mm_patch_merge_type": "flat", "mm_projector_lr": null, "mm_projector_type": "rmt_r_transformer1x", "mm_resampler_type": null, "mm_use_im_patch_token": false, "mm_use_im_start_end": false, "mm_use_x_patch_token": false, "mm_use_x_start_end": false, "mm_video_tower": "./checkpoints/LanguageBind_Video_merge", "mm_vision_select_feature": "patch", "mm_vision_select_layer": -2, "mm_vision_tower": "openai/clip-vit-large-patch14-336", "model_type": "llava_llama", "num_attention_heads": 32, "num_frames": 16, "num_hidden_layers": 32, "num_key_value_heads": 32, "pad_token_id": 0, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 10000.0, "tie_word_embeddings": false, "tokenizer_model_max_length": 2048, "tokenizer_padding_side": "right", "torch_dtype": "bfloat16", "transformers_version": "4.45.0.dev0", "tune_mm_mlp_adapter": false, "tune_mm_vision_resampler": false, "unfreeze_mm_vision_tower": false, "use_cache": true, "use_mm_proj": true, "vocab_size": 32000 }

patrick-tssn commented 2 days ago

I am quite confused now, as the configurations are almost identical. I have tried different checkpoints in previous runs and have hardly ever seen a score lower than 49.2 (only when using num_frames = 8 and k = 3 did I get 48). I am rerunning the code, and it may take 25-30 hours due to our slow disk. In the meantime, could you please change NUM_FRAMES to 16 in mvbench.sh to align with the training settings and re-evaluate?
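
Roughly speaking, NUM_FRAMES controls how many frames are sampled per video at evaluation time, so NUM_FRAMES=16 matches the num_frames=16 used in training. A minimal sketch of uniform sampling, as an illustration rather than the repository's actual loader:

```python
# Illustrative uniform frame sampling; not the repository's data loader.
import numpy as np

def uniform_frame_indices(total_frames: int, num_frames: int) -> np.ndarray:
    """Pick num_frames indices spread evenly across the video."""
    if total_frames <= num_frames:
        return np.arange(total_frames)
    return np.linspace(0, total_frames - 1, num_frames).round().astype(int)

print(uniform_frame_indices(240, 16))  # 16 evenly spaced indices in [0, 239]
```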

patrick-tssn commented 1 day ago

@YiwuZhong Hi, Yiwu. I apologize for the inconvenience, but could you please increase the NUM_FRAMES in mvbench.sh to 16 (you can also decrease the k) and re-evaluate to check if everything is functioning normally? The process should take only 15-20 minutes on 4 GPUs. Due to my limited computing resources and the heavy load of current projects, I would greatly appreciate your assistance with this. Thank you for your understanding.

YiwuZhong commented 1 day ago

@patrick-tssn I've tried evaluating the models with #frames and #seg adjusted as you suggested, but they only yield an improvement of less than 1.0 point. I guess the gap comes from the training phase rather than the inference stage. Please let me know whether you can reproduce the results once the training is done. Thanks!

patrick-tssn commented 22 hours ago

@YiwuZhong Hi Yiwu, sorry for the late reply. I have rerun the repository on 4 A800 GPUs and re-evaluated with NUM_FRAMES=16 (the only change in the current repository) three times to account for possible instability from the sampling strategy in the generation function. The results were 49.05, 49.45, and 49.05, which are consistent with the level reported in our paper, i.e., 49.33.
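
For reference, the spread across those three runs is small; a quick check of the arithmetic:

```python
# Quick arithmetic on the three reported scores; nothing project-specific.
scores = [49.05, 49.45, 49.05]
mean = sum(scores) / len(scores)
spread = max(scores) - min(scores)
print(f"mean={mean:.2f}, spread={spread:.2f}")  # mean=49.18, spread=0.40
```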

I have uploaded the environment and training logs (TensorBoard log) to Google Drive.

This is my CUDA version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

This is my OS version:

NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.5 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

I currently have no idea why this is happening.

YiwuZhong commented 22 hours ago

Thanks for running the experiments. Do you mean the current repository reproduces the α model in Table 3, rather than the β model?

patrick-tssn commented 21 hours ago

@YiwuZhong Yes, as I stated earlier in this issue, I haven't released VideoLLaMB-β. Regarding the difference: the released JSON from PLLaVA does not include the captions and QA related to WebVid videos; see their uploaded magicjson and their issues 47 and 52. To stay aligned with their settings, we use their released magicjson. VideoLLaMB-β is trained on the JSON from the original VideoChat2 videos to compare fairly with VideoChat2-vicuna-7b.

patrick-tssn commented 19 hours ago

@YiwuZhong Hi Yiwu, is there anything else I can do for you? I will do my best to assist you.

YiwuZhong commented 19 hours ago

Thanks for your help and clarification above.

For now, I don't have other questions. I'll also try looking into what causes the small gap. Thanks again for sharing this good work!

patrick-tssn commented 10 hours ago

Closing this temporarily. Feel free to reopen if you encounter any issues.