DAMO-NLP-SG / VideoLLaMA2

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Apache License 2.0

Cannot reproduce results on vllava datasets #81

Open williamium3000 opened 2 months ago

williamium3000 commented 2 months ago

Dear authors of VideoLLaMA2, thanks for the great work. We tried to reproduce your results on the vllava datasets using the latest version of the code. However, we observe a large discrepancy on all three test sets.

| Model | MVBench | Egoschema | ActivityNet | Avg |
| --- | --- | --- | --- | --- |
| reported | 45.5 | 42.2 | 47.6 | 45.1 |
| reproduced | 44.475 | 38.5 | 43.55 | 42.175 |

We directly used your code and followed your instructions to download the vllava datasets as well as the three test sets, i.e., MVBench, Egoschema, and ActivityNet.

Could you give a hint on how you achieved the average of 45.1?

Best, Yijiang

clownrat6 commented 2 months ago

Please use the new scripts to train our VideoLLaMA2 under the Video-LLaVA setting. The previous scripts adopt another projector, which is not consistent with the projector used in this experiment.

williamium3000 commented 2 months ago

I am using the latest version, unless you updated within the last two days; my results were obtained with the code pulled last weekend. I used pretrain.sh and finetune.sh in scripts/vllava.

lixin4ever commented 2 months ago

> Please use the new scripts to train our VideoLLaMA2 under the Video-LLaVA setting. The previous scripts adopt another projector, which is not consistent with the projector used in this experiment.

Hi Yijiang, my colleague may not have stated this clearly :joy: We updated the fine-tuning script this afternoon; please check the latest commit and launch your training jobs (on the Video-LLaVA dataset) with the new script.

williamium3000 commented 2 months ago

Oh thanks!!! Sorry for the misunderstanding. I will try tonight.

williamium3000 commented 2 months ago
Hi @lixin4ever and @clownrat6, we have switched to connector-v35 but still cannot reproduce the results. The numbers are even lower than in the first run.

| Model | MVBench | Egoschema | ActivityNet | Avg |
| --- | --- | --- | --- | --- |
| reported | 45.5 | 42.2 | 47.6 | 45.1 |
| reproduced | 44.475 | 38.5 | 43.55 | 42.175 |
| reproduced (connector-v35) | 43.5 | 35.38 | 41.48 | 40.12 |
williamium3000 commented 2 months ago

We have attached all the generated JSON files here: config.json, generation_config.json, model.safetensors.index.json, special_tokens_map.json, tokenizer.json, tokenizer_config.json, trainer_state.json

williamium3000 commented 2 months ago

env we use: name: videollama channels:

zhuqiangLu commented 1 month ago

Hi, I am going to reproduce this experiment. May I ask how many GPUs you used and how many days it took to run?

lixin4ever commented 1 month ago

> Hi @lixin4ever and @clownrat6, we have switched to connector-v35 but still cannot reproduce the results. The numbers are even lower than in the first run.
>
> | Model | MVBench | Egoschema | ActivityNet | Avg |
> | --- | --- | --- | --- | --- |
> | reported | 45.5 | 42.2 | 47.6 | 45.1 |
> | reproduced | 44.475 | 38.5 | 43.55 | 42.175 |
> | reproduced (connector-v35) | 43.5 | 35.38 | 41.48 | 40.12 |

Hi Yijiang, we found that the latest codebase, which was migrated from the older one (i.e., v1.0) to be better compatible with Qwen2 (and other recent LLMs), indeed suffers from performance degradation when switching the language decoder to Mistral. However, due to a lack of resources, we temporarily have no GPUs to verify whether the code migration caused this issue. We will continue the verification in early October; please stay tuned.

lixin4ever commented 1 month ago

> Hi, I am going to reproduce this experiment. May I ask how many GPUs you used and how many days it took to run?

Two A100/A800 nodes (i.e., 16 GPUs) for < 20 hours (pretraining + fine-tuning)

zhuqiangLu commented 1 month ago

> Hi, I am going to reproduce this experiment. May I ask how many GPUs you used and how many days it took to run?
>
> Two A100/A800 nodes (i.e., 16 GPUs) for < 20 hours (pretraining + fine-tuning)

Oh, that is much faster than I thought, thank you. Are you training the full model or using LoRA?

zhuqiangLu commented 1 month ago

> Hi, I am going to reproduce this experiment. May I ask how many GPUs you used and how many days it took to run?
>
> Two A100/A800 nodes (i.e., 16 GPUs) for < 20 hours (pretraining + fine-tuning)

Also, I just tried 8xA100 for the pretraining stage, and the estimated time is about 48 hours. Could you please clarify whether the pretraining stage should include both the Valley and LLaVA-Image datasets?

lixin4ever commented 1 month ago

Yes, both Valley and LLaVA-Image should be included.

Regarding the time cost, I just checked the pretraining log of one run and it took around 8 hours on 2 A800 nodes (i.e., 16 80G-A800s).

zhuqiangLu commented 1 month ago

> Yes, both Valley and LLaVA-Image should be included.
>
> Regarding the time cost, I just checked the pretraining log of one run and it took around 8 hours on 2 A800 nodes (i.e., 16 80G-A800s).

Thank you for your response. May I ask for the checkpoint of the model trained on the Valley dataset? I am keen to see how it performs on my custom dataset.

williamium3000 commented 1 month ago

> Hi, I am going to reproduce this experiment. May I ask how many GPUs you used and how many days it took to run?

I used 8 A800 80G GPUs. The local and global batch sizes follow the scripts.

zhuqiangLu commented 1 month ago

Hi there, I have finished my experiment to reproduce the result with vllava. My results seem consistent with the reported ones.

| Model | MVBench | Egoschema | ActivityNet |
| --- | --- | --- | --- |
| reported | 45.5 | 42.2 | 47.6 |
| reproduced | 45.65 | 43.5 | - |

I will update the performance on ActivityNet later today; the dataset is still downloading.

I have made TWO modifications:

  1. Fine-tuning with LoRA, r = 128, alpha = 256
  2. The vllava dataset was downloaded from here

@williamium3000 I compared my trainer_state.json with yours and noticed that your grad_norm values are much higher than mine. In addition, my loss values are slightly lower. (A minimal sketch for comparing the two logs is included below.)

Here is my trainer_state.json

I ran my experiment on an 8 x L40S machine.
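
For anyone who wants to make the same comparison, here is a minimal sketch for reading the two logs (assuming the standard Hugging Face Trainer `trainer_state.json` layout, where `log_history` is a list of per-step dicts; the file paths are placeholders):

```python
import json

def load_log_history(path):
    """Read the per-step training log entries from a Hugging Face Trainer trainer_state.json."""
    with open(path) as f:
        state = json.load(f)
    # Keep only entries that logged a training loss (skips the final summary entry).
    return [entry for entry in state["log_history"] if "loss" in entry]

# Placeholder paths: point these at the two trainer_state.json files being compared.
mine = load_log_history("my_run/trainer_state.json")
theirs = load_log_history("their_run/trainer_state.json")

for name, logs in [("mine", mine), ("theirs", theirs)]:
    losses = [e["loss"] for e in logs]
    # grad_norm is only logged by newer transformers versions, so guard for it.
    grad_norms = [e["grad_norm"] for e in logs if "grad_norm" in e]
    summary = f"{name}: steps={len(logs)}, mean loss={sum(losses) / len(losses):.4f}"
    if grad_norms:
        summary += f", mean grad_norm={sum(grad_norms) / len(grad_norms):.4f}"
    print(summary)
```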

zhuqiangLu commented 1 month ago

Here is the Egoschema evaluation result from Kaggle (screenshot attached).

williamium3000 commented 1 month ago

@zhuqiangLu How did you train with LoRA? Do you mean that you first pretrained on the Valley pretraining dataset and then used LoRA to fine-tune on the SFT dataset?

zhuqiangLu commented 1 month ago

> @zhuqiangLu How did you train with LoRA? Do you mean that you first pretrained on the Valley pretraining dataset and then used LoRA to fine-tune on the SFT dataset?

That is right. The pretraining is done on the Valley dataset with the official script. To enable LoRA in SFT, simply add `--lora_enable True --lora_r 128 --lora_alpha 256` to the fine-tuning script:


```bash
torchrun --nnodes $WORLD_SIZE \
    --nproc_per_node $NPROC_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    --node_rank $RANK \
    videollama2/train_flash_attn.py \
    --lora_enable True \
    --lora_r 128 \
    --lora_alpha 256 \
    --deepspeed scripts/zero3.json \
    --model_type videollama2 \
    --model_path mistralai/Mistral-7B-Instruct-v0.2 \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type stc_connector_v35 \
    --pretrain_mm_mlp_adapter ${OUTP_DIR}/${WANDB_PROJECT}/pretrain_${RUN_NAME}/mm_projector.bin \
    --data_path   ${DATA_DIR}/videollava_sft/videochatgpt_llavaimage_tune.json \
    --data_folder ${DATA_DIR}/videollava_sft/ \
    --mm_vision_select_layer -2 \
    --image_aspect_ratio pad \
    --num_frames 8 \
    --bf16 True \
    --tf32 True \
    --fp16 False \
    --output_dir ${OUTP_DIR}/${WANDB_PROJECT}/finetune_${RUN_NAME} \
    --num_train_epochs 1 \
    --per_device_train_batch_size $LOCAL_BATCH_SIZE \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 99 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --report_to tensorboard \
    --run_name $RUN_NAME
```
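
As a side note, if you evaluate a LoRA run outside the repo's own scripts, the adapter usually needs to be merged back into the base weights first. Below is a minimal, hypothetical sketch using Hugging Face PEFT; VideoLLaMA2 ships its own model classes and loading utilities, so the base-model class and the paths here are placeholders rather than the repo's actual API:

```python
# Hypothetical sketch: fold a PEFT LoRA adapter back into its base model before evaluation.
# The AutoModel class and the directories are placeholders; VideoLLaMA2 defines its own
# multimodal model classes and loaders, so adapt this to the repo's utilities.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",  # language backbone used in the script above
    torch_dtype=torch.bfloat16,
)

# Load the adapter produced by the fine-tuning run and merge it into the base weights.
model = PeftModel.from_pretrained(base, "path/to/finetune_RUN_NAME")  # placeholder adapter dir
model = model.merge_and_unload()

model.save_pretrained("path/to/finetune_RUN_NAME_merged")  # merged checkpoint for evaluation
```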
williamium3000 commented 1 month ago

I believe their original result was produced by full fine-tuning. I have not tried LoRA, though.

clownrat6 commented 1 month ago

Using the v1.0 tag of the codebase, I retrained VideoLLaMA2 with stc_connector_v35 on the videollava dataset. It seems that the v1.0 version can reproduce the results in the paper.

git checkout v1.0
bash scripts/vllava/stc/pretrain.sh
bash scripts/vllava/stc/finetune.sh
The results are:

| Model | MVBench | Egoschema |
| --- | --- | --- |
| reported | 45.5 | 42.2 |
| reproduced (v1.0 tag) | 44.1 | 41.6 |

The training loss curve: [image]

The fine-tuning loss curve: [image]

The evaluation results: [images]

williamium3000 commented 1 month ago

I followed @zhuqiangLu and tried to reproduce his result with LoRA.

| Model | MVBench | Egoschema | ActivityNet |
| --- | --- | --- | --- |
| reported | 45.5 | 42.2 | 47.6 |
| reproduced by @zhuqiangLu | 45.65 | 43.5 | - |
| my LoRA | 43.625 | 41.62 | 44.17 |

Although it is higher than my full fine-tuning result, it is still behind the reported numbers and the results by @zhuqiangLu.

zhuqiangLu commented 4 weeks ago

I was wondering whether it could be the dataset. My vllava dataset was downloaded from Hugging Face, and then I simply used the JSON files provided by VideoLLaMA2.

clownrat6 commented 4 weeks ago

> I followed @zhuqiangLu and tried to reproduce his result with LoRA.
>
> | Model | MVBench | Egoschema | ActivityNet |
> | --- | --- | --- | --- |
> | reported | 45.5 | 42.2 | 47.6 |
> | reproduced by @zhuqiangLu | 45.65 | 43.5 | - |
> | my LoRA | 43.625 | 41.62 | 44.17 |
>
> Although it is higher than my full fine-tuning result, it is still behind the reported numbers and the results by @zhuqiangLu.

Could you please try the v1.0 tag of the code? You can check out this version with `git checkout v1.0`, which achieves higher performance than the main branch.

williamium3000 commented 3 weeks ago

Yeah, we have observed a similar trend; v1.0 does seem better. However, I am wondering how @zhuqiangLu achieved such a high result. @zhuqiangLu, did you use v1.0 as well?

zhuqiangLu commented 3 weeks ago

It was the default branch, but I have no GPUs available for this training right now. I will post an update when I finish training with the v1.0 branch.