RenShuhuai-Andy / TimeChat

[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
https://arxiv.org/abs/2312.02051
BSD 3-Clause "New" or "Revised" License

Ask for reproducing #40

Open · HYOJINPARK opened this issue 3 weeks ago

HYOJINPARK commented 3 weeks ago

Hi, thanks for the great code and amazing work.

I've been doing my best to reproduce similar performance, but I'm running into some trouble. Could I get your advice?

  1. Download datasets. Unfortunately, some of the datasets are impossible to obtain; this is what I was able to download so far (see the availability-check sketch after this list).

TimeIT

total number of files: 104403, success: 101377, lost: 3026
- yttemporal180m: total 31627, valid 31197
- DiDeMo: total 33002, valid 32954
- ActivityNet_asr_denseCap: total 10009, valid 9057
- vitt: total 5141, valid 5057
- COIN: total 9029, valid 7760
- QuerYD: total 14602, valid 14429
- HiREST: total 918, valid 874
- TVSum: total 50, valid 49
- SumMe: total 25, valid 0

Valley

total number of files: 72303, success: 14906, lost: 57397
- vatex: total 36710, valid 14906
- jukin: total 35593, valid 0

  2. ActivityNet. I downloaded the ActivityNet dataset from their private Google Drive, merged V1_2, V1_3, and missing_files (train and val, excluding test), and ran utils/compress_video_data.py. However, the number of valid videos is only 9057, not 10009, even though all 10009 videos are present in the "Anet_videos_15fps_short256" folder.

  3. Preprocessing. I followed Data.md for utils/process_hirest.py and utils/process_valley.py, and ran utils/compress_video_data.py only for ActivityNet (a re-encoding sketch follows this list).

  4. Reproducing results

[image: table of reproduced results for result1/result2/result3]

- result1: without ActivityNet
- result2: with ActivityNet, but without ActivityNet preprocessing
- result3: with ActivityNet and with ActivityNet preprocessing

  5. Training setup. I use 8 GPUs and follow the same training config file (stage2_finetune_time104k_valley72k.yaml).
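
For anyone re-checking the download counts above, here is a minimal Python sketch that counts which videos referenced in an annotation file actually exist on disk. The annotation path, the video root, and the "video" key are placeholders (the real TimeIT/Valley annotation layout may differ), so adjust them before use.

```python
import json
from pathlib import Path

def count_valid_videos(annotation_json, video_root, key="video"):
    """Count how many videos referenced in an annotation file exist on disk.

    The JSON layout (a list of records with a `video` field holding a
    relative path) is an assumption; adjust `key` to match the real files.
    """
    with open(annotation_json) as f:
        records = json.load(f)

    root = Path(video_root)
    missing = [r[key] for r in records if not (root / r[key]).exists()]
    valid = len(records) - len(missing)
    print(f"total: {len(records)}  valid: {valid}  lost: {len(missing)}")
    return missing

# Hypothetical usage -- paths depend on where the datasets were extracted:
# count_valid_videos("TimeIT/vitt/annotations.json", "TimeIT/vitt/videos")
```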
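
And for the ActivityNet preprocessing in step 3: the folder name "Anet_videos_15fps_short256" suggests the videos are re-encoded to 15 fps with a 256-pixel short side. The snippet below only approximates what utils/compress_video_data.py does; the exact codec, flags, and audio handling are assumptions, so check the repo script before relying on it.

```python
import subprocess
from pathlib import Path

def compress_video(src, dst, fps=15, short_side=256):
    """Re-encode a video to a lower frame rate and resolution with ffmpeg.

    This is only an approximation of utils/compress_video_data.py:
    15 fps and a 256-pixel short side are inferred from the folder name
    "Anet_videos_15fps_short256"; other settings are guesses.
    """
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    # Scale whichever side is shorter to `short_side`, keeping aspect ratio
    # and even dimensions (required by libx264).
    scale = (
        f"scale='if(lt(iw,ih),{short_side},-2)':'if(lt(iw,ih),-2,{short_side})'"
    )
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", scale, "-r", str(fps),
         "-c:v", "libx264", "-an", dst],  # -an drops audio; the original script may keep it
        check=True,
    )

# compress_video("raw/v_abc.mp4", "Anet_videos_15fps_short256/v_abc.mp4")
```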

Questions

  1. Do you have results without using the Valley dataset?
  2. Should I run utils/compress_video_data.py for all the other datasets?
  3. Sometimes I get this warning in the log; does it matter? "Failed to load examples with video: ....dataset/vatex/videos/--SOz3xjWfA_000037_000047.mp4. Will randomly sample an example as a replacement."
  4. How did you download the jukin dataset?

Thanks for reading

RenShuhuai-Andy commented 3 weeks ago

Hi, thanks for your interest.

  1. Unfortunately, we don't have results w/o Valley, but I believe it mainly contributes to general video tasks (e.g., QA, captioning) rather than the time-sensitive tasks (e.g., temporal grounding). If you focus on the latter, it's OK to use only TimeIT.

  2. Using utils/compress_video_data.py helps accelerate data loading and processing, so you can run it if you want :) For ActivityNet, I remember it prints a lot of warning messages if you don't use utils/compress_video_data.py.

  3. This message means the target video is broken or missing, so the program samples another video as a replacement. Generally it doesn't matter if broken/missing videos are rare; otherwise model performance may suffer because too many training samples are missing. You can pre-scan your videos to quantify how many are unreadable; see the sketch after this list.

  4. Please refer to https://huggingface.co/datasets/luoruipu1/Valley-Instruct-65k
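
Regarding point 3, a quick way to quantify how many files will trigger that warning is to try opening every video once before training. The sketch below uses decord as the decoding backend, which is an assumption based on similar Video-LLaMA-style codebases; any failure to construct a VideoReader marks the file as broken.

```python
from pathlib import Path
from decord import VideoReader  # assumed decoding backend; cv2 would work similarly

def scan_broken_videos(video_dir, exts=(".mp4", ".mkv", ".webm", ".avi")):
    """Try to open every video once and collect the files that fail to decode,
    i.e. the ones that would be skipped and randomly replaced during training."""
    broken = []
    for path in sorted(Path(video_dir).rglob("*")):
        if path.suffix.lower() not in exts:
            continue
        try:
            VideoReader(str(path), num_threads=1)
        except Exception:
            broken.append(str(path))
    print(f"{len(broken)} unreadable videos under {video_dir}")
    return broken

# e.g. scan_broken_videos("dataset/vatex/videos")
```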

HYOJINPARK commented 3 weeks ago

Hi @RenShuhuai-Andy

Thanks for your reply. Actually, the download code at https://huggingface.co/datasets/luoruipu1/Valley-Instruct-65k for the jukin dataset does not work anymore. I guess they have blocked downloading the dataset.

Do you use compress_video_data.py for every video dataset? Also, which ActivityNet version is used? Actually, I was surprised that the accuracy increased after applying preprocessing to ActivityNet (36.9 -> 39.0). Even so, it is still low.

RenShuhuai-Andy commented 3 weeks ago

> I guess they have blocked downloading the dataset.

Sorry to hear that. You could check whether there is another way to download the dataset.

> Do you use compress_video_data.py for every video dataset?

No, we only use compress_video_data.py for YouCook2 and ActivityNet.

> Which ActivityNet version is used?

Release 1.3 (the latest release), if I remember correctly.

> Actually, I was surprised that the accuracy increased after applying preprocessing to ActivityNet (36.9 -> 39.0).

Yes, that can happen.

> Even so, it is still low.

Actually, I'm confused by the table you posted.

What's the eval dataset? Charades-STA?

What's the training dataset? Only Charades for result1, and Charades+ActivityNet for result2 and result3?

HYOJINPARK commented 3 weeks ago

Yes, the eval dataset is Charades-STA. The evaluation is zero-shot, so I trained on TimeIT and Valley (only a partial vatex set), following stage2_finetune_time104k_valley72k.yaml.

Result1 is TimeIT (without ActivityNet) + vatex. Result2 and Result3 are TimeIT (with ActivityNet) + vatex.

I did not use Charades-STA, and the zero-shot performance is 32.2 (IoU=0.5) and 13.4 (IoU=0.7), as in Table 2.
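
For clarity on the numbers being compared here (R@1 at IoU 0.5/0.7 on Charades-STA), the metric boils down to the following; the repo's evaluation script may differ in details, but this is the standard computation.

```python
def temporal_iou(pred, gt):
    """IoU between a predicted and a ground-truth (start, end) segment in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, threshold=0.5):
    """R@1: percentage of queries whose top-1 predicted segment reaches IoU >= threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)

# e.g. recall_at_1([(0.0, 10.2)], [(1.5, 9.8)], threshold=0.5) -> 100.0
```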

RenShuhuai-Andy commented 5 days ago

Hi, sorry for the late reply.

The reproduced performance is indeed much lower. According to our ablation study in Table 7, we can achieve 34.9 R@1 (IoU=0.5) with only DVC and TVG data.

Can you post your training config? What about increasing the training steps (e.g., doubling them)?
