Open HYOJINPARK opened 3 weeks ago
Hi, thanks for your interest.
Unfortunately, we don't have results w/o Valley, but I believe that it mainly contributes to general video tasks (e.g., qa, captioning), instead of the time-sensitive tasks (e.g., temporal grounding). If you focus on the latter type of tasks, it's ok to only use TimeIT.
Using `utils/compress_video_data.py` helps accelerate data loading and processing, so you can use it if you want :) For ActivityNet, I remember that it will print a lot of warning messages if you don't use `utils/compress_video_data.py`.
This message means that the target video is broken or missing, so the program will sample another video as a replacement. Generally, it doesn't matter if broken/missing videos are rare. Otherwise, model performance may suffer, since too many training samples would be missing.
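The resampling behavior described above can be sketched as follows (a minimal illustration, not the repo's actual loader; `loader` and the dataset layout are hypothetical):

```python
import random

class VideoDataset:
    """Minimal sketch of a dataset that resamples when a video is broken/missing."""

    def __init__(self, paths, loader, max_retries=10):
        self.paths = paths
        self.loader = loader          # callable: path -> frames, raises on failure
        self.max_retries = max_retries

    def __getitem__(self, idx):
        for _ in range(self.max_retries):
            try:
                return self.loader(self.paths[idx])
            except Exception:
                # video broken or missing: fall back to a random replacement index
                idx = random.randrange(len(self.paths))
        raise RuntimeError("too many broken videos in a row")
```

With this scheme, a few unreadable files only add warnings, but if a large fraction of the dataset is broken, many samples get silently replaced by duplicates of the healthy ones.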
Please refer to https://huggingface.co/datasets/luoruipu1/Valley-Instruct-65k
Hi @RenShuhuai-Andy
Thanks for your reply. Actually, the link "https://huggingface.co/datasets/luoruipu1/Valley-Instruct-65k" for the Jukin dataset does not work anymore. I guess they blocked downloading the dataset.
Do you use compress_video_data.py for every video dataset? Also, which ActivityNet version is used? Actually, I was surprised that the accuracy increased after applying the preprocessing to ActivityNet (36.9 → 39.0). Even so, it is still low accuracy.
> I guess they blocked downloading the dataset.

Sorry to hear that. Perhaps you can find another way to download the dataset.
> Do you use compress_video_data.py for every video dataset?

No, we only use `compress_video_data.py` for YouCook2 and ActivityNet.
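For reference, the kind of compression suggested by the `Anet_videos_15fps_short256` folder name mentioned in this thread (15 fps, short side 256) can be sketched as an ffmpeg command. This is a guess at the script's effect, not its actual implementation; the helper name and defaults are made up:

```python
import shlex

def build_compress_cmd(src, dst, fps=15, short_side=256):
    """Build an ffmpeg command that re-encodes a video to a low frame rate and
    a small short side -- the kind of compression that speeds up data loading.
    The scale filter keeps the aspect ratio: for landscape videos the height
    becomes `short_side` and the width is computed (-2 = auto, even)."""
    vf = f"scale='if(gt(iw,ih),-2,{short_side})':'if(gt(iw,ih),{short_side},-2)'"
    return ["ffmpeg", "-y", "-i", src, "-vf", vf, "-r", str(fps), dst]

# Example: print a shell-ready command line
print(shlex.join(build_compress_cmd("in.mp4", "out.mp4")))
```

Smaller, lower-fps videos decode much faster during training, which is likely why the script also suppresses the decoder warnings mentioned above.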
> Which ActivityNet version is used?

Release 1.3 (the latest release), if I remember correctly.
> Actually, I was surprised that the accuracy increased after applying the preprocessing to ActivityNet (36.9 → 39.0).

Yes, it may happen.

> Even so, it is still low accuracy.
Actually, I'm confused about the table you posted.
What's the eval dataset? Charades-STA?
What's the training dataset? Only Charades for Result 1, and Charades+ActivityNet for Results 2 and 3?
Yes, it is Charades-STA. It is zero-shot, so I used TimeIT and Valley (only a part of VATEX), following stage2_finetune_time104k_valley72k.yaml.
Result 1 is TimeIT (without ActivityNet) + VATEX; Results 2 and 3 are TimeIT (with ActivityNet) + VATEX.
I did not use Charades-STA for training, and the zero-shot performance is 32.2 (IoU=0.5) and 13.4 (IoU=0.7), following Table 2.
Hi, sorry for the late reply.
The reproduced performance is indeed much lower. According to our ablation study in Table 7, we can achieve 34.9 R@1 (IoU=0.5) with only DVC and TVG data.
Can you post your training config? What about increasing the training steps (e.g., doubling them)?
Hi, thanks for the great code and amazing work!
I'm doing my best to reproduce similar performance, but I've run into some trouble. Could I get your advice?
- TimeIT
- Valley
- ActivityNet: I downloaded the ActivityNet dataset from their private Google Drive, merged v1_2, v1_3, and missing_files (train and val, excluding test), and ran utils/compress_video_data.py. However, the number of valid videos is only 9057, not 10009, even though all 10009 videos are present in the "Anet_videos_15fps_short256" folder.
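As a first check on the 9057-vs-10009 gap, it helps to separate truly missing or zero-byte files from files that exist but fail to decode. A minimal stdlib sketch (the function name, ID scheme, and extensions are illustrative):

```python
from pathlib import Path

def find_unreadable(video_dir, expected_ids, exts=(".mp4", ".mkv", ".webm")):
    """Report which expected video IDs are missing or zero-byte on disk.
    This is only a filesystem check; it cannot detect corrupt-but-nonempty files."""
    video_dir = Path(video_dir)
    found = {}
    for p in video_dir.iterdir():
        if p.suffix.lower() in exts:
            found[p.stem] = p
    missing = [vid for vid in expected_ids if vid not in found]
    empty = [vid for vid in expected_ids
             if vid in found and found[vid].stat().st_size == 0]
    return missing, empty
```

If both lists come back empty, the ~950 rejected videos are likely files that open but fail to decode, which would need a decoder-level check (e.g., probing each file) rather than this filesystem scan.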
Preprocessing: I followed Data.md for utils/process_hirest.py and utils/process_valley.py, and ran utils/compress_video_data.py only for ActivityNet.
Reproducing results
Result 1 is without ActivityNet; Result 2 is with ActivityNet but without the ActivityNet preprocessing; Result 3 is with ActivityNet and with the preprocessing.
Question
Thanks for reading