farewellthree / STAN

Official PyTorch implementation of the paper "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring"
Apache License 2.0

Has anyone reproduced successfully? #10

Open Arsiuuu opened 8 months ago

farewellthree commented 8 months ago

The source code was based on MMCV 1.4, and I have successfully reproduced the results on MMCV 2.0. However, some people have reported that the results are hard to reproduce in the popular CLIP4Clip codebase. I will release the MMCV 2.0 training pipeline soon to address this issue.

Arsiuuu commented 8 months ago

Really looking forward to it, as well as the BT-Adapter! Best wishes.

Arsiuuu commented 8 months ago

@farewellthree I see the code has been updated. Is this the final version?

farewellthree commented 8 months ago

Yes, it is the final version for STAN. The README has also been updated. The code for BT-Adapter will come soon.

Arsiuuu commented 8 months ago

Thanks for your generosity. But when I tried to train STAN-B16 on MSR-VTT, R@1 was only 35.4 after 10 epochs, much lower than the 50.0/54.0 reported in the paper. I used 4 A100 GPUs, so I only changed the batch size to 32 per GPU to match the total of 128 mentioned in the paper, and followed the default config (https://github.com/farewellthree/STAN/blob/main/configs/exp/stan/stan_msrvtt_b16_hf.py) for everything else. Is something wrong here? Thank you~
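The batch-size arithmetic in this comment can be sketched as follows. This is a minimal illustration that assumes plain data-parallel training with no gradient accumulation; `per_gpu_batch_size` is a hypothetical helper, not a function from the STAN repository.

```python
def per_gpu_batch_size(global_batch: int, num_gpus: int) -> int:
    """Return the per-GPU batch size that gives `global_batch` in total.

    Assumption: every GPU processes the same number of samples per step
    and there is no gradient accumulation.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if global_batch % num_gpus != 0:
        raise ValueError("global_batch must be divisible by num_gpus")
    return global_batch // num_gpus


# Numbers from the comment: a total batch of 128 on 4 GPUs.
print(per_gpu_batch_size(128, 4))
```

With the values from the comment, the helper returns 32 per GPU, which matches the change described.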

farewellthree commented 8 months ago

35.4 is much lower than the CLIP baseline, so there must be something wrong. What about the results of CLIP-B16 or STAN-B32?

Arsiuuu commented 8 months ago

@farewellthree The best results on STAN-B32: Epoch(val) [6][16/16] retrieval/R1: 45.4000 retrieval/R5: 72.4000 retrieval/R10: 81.8000 retrieval/MdR: 2.0000 retrieval/MnR: 14.3850 data_time: 0.1501 time: 0.3290
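To compare runs across epochs or backbones, the retrieval metrics can be pulled out of a validation log line like the one above. The following is a minimal sketch that assumes the log keeps the `retrieval/<name>: <value>` layout shown in this comment; `parse_retrieval_metrics` is a hypothetical helper, not part of the STAN codebase.

```python
import re

# Validation line copied from the comment above.
LOG_LINE = (
    "Epoch(val) [6][16/16] retrieval/R1: 45.4000 retrieval/R5: 72.4000 "
    "retrieval/R10: 81.8000 retrieval/MdR: 2.0000 retrieval/MnR: 14.3850 "
    "data_time: 0.1501 time: 0.3290"
)

# Matches "retrieval/<name>: <number>" pairs only, so timing fields are skipped.
METRIC_RE = re.compile(r"retrieval/(\w+):\s*([0-9]+(?:\.[0-9]+)?)")


def parse_retrieval_metrics(line: str) -> dict[str, float]:
    """Return a mapping such as {"R1": 45.4, ...} from one log line."""
    return {name: float(value) for name, value in METRIC_RE.findall(line)}


metrics = parse_retrieval_metrics(LOG_LINE)
print(metrics)
```

Applying the same function to every `Epoch(val)` line in a log would show at a glance which epoch reached the best R@1.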

farewellthree commented 8 months ago

I have tried B16 again and got normal results. Did you load the pretrained weights from Hugging Face correctly?

Arsiuuu commented 8 months ago

I didn't modify any code except the batch size. Could you please tell me how I should check?
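One generic way to check whether pretrained weights were picked up is to compare the parameter names the model expects with the names stored in the checkpoint. In PyTorch, `load_state_dict(strict=False)` reports this as `missing_keys` and `unexpected_keys`. The sketch below mimics that comparison in plain Python; the key names are made-up placeholders rather than STAN's actual parameter names, and `compare_keys` is a hypothetical helper.

```python
def compare_keys(model_keys, checkpoint_keys):
    """Report names the model expects but the checkpoint lacks, and vice versa."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    return {
        "missing": sorted(model_keys - checkpoint_keys),
        "unexpected": sorted(checkpoint_keys - model_keys),
    }


# Toy parameter names for illustration only (an assumption, not STAN's keys).
model_keys = ["visual.proj.weight", "visual.proj.bias", "text.embed.weight"]
checkpoint_keys = ["visual.proj.weight", "text.embed.weight", "logit_scale"]

report = compare_keys(model_keys, checkpoint_keys)
print(report)
```

Parameters that a model adds on top of the pretrained backbone are expected to show up as missing. A long list of missing backbone parameters, on the other hand, would suggest the checkpoint was not loaded as intended.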

Arsiuuu commented 8 months ago

Am I using the correct MSR-VTT annotations? Please look at these. This is my train_9k.json: {"video0.mp4": ["a car is shown", "a group is dancing", "a man drives a vehicle through the countryside", "a man drives down the road in an audi", "a man driving a car", "a man is driving a car", "a man is driving down a road", "a man is driving in a car as part of a commercial", "a man is driving", "a man riding the car speedly in a narrow road", "a man showing the various features of a car", "a man silently narrates his experience driving an audi", "a person is driving his car around curves in the road", "a person telling about a car", "guy driving a car down the road", "man talking about a car while driving", "the man drives the car", "the man driving the audi as smooth as possible", "a man is driving", "guy driving a car down the road"], "video1.mp4": ["in a kitchen a woman adds different ingredients into the pot and stirs it", "a woman puts prawns and seasonings into a large pot on a stove", "in the kitchen a woman makes a dish by adding ingredients mixing and allowing to boil on flame", "a woman adding ingredients to a pot on the stove and stirring", "instructions on how to cook a dish of prawns or crayfish are given on screen while the chef prepares the dish", "a woman is in the kitchen making a recipe in a large pot with many ingredients", "a woman adds some packets of spices and spoonfuls of tomato sauce to a pot then stirs it and covers the pot", "a person add ingredients to a pot in a counter than stirs it", "a person puts items in a pot on the stove in the kitchen", "a woman cooking food with a metal pan on top of a stove", "a woman adds different ingredients into a a pot on the stove", "a woman in a kitchen is cooking a stew in a large pan on her stove", "a women in a multi-color outfit is cooking a stew type dish in a silver pot", "a woman adds ingredients to a pot that is simmering on a stove", "a woman is preparing a seafood stew recipe on a stove demonstrating each step herself while at the same time the easy to read directions", "in a kitchen a lady preferred crayfish with mixing of curry powder", "a woman and a bowl spoon mixing dish inside kitchen to prepare to serve to eat displaying on screen", "cooking the dried smoked prawn in a vessel having boiled water and the lied closed", "a lady is making dried prawns curry and she added tomato puree and salt in it", "a woman in a colorful scarf is showing how to make a stew"], "video2.mp4": ["a guying showing a tool",

This is my test_JSFUSION.json: {"video9770.mp4": ["a person is connecting something to system"], "video9771.mp4": ["a little girl does gymnastics"], "video7020.mp4": ["a woman creating a fondant baby and flower"], "video9773.mp4": ["a boy plays grand theft auto 5"], "video7026.mp4": ["a man is giving a review on a vehicle"], "video9775.mp4": ["a man speaks to children in a classroom"], "video9776.mp4": ["one micky mouse is talking to other"], "video7025.mp4": ["a naked child runs through a field"], "video9778.mp4": ["a little boy singing in front of judges and crowd"], "video9779.mp4":
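A quick sanity check on annotation files in this format is to count the videos in each split, confirm that no video appears in both, and look at how many captions each test video has. The sketch below runs on shortened excerpts of the two files quoted above, because the full files are truncated in this thread; `summarize_split` is a hypothetical helper, not part of the STAN repository.

```python
import json

# Shortened excerpts of the annotation files quoted in this thread.
train_json = (
    '{"video0.mp4": ["a car is shown", "a group is dancing"], '
    '"video1.mp4": ["a woman cooking food with a metal pan on top of a stove"]}'
)
test_json = (
    '{"video9770.mp4": ["a person is connecting something to system"], '
    '"video9771.mp4": ["a little girl does gymnastics"]}'
)


def summarize_split(train: dict, test: dict) -> dict:
    """Count videos per split, list overlapping IDs, and test caption counts."""
    return {
        "train_videos": len(train),
        "test_videos": len(test),
        "overlap": sorted(set(train) & set(test)),
        "test_captions_per_video": sorted({len(caps) for caps in test.values()}),
    }


summary = summarize_split(json.loads(train_json), json.loads(test_json))
print(summary)
```

On the full files, one would load `train_9k.json` and `test_JSFUSION.json` with `json.load` instead of the inline strings. This MSR-VTT split is commonly described as 9,000 training videos and 1,000 test videos with one caption each, so counts far from those figures, or a non-empty overlap, would be worth investigating.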

farewellthree commented 8 months ago

The split is right. Could you upload the log file or email it to me (ruyang@stu.pku.edu.cn)? It is saved under the work_dir.