-
Appreciate your efforts in maintaining this project!
When I ran zero-shot VQA inference (generating results) on the MSRVTT dataset, it took 28 hours to finish (using 4 A5000 GPUs). I recognize that…
-
Any chance of releasing the weights? I currently lack the compute to train this myself. Thanks!
-
-
Hi authors,
Thanks for the great work!
However, I cannot reproduce the numbers reported in the paper using your code. I use the **LLaVA-1.6-vicuna-7B** model.
Open-ended QA
| | MSVD-QA | MSRVT…
-
The article mentions that "where they randomly chose 5 ground-truth sentences per video. We use the same setting when we compare with that approach". Does the training set, validation set, and test set …
-
Hello,
Thank you for the repo, and well done on the project.
I have a question about whether, and how, it is possible to train on a single GPU.
-
Hi,
Thanks for your excellent work. I have a few questions from re-implementing CLIP4Clip on the MSR-VTT dataset.
Firstly, I changed `sim_header` to `seqTransf` to reproduce the best perf…
-
Hi, can you share the new training configs for MSRVTT?
I ran the reimplemented code without modifying anything except setting batch=32 on 4 machines (so total batch size = 128), and got a best **45.6000** retri…
-
Weights from the pretrained model cause errors in UCoFiA:
size mismatch for visual_token_selector.score_predictor.in_conv.0.weight: copying a param with shape torch.Size([512]) from checkpoint, the s…
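For what it's worth, a quick way to surface every mismatched tensor at once (rather than hitting them one error at a time) is to compare the checkpoint's parameter shapes against the model's own state dict before loading. Below is a minimal sketch using plain dicts of shape tuples to stand in for PyTorch state dicts; the parameter name and the sizes shown are illustrative, not the real UCoFiA values:

```python
def find_shape_mismatches(model_shapes, ckpt_shapes):
    """Return (name, model_shape, ckpt_shape) for every shared key whose shapes differ."""
    mismatches = []
    for name, ckpt_shape in ckpt_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and model_shape != ckpt_shape:
            mismatches.append((name, model_shape, ckpt_shape))
    return mismatches

# Illustrative shapes only -- not the actual UCoFiA parameter sizes.
model_shapes = {"visual_token_selector.score_predictor.in_conv.0.weight": (768,)}
ckpt_shapes = {"visual_token_selector.score_predictor.in_conv.0.weight": (512,)}

for name, m, c in find_shape_mismatches(model_shapes, ckpt_shapes):
    print(f"size mismatch for {name}: model {m} vs checkpoint {c}")
```

With real PyTorch objects one would build the dicts from `model.state_dict()` and `torch.load(path)`, and could then drop the mismatched keys and call `model.load_state_dict(filtered, strict=False)` as a workaround, though whether that is appropriate here depends on why the checkpoint and code disagree.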
-
Thanks for the code and documentation. I am running the captioning fine-tuning experiment on MSRVTT. During the evaluation stage, the code stops with an AssertionError [here](https://github.com/TXH-mer…