-
I wonder why batch_size is set to 1. Does a larger batch_size lead to worse results?
-
||link|
|----|---|
|paper| [HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips](https://openaccess.thecvf.com/content_ICCV_2019/papers/Miech_HowTo100M_Learni…
-
I tried to run and evaluate captioning with your pretrained model on the YouCookII dataset, but ran into this issue. I followed the instructions, but every hyp on eval_epoch is none and every score is 0.0 (do not…
-
Besides, how can we prepare the data files like *.label.tsv / *.caption.tsv / *.caption.linelist.tsv to train SwinBert on our own dataset? Thank you very much ~
-
Hi, I am trying to reproduce the reported performance on the MSVD dataset (CIDEr of 120.6), but there is a gap. In my experiment, the first evaluation after initialization is poor, the initializ…
-
I want to input only the text feature or only the video feature in UniVL. The paper says that one transformer combines the text representation **T** and the video representation **V**. Could you tell me how to cha…
-
I know it's a long shot, but has anyone downloaded the whole dataset and can tell me how many GB/TB I can expect it to be? Thank you.
-
Hi,
Thank you for sharing your impressive work! Equipping LLMs with temporal understanding is indeed a challenging task. I have a question regarding the ActivityNet results:
Are the scores you r…
-
Hello, there is an error at `-x` in the following code. Is it a problem with the numpy version?
```
import numpy as np
def compute_metrics(x):
    sx = np.sort(-x, axis=1)
    d = np.diag(-x)…
-
Hi, thank you for sharing your work and congratulations on the paper!
I am trying to use COOT to create video descriptions for videos that aren't in ActivityNet. I saw your [comment ](https://githu…