-
The HowTo100M + VidChapters-7M + ViTT model is performing poorly on dense video captioning.
Reproduction:
Run
```
yt-dlp -P $TRANSFORMERS_CACHE -o video.mp4 https://www.youtube.com/watch?v=WJ…
```
-
How do I add region descriptions to an image (grounded image descriptions)? I would like to annotate my images using bounding boxes for the different regions and two or more text descriptions for each…
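For context, here is the kind of annotation format I have in mind: a minimal, Visual Genome-style sketch in which the field names (`image_id`, `regions`, `bbox`, `descriptions`) are illustrative choices of mine, not a schema this repo defines:

```python
import json

# One region = one bounding box plus two or more free-text descriptions.
# All field names below are illustrative, not a required schema.
annotation = {
    "image_id": "img_0001.jpg",
    "regions": [
        {
            "bbox": [120, 80, 240, 160],  # [x, y, width, height] in pixels
            "descriptions": [
                "a brown dog lying on the grass",
                "a sleeping dog in the backyard",
            ],
        },
        {
            "bbox": [300, 40, 90, 210],
            "descriptions": [
                "a wooden fence post",
                "a weathered post at the garden edge",
            ],
        },
    ],
}

with open("img_0001.json", "w") as f:
    json.dump(annotation, f, indent=2)
```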
-
Thank you for the nice work!
Is it possible to use a larger ViT backbone for dense captioning?
Is there a reason that only a ViT-B backbone is available for dense captioning?
Thank you.
-
This seems like nice work. I wanted to test it on custom input videos. It would be very helpful if you could provide a script for generating video captions for a raw input video.
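For reference, this is roughly the script I have in mind, a minimal sketch that samples frames at about 1 FPS with OpenCV; the `load_model` / `generate` calls at the end are hypothetical placeholders, not this repo's actual API:

```python
import cv2  # pip install opencv-python

def sample_frames(video_path, fps=1.0):
    """Decode a raw video and keep roughly `fps` frames per second."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS is unreadable
    step = max(int(round(native_fps / fps)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames

# Hypothetical interface: replace with the repo's real loading and
# generation calls once an official inference script is available.
# model = load_model("checkpoint.pth")
# print(model.generate(sample_frames("my_video.mp4")))
```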
-
Thanks for the great work! I noticed that in the paper you mentioned that
_"We observe that the major limitation of the BLIP-CLIP evaluation is that the BLIP captioning models
do not always descr…
-
Hello, thank you for your work. I would like to ask why you think the task of synchronized subtitles is important. How can it help in action generation and action understanding?
-
## TL;DR
A study that achieves end-to-end video captioning with a Transformer-based model. On the encoder side, the model extracts the events (time ranges) to be captioned from the video; on the decoder side, it generates sentences with a mask applied over each event.
![image](https://user-images.githubusercontent.com/544269/5370…
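As a rough illustration of the masking idea (a conceptual sketch, not the paper's code), the decoder only conditions on encoder features inside the proposed time range:

```python
import torch

def event_mask(num_frames, start, end):
    """Binary temporal mask that keeps only the proposed event span [start, end)."""
    mask = torch.zeros(num_frames)
    mask[start:end] = 1.0
    return mask

# Example: a 100-frame clip with a proposed event covering frames 20..45.
frame_features = torch.randn(100, 512)            # encoder outputs, one per frame
mask = event_mask(100, 20, 45)
masked_features = frame_features * mask[:, None]  # zero out frames outside the event
# The decoder then generates the event's sentence from `masked_features` only.
```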
-
How might EasyAnimate slice a 1080p video? More specifically, at what frame interval does the slicing happen? I assume this relates to the memory requirements for resolutions lower than 1080p.
E…
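For illustration, this is the kind of fixed-interval slicing I am imagining (a generic sketch; `clip_len` and `stride` are made-up parameters, not EasyAnimate's actual configuration):

```python
def slice_video(num_frames, clip_len=16, stride=16):
    """Split a video of `num_frames` frames into fixed-length clips.

    Generic illustration of frame-interval slicing; the real clip length
    and stride used by EasyAnimate are config-dependent and unknown to me.
    """
    return [(start, start + clip_len)
            for start in range(0, num_frames - clip_len + 1, stride)]

print(slice_video(100))  # [(0, 16), (16, 32), (32, 48), (48, 64), (64, 80), (80, 96)]
```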
-
When will your group release the code and dataset for dense video object captioning?
-
Hello! Thank you so much for contributing this repo.
I'm very interested in this work, and I'm surveying papers with keywords like "captioning anything" or "instance-level captioning" or "per pi…