-
The link for [Region-Focused Network for Dense Captioning](https://github.com/VILAN-Lab/DesCap) is no longer accessible. Could you please share the code for this paper?
-
hi,
I had problems evaluating the MMT-Bench_VAL_MI, as shown below:
`abilities: ['navigation', 'image_matting', nan, 'meme_vedio_understanding', 'single_object_tracking', 'counting_by_visual_prompt…
-
Hello author, when I tested the provided timechat_7b.pth, the measured metrics were lower than the results reported in the paper. I fine-tuned TimeChat according to …
-
Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual responses. Recently, region-level LM…
-
The test dataset for evaluating the VG benchmark has a problem: in the original GRiT test set [https://datarelease.blob.core.windows.net/grit/VG_preprocessed_annotations/test.json](url), each image has m…
-
Thank you very much for your excellent work. While reproducing your results on the ActivityNet dataset for the video grounding task, I noticed that the system often threw exceptions, as shown in …
-
![samspace](https://user-images.githubusercontent.com/43663476/110859684-67df1f80-8281-11eb-95a5-d8607969c373.jpg)
* [Matt Daniels](https://twitter.com/mygunisquick)
* [Mike Elgan](https://twitter…
-
-
Hi authors,
I am new to this task and would like to ask a question about the evaluation metric in the 3D dense captioning domain, which seems somewhat contradictory across the several papers I checked.
In …
-
Hi, first I would like to extend my sincere gratitude for your work in this field. This is very interesting and exciting work. Below is my question:
It concerns the output of the timest…