-
-
I have a question about the preprocessing step: is the spatial language model in the code based on the four datasets (train, test, val, and split) of the WMT16 multimodal translation task, and what code comman…
-
Is there any way to bypass the data-preprocessing step for MBT ("Attention Bottlenecks for Multimodal Fusion") if I only want to run inference without passing in the actual data from AS? I notice the m…
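One common way to exercise a model without the real preprocessed data is to feed dummy tensors of the expected shapes. The sketch below is purely illustrative: the dictionary keys, frame count, image resolution, and spectrogram dimensions are my assumptions, not the repo's actual preprocessing output.

```python
import numpy as np

def make_dummy_batch(batch_size=1):
    """Build a fake multimodal batch for a smoke-test forward pass.

    Shapes are hypothetical stand-ins for an AudioSet-style setup:
    a short RGB clip plus a log-mel spectrogram. Adjust them to
    whatever the actual model config expects.
    """
    rgb = np.random.rand(batch_size, 8, 224, 224, 3).astype(np.float32)
    spec = np.random.rand(batch_size, 800, 128).astype(np.float32)
    return {"rgb": rgb, "spectrogram": spec}

batch = make_dummy_batch()
print(batch["rgb"].shape, batch["spectrogram"].shape)
```

Running the model on such a batch only verifies that shapes and the forward pass line up; the predictions themselves are of course meaningless.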
-
Dear Tada,
I am wondering whether OpenFace has been used to identify the facial landmarks of children. I assume that children's facial landmarks can be identified with OpenFace, but I am not sure about t…
-
### Describe the issue
When will the llava-1.6 training dataset and training code be open-sourced?
Hello, I'm glad to see that the performance of llava-1.6 has improved significantly. I believe i…
-
- [ ] NQ (https://ai.google.com/research/NaturalQuestions/dataset)
- [ ] TriviaQA (https://nlp.cs.washington.edu/triviaqa/)
- [ ] HotpotQA
- [ ] DROP
-
Hi, nice work!
Do you have a plan to release the evaluation code for SHOW-1 on UCF-101 and MSRVTT? If you can open-source the evaluation code, I believe future work can be fairly compared to sh…
-
Hello,
I'm working on reproducing the results in your paper "Attention Bottlenecks for Multimodal Fusion" and trying to implement MBT for other audiovisual video classification tasks.
However, the pr…
-
Hi,
Thank you very much for releasing the source code of your work. I noticed that you use CheXpert for the multimodal pre-training of your model. However, as far as I'm aware, the CheXpert dataset doe…
-
### Describe your use-case.
This repository uses multiple simple models: BLIP, CLIP, and WD taggers. However, when it comes to detailed descriptions, they are all dwarfed by modern multi…