-
We currently have `multimodal_chat_dataset`, which is great for conversations about an image, but many VQA datasets are structured more like instructions, where there is a question column, answer column, and…
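To make the request concrete, here is a minimal sketch of the mapping I have in mind (the `to_messages` helper, the column names, and the message schema are hypothetical, not an existing API): remap question/answer columns into the conversation format that `multimodal_chat_dataset` already consumes.
```python
# Hypothetical sketch only: convert an instruct-style VQA row (question,
# answer, image columns) into a two-turn conversation. Column names and the
# message schema below are assumptions, not an existing builder.
from datasets import load_dataset

def to_messages(sample):
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image"},  # the image itself stays in sample["image"]
                {"type": "text", "content": sample["question"]},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "content": sample["answer"]},
            ]},
        ]
    }

ds = load_dataset("your/vqa-dataset", split="train")  # placeholder dataset id
ds = ds.map(to_messages)
```
A builder that wraps this mapping and exposes `question_col`/`answer_col` arguments would cover most instruction-style VQA sets.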
-
## Issue
I keep getting `nan` loss when training Llama-3.2-Vision.
I tried:
- gradient clipping (sketch below)
- a lower learning rate
- a higher batch size, LoRA rank, and alpha
None of these helped.
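For reference, the gradient clipping I tried looks roughly like this (a sketch; `model`, `optimizer`, and `batch` are placeholders for the recipe's objects, and the finiteness check is only there to confirm the loss is already `nan` before backward):
```python
import torch

# Sketch of the guarded training step: clip gradients before the update and
# log when the loss is already non-finite coming out of the forward pass.
loss = model(**batch).loss
if torch.isfinite(loss):
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
else:
    print("loss is non-finite before backward; skipping this batch")
optimizer.zero_grad()
```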
## …
-
Hi,
I encountered the following issue when I ran the `train.sh` file under `src/experiments/vqa/`:
```shell
Traceback (most recent call last):
  File "train.py", line 24, in <module>
    from dataGe…
```
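In case it helps narrow this down, here is a sketch of the path workaround I would try (assuming `train.py` sits at `<repo>/src/experiments/vqa/train.py`; the layout is my guess from the path above):
```python
# Sketch: make the repository root importable before the failing import runs.
import sys
from pathlib import Path

repo_root = Path(__file__).resolve().parents[3]  # vqa -> experiments -> src -> repo
sys.path.insert(0, str(repo_root))
```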
-
## Title: WorldCuisines: A Large-Scale Benchmark Dataset for Multilingual and Multicultural Visual Question Answering on World Cuisines
## Link: https://arxiv.org/abs/2410.12705
## Abstract:
Vision-language models (VLMs) often struggle to understand culture-specific knowledge, especially in languages other than English and in underrepresented cultural contexts. To evaluate VLMs' understanding of such knowledge…
-
Dear Maintainers,
I'm currently trying to reproduce the zero-shot results of InstructBLIP. The caption of Table 5 says that for datasets with OCR tokens, the image query embeddings are simply append…
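To check my reading, here is the operation I understand the caption to describe, as a sketch with dummy tensors (shapes are illustrative, not taken from the paper):
```python
import torch

# Dummy shapes for illustration only.
batch, n_query, n_ocr, dim = 2, 32, 16, 768
query_embeds = torch.randn(batch, n_query, dim)  # Q-Former image query output
ocr_embeds = torch.randn(batch, n_ocr, dim)      # embedded OCR tokens

# "Simply appended": concatenated along the sequence dimension before the LLM?
llm_inputs = torch.cat([query_embeds, ocr_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([2, 48, 768])
```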
-
Thank you very much for your work!
How do you evaluate open-ended questions, for example on the VQA-RAD dataset? Do you use LLaVA's prompt template?
Thanks!
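For reference, this is the kind of LLaVA-style prompt I have in mind (a sketch; the exact system prompt and separators depend on the LLaVA version):
```python
# Sketch of a LLaVA v1-style prompt for an open-ended question; the sample
# question below is illustrative only, not taken from VQA-RAD.
def build_prompt(question: str) -> str:
    system = ("A chat between a curious human and an artificial intelligence "
              "assistant. The assistant gives helpful, detailed, and polite "
              "answers to the human's questions.")
    return f"{system} USER: <image>\n{question} ASSISTANT:"

print(build_prompt("What abnormality is seen in this chest X-ray?"))
```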
-
### Which component is impacted?
Video Processing
### Is it a regression? Did it work in an old configuration?
Yes, it worked in the old version.
### What happened?
Corrupted output file.
-------------- Reproducibl…
-
Hello Author:
I have recently reproduced your paper, and with the dataset you provided I get the following VQA 2.0 results:
`{'number': 50.91, 'other': 59.45, 'overall': 69.13, 'yes/no': 85.29}`
The result is a …
-
Hello,
Thanks for your great work!
Has the code for video-mamba-suite on EgoSchema been released?