-
All the question prompts are extracted from DocStruct4M's 'multi_grained_text_localization.jsonl', as shown below:
```
[
"Give the bounding box of the text",
"Predict the bounding box of the text",
…
```
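For reference, a minimal sketch of how such prompts could be collected from the JSONL file. The record layout assumed here (a `messages` list with `role`/`content` fields) is an illustration, not the actual DocStruct4M schema; adjust the field names to the real data.

```python
import json

# Collect the distinct question prompts from the JSONL file.
# NOTE: the "messages"/"role"/"content" layout is an assumption
# for illustration; adapt it to the real record structure.
prompts = set()
with open("multi_grained_text_localization.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        for message in record.get("messages", []):
            if message.get("role") == "user":
                prompts.add(message["content"])

for prompt in sorted(prompts):
    print(prompt)
```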
-
Hello,
As I was meticulously reading a paper, I found myself confused about the section on 'projectors.'
Background: From what I understand so far, in the case of CLIP ViT Large, despite the com…
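For context, here is a minimal sketch of the kind of projector usually meant in this setting: a small MLP that maps CLIP ViT-L patch features into the LLM embedding space. The two-layer GELU design and the sizes (1024 for CLIP ViT-L/14, 4096 for a typical 7B LLM) are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps vision-encoder patch features to LLM token embeddings.

    Sizes are illustrative: 1024 is the CLIP ViT-L/14 hidden size,
    4096 is a typical 7B-LLM embedding size.
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        # returns:        (batch, num_patches, llm_dim)
        return self.proj(patch_features)
```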
-
LLaVA supports multiple images by default. What happens if we send the (T, N, D) tokens into the LLM without any aggregation?
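As I read the question, "without any aggregation" would simply mean flattening the T per-image token sequences along the sequence axis before they are concatenated with the text embeddings. A rough sketch of that reading (the shapes and the flattening step are my assumption, not LLaVA's actual code):

```python
import torch

# T images, N visual tokens each, already projected to the LLM dim D.
T, N, D = 4, 576, 4096
image_tokens = torch.randn(T, N, D)          # (T, N, D)
text_embeds = torch.randn(32, D)             # (L_text, D)

# "No aggregation": flatten all T*N visual tokens into one sequence
# and prepend them to the text tokens, so the LLM sees T*N + L_text tokens.
visual_seq = image_tokens.reshape(T * N, D)  # (T*N, D)
llm_input = torch.cat([visual_seq, text_embeds], dim=0)
print(llm_input.shape)  # torch.Size([2336, 4096])
```

The obvious trade-off is context length: the visual part of the sequence grows linearly with the number of images, which is presumably why some form of aggregation is often considered.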
-
With [ml-ferret](https://github.com/apple/ml-ferret) out, it would be great to include an MLLM example in this repo, namely with ml-ferret or just LLaVA itself. Being LLaMA-based, I think this would …
-
**FairCLIP: Harnessing Fairness in Vision-Language Learning**
Paper Link: https://arxiv.org/abs/2403.19949
Code Link: https://github.com/Harvard-Ophthalmology-AI-Lab/FairCLIP
another paper on A…
-
### Checklist
- [X] I have searched the [existing issues](https://github.com/streamlit/streamlit/issues) for similar issues.
- [X] I added a very descriptive title to this issue.
- [X] I have provide…
-
Dear CogVLM authors,
Thank you for your outstanding work on MLLMs.
Could you share a rough estimate of the time required to fine-tune or train the model?
```
Hardware requirement
Model In…
```
-
Curious whether MLLMs can work on it. I already suspect LLAMA V1.5 can't. I'd suggest checking out more efficient MLLM models like X-LLM.
-
The idea of this work is very interesting!
However, I have two points of confusion about the method:
(1) What is the ground-truth caption of the image in Fig. 2? Is the word "feather" correct? (I am not sure…
-
As the title says: if images are already cut into that many crops during pretraining, won't the training cost be somewhat hard to cover?
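To make the cost concern concrete, a back-of-envelope count of visual tokens per image as a function of crop count. The 576-tokens-per-crop figure assumes a ViT-L/14 encoder at 336×336 (24×24 patches); the crop counts are just examples, not values from any specific recipe.

```python
# Rough visual-token count per image as a function of crop count.
TOKENS_PER_CROP = 576  # ViT-L/14 @ 336x336 -> 24x24 patches (assumed)

for num_crops in (1, 4, 9, 16):
    # one global view plus num_crops local crops
    total = TOKENS_PER_CROP * (1 + num_crops)
    print(f"{num_crops:2d} crops -> {total:5d} visual tokens per image")
```

Since self-attention cost grows roughly quadratically with sequence length, cutting every pretraining image into many crops multiplies the compute accordingly, which is presumably the concern raised above.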