-
This issue is an overview of tasks to add for a massive multimodal extension of MTEB. The modalities are:
- T=Text
- I=Image
- A=Audio
- V=Video without audio, i.e. just multiple images
Below is…
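As a rough sketch of how these codes might combine to tag tasks (purely illustrative; the task names and structure below are assumptions, not MTEB's actual metadata schema):

```python
# Purely illustrative: attaching the T/I/A/V codes to tasks as
# (query modalities -> corpus modalities). Not MTEB's actual schema.
MODALITIES = {"T": "text", "I": "image", "A": "audio", "V": "video (image frames only)"}

# Hypothetical task names, used only for this sketch.
TASK_MODALITIES = {
    "SomeTextToImageRetrieval":  ({"T"}, {"I"}),   # caption -> image corpus
    "SomeAudioTextClustering":   ({"A", "T"}, set()),  # paired audio + transcript
    "SomeVideoCaptionRetrieval": ({"T"}, {"V"}),   # caption -> silent video clips
}

for name, (query, corpus) in TASK_MODALITIES.items():
    q = "+".join(sorted(query))
    c = "+".join(sorted(corpus)) or "-"
    print(f"{name}: {q} -> {c}")
```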
-
According to the README, this is the command for training:
```
(llama3-ft) python train.py --dataset_path path/to/dataset.json --output_dir path/to/output_dir --text_model_id="meta-llama/Meta-Llama-3-8B-I…
```
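For reference, a `train.py` with that interface would typically parse those flags along these lines; only the three flag names come from the command above, everything else in this sketch is an assumption:

```python
# Minimal sketch of the CLI surface implied by the README command.
# Only --dataset_path, --output_dir and --text_model_id appear in the snippet;
# help strings and any further behaviour are assumptions.
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Fine-tune a text model on a JSON dataset")
    parser.add_argument("--dataset_path", required=True, help="Path to dataset.json")
    parser.add_argument("--output_dir", required=True, help="Where checkpoints are written")
    parser.add_argument("--text_model_id", required=True, help="Hugging Face model id, e.g. a Llama 3 checkpoint")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"Training {args.text_model_id} on {args.dataset_path} -> {args.output_dir}")
```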
-
## Description
I'm looking to do my dissertation on the topic of "Expanding AutoGluon-Multimodal to Incorporate Audio: Enhancing AutoML with Voice Data for Multimodal Machine Learning"
I was wonde…
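For context, AutoGluon-Multimodal is driven roughly as in the sketch below; the `audio_path` column here is purely hypothetical and stands in for the kind of voice input the proposed extension would need to handle:

```python
# Rough sketch of how AutoGluon-Multimodal is used today, with a hypothetical
# "audio_path" column standing in for the proposed voice modality.
import pandas as pd
from autogluon.multimodal import MultiModalPredictor

train_df = pd.DataFrame({
    "text": ["hello there", "goodbye"],
    "audio_path": ["clips/a.wav", "clips/b.wav"],  # hypothetical: not handled as audio today
    "label": [0, 1],
})

# MultiModalPredictor infers per-column modalities; an audio extension would need
# to add detection for audio columns plus an audio backbone (e.g. a wav2vec-style encoder).
predictor = MultiModalPredictor(label="label")
predictor.fit(train_df, time_limit=60)
```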
-
### Your current environment
I am running vllm serve with a multimodal model (Phi3.5K). How do I run benchmark_serving.py to test the multimodal path?
In the benchmark_serving.py file I see the following, but test_mm…
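As a quick sanity check that the multimodal path is being exercised at all (separate from benchmark_serving.py), one can hit the OpenAI-compatible endpoint that `vllm serve` exposes with an image in the chat payload; the model name, port, and image URL below are placeholders:

```python
# Manual check of the multimodal path via the OpenAI-compatible API exposed by
# `vllm serve`; model name, port and image URL are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="microsoft/Phi-3.5-vision-instruct",  # placeholder: use the served model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```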
-
After pretraining the model on WebVid, the MSRVTT evaluation results dropped to below 1%. Similarly, when pretraining from the provided pretrained weights, the results also dropped below 1% after the …
-
**Describe the feature**
I have noticed that not all multimodal models available here in ms-swift support multi-image input, and if they do, the training code might not support it. It is also the case with mix te…
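As a concrete illustration of the request, a multi-image sample mixed with a text-only sample in one JSONL training file could look like the following; the field names are assumptions and may differ from what ms-swift actually expects:

```python
# Illustrative only: what multi-image and mixed text/image training samples
# could look like. Field names are assumptions, not ms-swift's documented schema.
import json

samples = [
    {  # multi-image sample: one question referring to two images
        "query": "<image><image> What changed between these two pictures?",
        "response": "The second picture has an extra chair.",
        "images": ["imgs/room_before.jpg", "imgs/room_after.jpg"],
    },
    {  # text-only sample mixed into the same dataset
        "query": "Summarize the previous answer in five words.",
        "response": "An extra chair was added.",
        "images": [],
    },
]

with open("mixed_multimodal.jsonl", "w") as f:
    for s in samples:
        f.write(json.dumps(s, ensure_ascii=False) + "\n")
```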
-
Multimodal has been removed since https://github.com/ggerganov/llama.cpp/pull/5882
Depending on the refactoring of `llava`, we will be able to bring back support: https://github.com/ggerganov/lla…
-
Hello,
I've been trying to train qwen2 0.5B and tinyclip using the repository, but I'm running into CUDA OOM issues on the dense2dense distillation step. I'm running on 4×80GB A100s, and I was wondering if I …
-
### Is your feature request related to a problem? Please describe.
In version 20.11.0, ALVR added Multimodal tracking support, allowing fingers to be tracked while holding the controllers (both fing…
-
We currently have `multimodal_chat_dataset`, which is great for conversations about an image, but many VQA datasets are structured more like instructions, where there is a question column, an answer column, and…
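For illustration, the gap is roughly the difference between these two row layouts; the column names below are placeholders, and the conversion shown is just one possible way an instruct-style builder could map such rows into the conversation format a chat-style multimodal dataset consumes:

```python
# Sketch of the two dataset shapes. Column names ("question", "answer", "image")
# are placeholders for whatever a given VQA dataset actually uses.

# Instruct-style VQA row, as found in many HF datasets:
vqa_row = {
    "question": "How many dogs are in the photo?",
    "answer": "Two.",
    "image": "images/0001.jpg",
}

def to_chat_messages(row: dict) -> list[dict]:
    """One possible mapping from a question/answer/image row into a
    chat-style conversation; the exact message schema is an assumption."""
    return [
        {"role": "user", "content": [
            {"type": "image", "path": row["image"]},
            {"type": "text", "text": row["question"]},
        ]},
        {"role": "assistant", "content": [{"type": "text", "text": row["answer"]}]},
    ]

print(to_chat_messages(vqa_row))
```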