csuhan / OneLLM

[CVPR 2024] OneLLM: One Framework to Align All Modalities with Language

Inference inputs multiple modalities other than text at once #21

Open xxrbudong opened 3 months ago

xxrbudong commented 3 months ago

Hello, I would like to ask: the current code seems to support only one non-text modality plus text per inference. Is it possible to input multiple modalities (such as audio, video, and text) in a single inference call?

csuhan commented 2 months ago

The current model is not trained on joint multimodal data, so it may not perform well at test time.

Cece1031 commented 1 month ago

> The current model is not trained on joint multimodal data, so it may not perform well at test time.

But I see you ran evaluations on Music-AVQA in the paper. Could you tell me how you managed to use three modalities to generate answers? Thank you very much!
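For context, the joint-modality setup being asked about can be sketched as follows. This is a minimal illustration of concatenating token embeddings from several per-modality encoders into one sequence for the LLM; it is not OneLLM's actual API, and all function names, shapes, and inputs here are hypothetical:

```python
import numpy as np

DIM = 8  # illustrative embedding dimension

def encode(modality_input: str, num_tokens: int) -> np.ndarray:
    """Stand-in for a per-modality encoder that maps raw input
    to a fixed number of token embeddings (random, for illustration)."""
    rng = np.random.default_rng(abs(hash(modality_input)) % (2**32))
    return rng.standard_normal((num_tokens, DIM))

# One encoder call per modality, then concatenate along the sequence axis
# so the language model sees a single interleaved token sequence.
audio_tokens = encode("audio.wav", num_tokens=4)
video_tokens = encode("video.mp4", num_tokens=6)
text_tokens  = encode("What instrument is playing?", num_tokens=5)

joint_sequence = np.concatenate([audio_tokens, video_tokens, text_tokens], axis=0)
print(joint_sequence.shape)  # (15, 8)
```

The open question in this thread is not the mechanics of concatenation but whether a model trained on one-modality-plus-text pairs generalizes to such joint sequences at test time.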