We evaluate our pre-trained models on multimodal captioning and generation, as these tasks correspond to the pre-training objectives. Since our instruction dataset, AnyInstruct, does not incorporate a general VQA dataset, we do not evaluate on VQA tasks. AnyInstruct instead focuses on general dialogue with arbitrary modality combinations, to demonstrate that multiple modalities can be compatible within a single model. The corresponding capabilities are showcased in the demo at https://junzhan2000.github.io/AnyGPT.github.io/.