Closed — crqcrq001 closed this issue 10 months ago
The format is the same as in ALBEF and VindLU.
```python
image_data = [
    {'image': image_path, 'caption': caption_content},
    {'image': image_path, 'caption': caption_content},
]
video_data = [
    {'video': video_path, 'caption': caption_content},
    {'video': video_path, 'caption': caption_content},
]
```
You can find the example here.
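A minimal sketch of producing such an annotation file, assuming the lists are serialized as JSON (the file names, paths, and captions below are placeholders, not from the repo):

```python
import json

# Hypothetical placeholder entries -- substitute your own
# media paths and captions before training.
image_data = [
    {"image": "images/0001.jpg", "caption": "a dog running on grass"},
    {"image": "images/0002.jpg", "caption": "a bowl of fresh fruit"},
]
video_data = [
    {"video": "videos/0001.mp4", "caption": "a person riding a bicycle"},
    {"video": "videos/0002.mp4", "caption": "waves crashing on a beach"},
]

# Save each list as a JSON annotation file.
with open("image_annotations.json", "w") as f:
    json.dump(image_data, f, indent=2)
with open("video_annotations.json", "w") as f:
    json.dump(video_data, f, indent=2)

# Sanity check: reload and confirm every entry has the expected keys.
with open("image_annotations.json") as f:
    loaded = json.load(f)
assert all({"image", "caption"} <= set(entry) for entry in loaded)
print(len(loaded))  # → 2
```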
Sorry, I am new to this repo. If I want to reproduce stage 1, how can I prepare the training dataset? Similar question as
https://github.com/OpenGVLab/Ask-Anything/issues/46