The Crossmodal-3600 dataset (XM3600 for short) is a geographically diverse set of 3600 images annotated with human-generated reference captions in 36 languages. The images were selected from across the world, covering regions where the languages are spoken, and annotated with captions that are stylistically consistent across all languages while avoiding annotation artifacts due to direct translation. The languages covered in the dataset include Filipino, Indonesian, Thai, and Vietnamese.
Dataloader name: xm3600/xm3600.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?xm3600