Create dataset loader for MaXM

SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.

Apache License 2.0


Create dataset loader for MaXM #425

Closed SamuelCahyawijaya closed 3 months ago

SamuelCahyawijaya commented 6 months ago

Dataloader name: maxm/maxm.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?maxm

Dataset	maxm
Description	MaXM is a test-only VQA benchmark in 7 diverse languages, including Thai. The dataset is generated by first applying a translation-based framework to mVQA and then applying that framework to the multilingual captions in the Crossmodal-3600 dataset.
Subsets	MaXM v1 -th
Languages	tha
Tasks	Question Answering
License	Other (other)
Homepage	https://github.com/google-research-datasets/maxm
HF URL	-
Paper URL	https://aclanthology.org/2023.findings-emnlp.176

akhdanfadh commented 6 months ago

Hi, the dataset is organized as follows:

dataset                 str: dataset name
version                 str: dataset version
split                   str: language ID
annotations             List of image-question-answers triplets, each of which is
-- image_id             str: image ID
-- image_url            str: image URL
-- qa_pairs             List of question-answer pairs, each of which is
---- question_id        str: question ID
---- question           str: raw question
---- answers            List of str: ground-truth answers
---- processed_answers  List of str: processed ground-truth answers. 16 tokenized answers.
---- is_collection      bool: "true" if the question is of the "Collection" type; "false" otherwise.
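
To make the nesting concrete, here is a minimal Python sketch that flattens this layout into one row per question. The field names come from the listing above; the function name `iter_questions` and the sample values are hypothetical and not taken from the MaXM files.

```python
from typing import Any, Iterator


def iter_questions(data: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield one flat row per question-answer pair in a MaXM-style dict."""
    for annotation in data["annotations"]:
        for qa in annotation["qa_pairs"]:
            yield {
                # Image-level fields are repeated on every question row.
                "image_id": annotation["image_id"],
                "image_url": annotation["image_url"],
                "question_id": qa["question_id"],
                "question": qa["question"],
                "answers": qa["answers"],
                "processed_answers": qa["processed_answers"],
                "is_collection": qa["is_collection"],
            }


# Hypothetical sample that follows the layout above (all values are made up).
sample = {
    "dataset": "maxm",
    "version": "v1",
    "split": "th",
    "annotations": [
        {
            "image_id": "img-0",
            "image_url": "https://example.com/img-0.jpg",
            "qa_pairs": [
                {
                    "question_id": "q-0",
                    "question": "first question",
                    "answers": ["a"],
                    "processed_answers": ["a"],
                    "is_collection": False,
                },
                {
                    "question_id": "q-1",
                    "question": "second question",
                    "answers": ["b", "c"],
                    "processed_answers": ["b", "c"],
                    "is_collection": True,
                },
            ],
        }
    ],
}

rows = list(iter_questions(sample))
```

Each image can carry several questions, so the flattened rows outnumber the images; here one image yields two rows.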

In question answering schema, the features are:

id             (str)
question_id    (str)
document_id    (str)
question       (str)
type           (str)
choices        (list[str])
context        (str)
answer         (list[str])
meta           (dict[Any])
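
For reference, the feature list above can be written as Python types. This is only an illustrative sketch: `QARecord` and `missing_fields` are hypothetical names, not part of seacrowd-datahub, and `dict[Any]` is read here as a string-keyed dict with arbitrary values.

```python
from typing import Any, TypedDict


class QARecord(TypedDict):
    """The question answering features listed above, as Python types."""

    id: str
    question_id: str
    document_id: str
    question: str
    type: str
    choices: list[str]
    context: str
    answer: list[str]
    meta: dict[str, Any]


def missing_fields(record: dict[str, Any]) -> list[str]:
    """Return the schema fields absent from `record`, in schema order."""
    return [name for name in QARecord.__annotations__ if name not in record]


partial = {"id": "0", "question": "example question"}
gaps = missing_fields(partial)
```

A check like this only confirms that the keys are present; it does not validate the value types.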

Should I assign is_collection to type, context, or inside meta?
Also, should I put image_id or image_url for the document_id?

github-actions[bot] commented 6 months ago

Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

akhdanfadh commented 6 months ago

Hmm, I think I need to mention people to get a faster response: @sabilmakbar @holylovenia

holylovenia commented 5 months ago

I didn't realize I missed so many mentions from you. 😭 Sorry!!

Could you please use Tasks.VISUAL_QUESTION_ANSWERING? It employs the imqa schema.

> Should I assign is_collection to type, context, or inside meta?

Inside meta would be perfect. type is typically open-ended, multiple-choice, extractive, abstractive, etc.

> Also, should I put image_id or image_url for the document_id?

document_id is related to the context (if there is one).
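
A rough sketch of how these answers might translate into a record, using the question answering feature list quoted earlier in the thread. The imqa schema may define different fields, so treat the keys as placeholders. Several choices here are assumptions rather than anything stated in the thread: `to_record` is a hypothetical helper, document_id and context are left empty because there is no text context, the default type is "open-ended", and the image fields are kept in meta alongside is_collection.

```python
from typing import Any


def to_record(idx: int, row: dict[str, Any], qa_type: str = "open-ended") -> dict[str, Any]:
    """Map one flattened MaXM question row to a QA-style record."""
    return {
        "id": str(idx),
        "question_id": row["question_id"],
        "document_id": "",  # assumption: empty, since there is no text context
        "question": row["question"],
        "type": qa_type,  # assumption: the thread does not fix a value
        "choices": [],
        "context": "",
        "answer": row["answers"],
        "meta": {
            "is_collection": row["is_collection"],  # placed in meta, per the reply
            "image_id": row["image_id"],  # assumption: kept in meta
            "image_url": row["image_url"],  # assumption: kept in meta
        },
    }


# Hypothetical input row (values are made up).
row = {
    "image_id": "img-0",
    "image_url": "https://example.com/img-0.jpg",
    "question_id": "q-0",
    "question": "example question",
    "answers": ["a", "b"],
    "is_collection": False,
}

record = to_record(0, row)
```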