SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #269 | Create dataset loader for ViVQA #269 #318

Closed: Gyyz closed this 7 months ago

Gyyz commented 9 months ago

Closes #269


danjohnvelasco commented 8 months ago

Hi @Gyyz 👋, thank you for the PR! I've tested your code and everything is working as expected. Just one small thing: could you run make check_file=seacrowd/sea_datasets/vivqa/vivqa.py and push the formatted code? Thanks!
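
For readers unfamiliar with the make target, the terminal output later in this thread shows it invoking black, isort, and flake8 on the given file. The sketch below simply rebuilds those three command lines in Python; the helper name check_file_commands is hypothetical, and the flags are copied from that output rather than from the Makefile itself, so treat them as an assumption about what the target does.

```python
import shlex
from typing import List


def check_file_commands(path: str) -> List[List[str]]:
    """Rebuild the commands seen in the log for `make check_file=<path>`.

    Hypothetical helper: the flags mirror the terminal output in this thread,
    not the project's Makefile, which is not shown here.
    """
    return [
        ["black", "--line-length", "250", "--target-version", "py38", path],
        ["isort", path],
        ["flake8", path, "--max-line-length", "250"],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so nothing is reformatted by accident.
    for cmd in check_file_commands("seacrowd/sea_datasets/vivqa/vivqa.py"):
        print(shlex.join(cmd))
```

To actually run them, each list could be passed to subprocess.run(cmd, check=True), provided the three tools are installed in the active environment.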

holylovenia commented 8 months ago

@Gyyz @danjohnvelasco This dataloader's task is supposed to be VQA instead of text-only QA. Could we adjust the schema to imqa and the supported task to Tasks.VISUAL_QUESTION_ANSWERING by incorporating #380?
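
The requested change can be pictured as a mapping from a source-schema row to an imqa-schema row. The following is a minimal sketch, not the code in this PR: the function name to_imqa is hypothetical, and the field layout is inferred from the two sample records printed in the test output further down this thread.

```python
from typing import Any, Dict


def to_imqa(idx: int, row: Dict[str, Any]) -> Dict[str, Any]:
    """Map one source-schema row to an imqa-style record (hypothetical helper)."""
    # Carry over every source field except the QA text and the raw image id.
    meta = {k: v for k, v in row.items() if k not in ("question", "answer", "img_id")}
    # The sample output stores the image id under the key "coco_img_id".
    meta["coco_img_id"] = row["img_id"]
    return {
        "id": str(idx),
        "question_id": str(idx),
        "document_id": str(idx),
        "questions": [row["question"]],
        "type": None,
        "choices": None,
        "context": None,
        "answer": [row["answer"]],
        "image_paths": [row["image_path"]],
        "meta": meta,
    }


if __name__ == "__main__":
    # Abbreviated source row; the image path here is a placeholder.
    sample = {
        "img_id": "68857",
        "question": "màu của chiếc bình là gì",
        "answer": "màu xanh lá",
        "type": "2",
        "image_path": "/path/to/train2014/COCO_train2014_000000068857.jpg",
    }
    print(to_imqa(0, sample))
```

The question, answer, and image path are wrapped in lists because the imqa sample in the test output shows them as lists, while type, choices, and context are None for this dataset.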

Gyyz commented 8 months ago

@danjohnvelasco I am in China on vacation and don't have a good network connection to access GitHub. I will update the repo when I am back after February 18.

(Quoted email from Holy Lovenia, Thursday, 25 January 2024 at 13:05, repeating the request above to change the schema to imqa and the supported task to Tasks.VISUAL_QUESTION_ANSWERING by incorporating #380.)

github-actions[bot] commented 7 months ago

Hi @ & @, may I know if you are still working on this PR?

Gyyz commented 7 months ago

Hi @ & @, may I know if you are still working on this PR?

It seems I have already addressed the request in my latest commit. Could you take a look? ☺️

danjohnvelasco commented 7 months ago

LGTM! The code is running on my end. Let's wait for @holylovenia's approval before merging.

Gyyz commented 7 months ago

Hi @Gyyz, thank you for your contribution! Could you please take a look at my suggestions?

Sure, I will update it shortly.

Gyyz commented 7 months ago

Hi @Gyyz, please let us know once the PR is ready for another round of review. 🙏

Sorry, the dataloader is ready. I am testing the local image path, and I need a fast network connection to download the 13 GB of data.
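
As an illustration of what "local image path" means here, the sample record in the test output pairs img_id 68857 with the file train2014/COCO_train2014_000000068857.jpg. A minimal sketch of that naming rule follows; the helper name coco_image_path is hypothetical, and defaulting to the train2014 split is an assumption based on that single sample, not something this thread confirms for every image.

```python
from pathlib import Path


def coco_image_path(image_root: str, img_id: str, split: str = "train2014") -> Path:
    """Build the expected local path of a COCO 2014 image (hypothetical helper)."""
    # COCO 2014 file names zero-pad the numeric image id to 12 digits.
    file_name = f"COCO_{split}_{int(img_id):012d}.jpg"
    return Path(image_root) / split / file_name


if __name__ == "__main__":
    # "/data/coco" is a placeholder for wherever the archive was extracted.
    print(coco_image_path("/data/coco", "68857"))
```

A path built this way could be checked with Path.exists() before it is written into a record, which would catch a partially downloaded archive early.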

Gyyz commented 7 months ago

The scripts passed; please check.

(ani) 
# yuz @ LLL in ~/workspace/seacrowd-datahub on git:vivqa x [10:58:01] 
$ python -m tests.test_seacrowd seacrowd/sea_datasets/vivqa/vivqa.py
INFO:__main__:args: Namespace(path='seacrowd/sea_datasets/vivqa/vivqa.py', schema=None, subset_id=None, data_dir=None, use_auth_token=None)
INFO:__main__:self.PATH: seacrowd/sea_datasets/vivqa/vivqa.py
INFO:__main__:self.SUBSET_ID: vivqa
INFO:__main__:self.SCHEMA: None
INFO:__main__:self.DATA_DIR: None
INFO:__main__:Checking for _SUPPORTED_TASKS ...
module seacrowd.sea_datasets.vivqa.vivqa
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.VISUAL_QUESTION_ANSWERING: 'VQA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'IMQA'}
INFO:__main__:schemas_to_check: {'IMQA'}
INFO:__main__:Checking load_dataset with config name vivqa_source
/Users/yuz/anaconda3/envs/ani/lib/python3.10/site-packages/datasets/load.py:2479: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
/Users/yuz/anaconda3/envs/ani/lib/python3.10/site-packages/datasets/load.py:922: FutureWarning: The repository for vivqa contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/vivqa/vivqa.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
INFO:__main__:Checking load_dataset with config name vivqa_seacrowd_imqa
/Users/yuz/anaconda3/envs/ani/lib/python3.10/site-packages/datasets/load.py:2479: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
/Users/yuz/anaconda3/envs/ani/lib/python3.10/site-packages/datasets/load.py:922: FutureWarning: The repository for vivqa contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/vivqa/vivqa.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
INFO:__main__:Dataset sample [source]
{'img_id': '68857', 'question': 'màu của chiếc bình là gì', 'answer': 'màu xanh lá', 'type': '2', 'coco_url': 'http://images.cocodataset.org/train2014/COCO_train2014_000000068857.jpg', 'flickr_url': 'http://farm2.staticflickr.com/1433/5167151338_3e40df8f91_z.jpg', 'img_name': 'COCO_train2014_000000068857.jpg', 'coco_license': 4, 'coco_width': 640, 'coco_height': 536, 'coco_date_captured': '2013-11-24 13:04:24', 'image_path': '/Users/yuz/.cache/huggingface/datasets/downloads/extracted/d7ee93c5904043a54a13b9b98c9a044e20aa3c6cd1a6b349808657be0b0a9865/train2014/COCO_train2014_000000068857.jpg'}
INFO:__main__:Dataset sample [seacrowd_imqa]
{'id': '0', 'question_id': '0', 'document_id': '0', 'questions': ['màu của chiếc bình là gì'], 'type': None, 'choices': None, 'context': None, 'answer': ['màu xanh lá'], 'image_paths': ['/Users/yuz/.cache/huggingface/datasets/downloads/extracted/d7ee93c5904043a54a13b9b98c9a044e20aa3c6cd1a6b349808657be0b0a9865/train2014/COCO_train2014_000000068857.jpg'], 'meta': {'coco_img_id': '68857', 'type': '2', 'flickr_url': 'http://farm2.staticflickr.com/1433/5167151338_3e40df8f91_z.jpg', 'coco_url': 'http://images.cocodataset.org/train2014/COCO_train2014_000000068857.jpg', 'img_name': 'COCO_train2014_000000068857.jpg', 'coco_license': 4, 'coco_width': 640, 'coco_height': 536, 'coco_date_captured': '2013-11-24 13:04:24', 'image_path': '/Users/yuz/.cache/huggingface/datasets/downloads/extracted/d7ee93c5904043a54a13b9b98c9a044e20aa3c6cd1a6b349808657be0b0a9865/train2014/COCO_train2014_000000068857.jpg'}}
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 3001 unique IDs
INFO:__main__:Gathering schema statistics
INFO:__main__:Gathering schema statistics
train
==========
id: 11999
question_id: 11999
document_id: 11999
questions: 11999
answer: 11999
image_paths: 11999
meta: 119990

test
==========
id: 3001
question_id: 3001
document_id: 3001
questions: 3001
answer: 3001
image_paths: 3001
meta: 30010

.
----------------------------------------------------------------------
Ran 1 test in 1.726s

OK
(ani) 
# yuz @ LLL in ~/workspace/seacrowd-datahub on git:vivqa x [10:58:08] 
$ make check_file=seacrowd/sea_datasets/vivqa/vivqa.py
black --line-length 250 --target-version py38 seacrowd/sea_datasets/vivqa/vivqa.py
reformatted seacrowd/sea_datasets/vivqa/vivqa.py

All done! ✨ 🍰 ✨
1 file reformatted.
isort seacrowd/sea_datasets/vivqa/vivqa.py
Fixing /Users/yuz/workspace/seacrowd-datahub/seacrowd/sea_datasets/vivqa/vivqa.py
flake8 seacrowd/sea_datasets/vivqa/vivqa.py --max-line-length 250
(ani)
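
One way to sanity-check the schema statistics printed above: in both splits the meta count is ten times the row count, which matches the ten keys in the meta dict of the seacrowd_imqa sample. The sketch below verifies that arithmetic. It assumes the statistics count one entry per meta key per row, which is an interpretation of the output rather than documented behaviour of the test script.

```python
# Figures copied from the schema statistics in the log above.
STATS = {
    "train": {"id": 11999, "meta": 119990},
    "test": {"id": 3001, "meta": 30010},
}

# Number of keys in the "meta" dict of the seacrowd_imqa sample record.
META_KEYS = 10


def meta_is_consistent(split_stats: dict, meta_keys: int = META_KEYS) -> bool:
    """Return True if the meta count equals rows multiplied by keys per row."""
    return split_stats["meta"] == split_stats["id"] * meta_keys


if __name__ == "__main__":
    for split, split_stats in STATS.items():
        print(split, meta_is_consistent(split_stats))
```

The 3001 unique IDs reported by the global uniqueness check also match the row count of the test split.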