SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for MM-Sum #518

Open SamuelCahyawijaya opened 5 months ago

SamuelCahyawijaya commented 5 months ago

Dataloader name: mm_sum/mm_sum.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?mm_sum

Dataset: mm_sum
Description: MM-Sum is the first large-scale Multilingual Multimodal Summarization dataset based on XLSum, a multilingual summarization dataset. The MM-Sum covers 44 languages with mid-high-, low- and zero-resource scenarios.
Subsets: -
Languages: ind, vie, mya, tha
Tasks: Multimodal Summarization
License: Unknown (unknown)
Homepage: https://github.com/XL2248/SOV-MAS
HF URL: -
Paper URL: https://aclanthology.org/2023.acl-long.165/
akhdanfadh commented 5 months ago

Hey, this dataset provides, for each language and split, four line-aligned files: *.source (article text), *.target (summary), *.image (article URL plus image URLs), and *.tag (image IDs).

Not sure which seacrowd schema I should implement for this one, as it is actually text2text with accompanying images given as URLs. If using the image_text schema, maybe this mapping? Or do you have a better idea, like a new schema instead?

id -> url
image_paths -> list of image_url
texts -> summary
metadata
├── context -> text
└── labels -> None
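
To make the proposal concrete, here is a rough sketch of one mapped example under that image_text idea; the values are placeholders, not a real MM-Sum row:

```python
# Rough sketch of the proposed mapping onto the image_text schema
# (placeholder values only, not an actual MM-Sum record).
example = {
    "id": "https://www.bbc.com/news/<article-id>",      # article url
    "image_paths": [
        "https://ichef.bbci.co.uk/news/<image-1>.jpg",   # list of image_url
        "https://ichef.bbci.co.uk/news/<image-2>.jpg",
    ],
    "texts": "<summary>",                                # summary
    "metadata": {
        "context": "<full article text>",                # text
        "labels": None,
    },
}
```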

Btw, better homepage for the datasheet: https://github.com/XL2248/SOV-MAS

@holylovenia @SamuelCahyawijaya @sabilmakbar

akhdanfadh commented 5 months ago

self-assign

holylovenia commented 5 months ago

Not sure which seacrowd schema I should implement for this one, as it is actually text2text with accompanying images given as URLs. If using the image_text schema, maybe this mapping? Or do you have a better idea, like a new schema instead?


Hi @akhdanfadh, the imtext schema implies that the context is additional, not required. But in this dataset, the contexts include both image and text, so I'm more inclined to have a separate schema (maybe something like imtext2t?).

What do you think, @sabilmakbar @SamuelCahyawijaya?

sabilmakbar commented 5 months ago

I think implementing a new imtext2t schema is less scalable and a bit harder to interpret than the mapping initially proposed by @akhdanfadh.

For the tag, do you mind giving a few examples to confirm our understanding? I think it could be put in labels if it's informative and has a 1:1 mapping to the images.

akhdanfadh commented 5 months ago

I instead suggest modifying our current text2text schema to add a metadata field, similar to how the qa schema works. Thinking back to the main task, which is summarization, we can treat the images as additional data here. Wdyt? @sabilmakbar @holylovenia

For discussion, I think it is a good idea to generalize metadata to all schemas. No pressure, though.
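
To make this concrete, a minimal sketch of what the extended t2t features might look like, assuming the text_1/text_2 field names used later in this thread; the meta layout here is just an illustration, not a final proposal:

```python
import datasets

# Sketch only: text2text fields plus a free-form meta section
# (exact naming and nesting are open for discussion).
text2text_with_meta = datasets.Features(
    {
        "id": datasets.Value("string"),
        "text_1": datasets.Value("string"),  # article text
        "text_2": datasets.Value("string"),  # summary
        "text_1_name": datasets.Value("string"),
        "text_2_name": datasets.Value("string"),
        # hypothetical metadata: per-article image ids and urls
        "meta": {
            "images": datasets.Sequence(
                {
                    "id": datasets.Value("string"),
                    "url": datasets.Value("string"),
                }
            ),
        },
    }
)
```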


For the tag, do you mind giving a few examples to confirm our understanding? I think it could be put in labels if it's informative and has a 1:1 mapping to the images.

`*.tag` data example:

```
0_afeb3192e879abbbcc452781cc10cd4acf4acf77_0    6_afeb3192e879abbbcc452781cc10cd4acf4acf77_6
13_b10a0cbef8d8dbcb6de36058b6d9148f4a43a8c3_0   19_b10a0cbef8d8dbcb6de36058b6d9148f4a43a8c3_6
26_f551d40ba73db3615350c9db952ec4d4cde4d246_0   27_f551d40ba73db3615350c9db952ec4d4cde4d246_1   28_f551d40ba73db3615350c9db952ec4d4cde4d246_2   34_f551d40ba73db3615350c9db952ec4d4cde4d246_8
...
```

These map 1:1 to the image URLs here (`*.image` data example):

```
https://www.bbc.com/news/uk-england-coventry-warwickshire-11714685  https://ichef.bbci.co.uk/news/304/mcs/media/images/49852000/jpg/_49852130_elecbus_091110_oov_0628bm-001.jpg https://ichef.bbci.co.uk/news/385/cpsprodpb/1101A/production/_123185696_gettyimages-1238276984.jpg
https://www.bbc.com/news/uk-england-beds-bucks-herts-11309150   https://ichef.bbci.co.uk/news/304/mcs/media/images/49106000/jpg/_49106357_2.jpg https://ichef.bbci.co.uk/news/385/cpsprodpb/1101A/production/_123185696_gettyimages-1238276984.jpg
https://www.bbc.com/news/magazine-24338387  https://ichef.bbci.co.uk/news/304/mcs/media/images/70200000/jpg/_70200876_ragout3_304.jpg   https://ichef.bbci.co.uk/news/304/mcs/media/images/70217000/jpg/_70217221_cherryblossom.jpg https://ichef.bbci.co.uk/news/304/mcs/media/images/70217000/jpg/_70217224_gilead.jpg    https://ichef.bbci.co.uk/news/385/cpsprodpb/1101A/production/_123185696_gettyimages-1238276984.jpg
...
```
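
For reference, pairing the two files is a simple zip over tab-separated fields, assuming the first field of each `*.image` line is the article URL, as in the rows above (values below are placeholders):

```python
# Sketch: pair one *.tag line with the matching *.image line.
# Assumes tab-separated fields; the first *.image field is the article URL.
image_line = "https://www.bbc.com/news/<article>\thttps://ichef.bbci.co.uk/<img-0>.jpg\thttps://ichef.bbci.co.uk/<img-1>.jpg"
tag_line = "0_<hash>_0\t1_<hash>_1"

image_fields = image_line.strip().split("\t")
article_url, image_urls = image_fields[0], image_fields[1:]
image_ids = tag_line.strip().split("\t")

# 1:1 mapping between tag ids and image urls
images = [{"id": i, "url": u} for i, u in zip(image_ids, image_urls)]
print(article_url, images)
```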
holylovenia commented 5 months ago

Thinking back to the main task, which is summarization, we can treat the images as additional data here. Wdyt? @sabilmakbar @holylovenia

Sure, I agree that in this case t2t with meta is more appropriate.

For discussion, I think it is a good idea to generalize meta to all schemas.

I agree with you. We would have to change the previous dataloaders to assign an empty dict to the meta variable though.

What do you think, @sabilmakbar @akhdanfadh?

akhdanfadh commented 4 months ago

We would have to change the previous dataloaders to assign an empty dict to the meta variable though.

It can be for future work IMO.


@holylovenia

The dataset turns out to be inconsistent. Using the train split as an example, we have train.image, train.target, train.tag, and train.source files. These files are meant to be read line by line, with each line corresponding to one instance, BUT the line counts do not match.

For the Indonesian subset I got image=36163, source=36161, target=36161, tag=36163, while for the Vietnamese subset I got image=18816, source=18811, target=18811, tag=18816.
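
A throwaway check like this reproduces the mismatch (the extracted directory path is an assumption based on the archive layout used in the dataloader below):

```python
from pathlib import Path

# Quick line-count check per file for one subset
# (directory path is an assumption based on the extracted archive layout).
data_dir = Path("SOV-MAS-data/high-resource/indonesian")
for suffix in ("image", "source", "target", "tag"):
    with open(data_dir / f"train.{suffix}", encoding="utf-8") as f:
        print(suffix, sum(1 for _ in f))
```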

Implemented dataloader (this should be it ig :) )

```python
# coding=utf-8
# Copyright 2022 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from pathlib import Path
from typing import Dict, List, Tuple

import datasets

from seacrowd.utils.configs import SEACrowdConfig
from seacrowd.utils.constants import (SCHEMA_TO_FEATURES, TASK_TO_SCHEMA,
                                      Licenses, Tasks)

_CITATION = """\
@inproceedings{liang-etal-2023-summary,
    title = "Summary-Oriented Vision Modeling for Multimodal Abstractive Summarization",
    author = "Liang, Yunlong and Meng, Fandong and Xu, Jinan and Wang, Jiaan and Chen, Yufeng and Zhou, Jie",
    editor = "Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.165",
    doi = "10.18653/v1/2023.acl-long.165",
    pages = "2934--2951",
}
"""

_DATASETNAME = "mm_sum"

_DESCRIPTION = """\
MM-Sum is the first large-scale Multilingual Multimodal Summarization dataset based on XLSum,
a multilingual summarization dataset. The MM-Sum covers 44 languages with mid-high-, low- and
zero-resource scenarios.
"""

_HOMEPAGE = "https://github.com/XL2248/SOV-MAS"

_LANGUAGES = ["ind", "vie", "mya", "tha"]

_LICENSE = Licenses.UNKNOWN.value

_LOCAL = False

_URLS = {
    _DATASETNAME: "https://drive.google.com/file/d/1h-vWFQaZyOu_jbr6thwUWbzW93fOke0i/view",
}

_SUPPORTED_TASKS = [Tasks.SUMMARIZATION]  # multimodal in this dataset
_SEACROWD_SCHEMA = f"seacrowd_{TASK_TO_SCHEMA[_SUPPORTED_TASKS[0]].lower()}"  # sptext

_SOURCE_VERSION = "1.0.0"

_SEACROWD_VERSION = "1.0.0"


class NewDataset(datasets.GeneratorBasedBuilder):
    """Large-scale Multilingual Multimodal Summarization dataset based on XLSum"""

    SOURCE_VERSION = datasets.Version(_SOURCE_VERSION)
    SEACROWD_VERSION = datasets.Version(_SEACROWD_VERSION)

    SUBSETS = {
        "ind": "high-resource/indonesian",
        "vie": "high-resource/vietnamese",
        "mya_zeroshot": "zero-shot/burmese",
        "tha_zeroshot": "zero-shot/thai",
        "mya_fewshot": "few-shot/burmese",
        "tha_fewshot": "few-shot/thai",
    }

    BUILDER_CONFIGS = []
    for subset in SUBSETS:
        BUILDER_CONFIGS += [
            SEACrowdConfig(
                name=f"{_DATASETNAME}_{subset}_source",
                version=SOURCE_VERSION,
                description=f"{_DATASETNAME} {subset} source schema",
                schema="source",
                subset_id=subset,
            ),
            SEACrowdConfig(
                name=f"{_DATASETNAME}_{subset}_{_SEACROWD_SCHEMA}",
                version=SEACROWD_VERSION,
                description=f"{_DATASETNAME} {subset} SEACrowd schema",
                schema=_SEACROWD_SCHEMA,
                subset_id=subset,
            ),
        ]

    DEFAULT_CONFIG_NAME = f"{_DATASETNAME}_ind_source"

    def _info(self) -> datasets.DatasetInfo:
        if self.config.schema == "source":
            features = datasets.Features(
                {
                    "text": datasets.Value("string"),
                    "summary": datasets.Value("string"),
                    "url": datasets.Value("string"),
                    "image": datasets.Sequence(
                        {
                            "id": datasets.Value("string"),
                            "url": datasets.Value("string"),
                        }
                    ),
                }
            )
        elif self.config.schema == _SEACROWD_SCHEMA:
            features = SCHEMA_TO_FEATURES[TASK_TO_SCHEMA[_SUPPORTED_TASKS[0]]]  # text2text_features
            features["images"] = datasets.Sequence(
                {
                    "id": datasets.Value("string"),
                    "url": datasets.Value("string"),
                }
            )

        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION,
        )

    def _create_split_generator(self, split_name: str, data_dir: Path):
        image_file = data_dir / f"{split_name}.image"
        source_file = data_dir / f"{split_name}.source"
        tag_file = data_dir / f"{split_name}.tag"
        target_file = data_dir / f"{split_name}.target"

        if split_name == "test":
            split = datasets.Split.TEST
        elif split_name == "val":
            split = datasets.Split.VALIDATION
        else:
            split = datasets.Split.TRAIN

        return datasets.SplitGenerator(
            name=split,
            gen_kwargs={"image_file": image_file, "source_file": source_file, "tag_file": tag_file, "target_file": target_file, "split": split_name},
        )

    def _split_generators(self, dl_manager: datasets.DownloadManager) -> List[datasets.SplitGenerator]:
        """Returns SplitGenerators."""
        # check if gdown is installed
        try:
            import gdown
        except ImportError as err:
            raise ImportError("You need to have gdown installed to download the dataset. You can install it with `pip install gdown`.") from err

        # download and extract the dataset
        output_dir = Path.cwd() / "data" / "mm_sum"
        output_dir.mkdir(parents=True, exist_ok=True)
        zip_file = output_dir / "SOV-MAS-data.zip"
        gdown.download(url=_URLS[_DATASETNAME], output=str(zip_file), fuzzy=True, resume=True)
        data_dir = Path(dl_manager.extract(zip_file)) / "SOV-MAS-data" / self.SUBSETS[self.config.subset_id]

        # handle split: no train data for zero-shot
        split_generator = []
        for split in ["val", "test"]:
            split_generator.append(self._create_split_generator(split, data_dir))
        if "zero" not in self.config.subset_id:
            split_generator.append(self._create_split_generator("train", data_dir))

        return split_generator

    def _generate_examples(self, image_file: Path, source_file: Path, tag_file: Path, target_file: Path, split: str) -> Tuple[int, Dict]:
        """Yields examples as (key, example) tuples."""
        # load files
        with open(image_file, "r", encoding="utf-8") as f:
            image_data = f.readlines()
        with open(source_file, "r", encoding="utf-8") as f:
            source_data = f.readlines()
        with open(target_file, "r", encoding="utf-8") as f:
            target_data = f.readlines()
        with open(tag_file, "r", encoding="utf-8") as f:
            tag_data = f.readlines()

        assert (
            len(image_data) == len(source_data) == len(target_data) == len(tag_data)
        ), f"The number of lines in {split} files should be the same. Got: image={len(image_data)}, source={len(source_data)}, target={len(target_data)}, tag={len(tag_data)}."

        for idx, (image, source, target, tag) in enumerate(zip(image_data, source_data, target_data, tag_data)):
            image = image.strip().split("\t")
            url = image.pop(0)
            source = source.strip()
            target = target.strip()
            tag = tag.strip().split("\t")

            if self.config.schema == "source":
                images = [{"id": image_id, "url": image_url} for image_id, image_url in zip(tag, image)]
                yield idx, {
                    "text": source,
                    "summary": target,
                    "url": url,
                    "image": images,
                }
            if self.config.schema == _SEACROWD_SCHEMA:
                yield idx, {
                    "text_1": source,
                    "text_2": target,
                    "text_1_name": "text",
                    "text_2_name": "summary",
                    "images": [{"id": image_id, "url": image_url} for image_id, image_url in zip(tag, image)],
                }
```
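
For local testing, loading the script directly would look roughly like this; the script path is an assumption based on the repo layout, and the config name follows the `f"{_DATASETNAME}_{subset}_source"` pattern above:

```python
import datasets

# Assumed local path to this script; "mm_sum_ind_source" is the
# Indonesian source-schema config defined above.
dset = datasets.load_dataset(
    "seacrowd/sea_datasets/mm_sum/mm_sum.py",
    name="mm_sum_ind_source",
    trust_remote_code=True,
)
print(dset["train"][0]["summary"])
```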
holylovenia commented 4 months ago

The dataset turns out to be inconsistent. Using the train split as an example, we have train.image, train.target, train.tag, and train.source files. These files are meant to be read line by line, with each line corresponding to one instance, BUT the line counts do not match.

For the Indonesian subset I got image=36163, source=36161, target=36161, tag=36163, while for the Vietnamese subset I got image=18816, source=18811, target=18811, tag=18816.

Would it be possible to identify which data instances have missing attributes, @akhdanfadh? If so, let's just skip those data instances so all the loaded variables are consistent.

akhdanfadh commented 4 months ago

Would it be possible to identify which data instances have missing attributes?

@holylovenia There is no ID on each line in any of the files, so the short answer is no. Unless we want to scrape every article URL given, check maybe the first few words, and match them against the given source text, which I'm not sure we need to do.