Open SamuelCahyawijaya opened 8 months ago
Hey, this dataset provides these data for each language:

- `*.image`: space-separated values, the article `url` followed by several `image_url`s
- `*.source`: the text at the `url`
- `*.tag`: a tag for each `image_url`
- `*.target`: the output summary
Not sure what seacrowd schema I should implement for this one, as it is actually text2text with accompanying images at the url. If using the image_text schema, maybe this mapping? Or do you have a better idea, like a new schema instead?
id -> url
image_paths -> list of image_url
texts -> summary
metadata
├── context -> text
└── labels -> None
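For illustration, here is a rough sketch of one row under that mapping (field names as proposed above; the values are placeholders, not taken from the actual files):

```python
# Sketch of a single example under the proposed image_text mapping (placeholder values).
example = {
    "id": "https://www.bbc.com/news/...",  # the article url
    "image_paths": [
        "https://ichef.bbci.co.uk/news/.../image_1.jpg",
        "https://ichef.bbci.co.uk/news/.../image_2.jpg",
    ],  # list of image_url
    "texts": "the output summary from *.target",
    "metadata": {
        "context": "the article text from *.source",
        "labels": None,
    },
}
```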
Btw, better homepage for the datasheet: https://github.com/XL2248/SOV-MAS
@holylovenia @SamuelCahyawijaya @sabilmakbar
Hi @akhdanfadh, the `imtext` schema implies that the `context` is additional, not required. But in this dataset, the contexts include both `image` and `text`, so I'm more inclined to have a separate schema (maybe something like `imtext2t`?).
What do you think, @sabilmakbar @SamuelCahyawijaya?
I think implementing a new schema like `imtext2t` is less scalable and a bit harder to interpret than the ones initially proposed by @akhdanfadh.

For the `tag`, do you mind giving a few examples to confirm our understanding? I think this one could be put in `labels` if it's quite informative and has a 1:1 mapping to the `image`.

I instead suggest modifying our current text2text schema to add a `metadata` field, similar to how the qa schema works. Thinking back to the main task, which is summarization, we can think of the images as additional data here. Wdyt? @sabilmakbar @holylovenia

For discussion, I think it is a good idea to generalize `metadata` to all schemas. No pressure, though.
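To make the t2t-with-`metadata` idea concrete, here is a minimal sketch of what the feature spec could look like. It assumes the existing t2t fields are `id`, `text_1`, `text_2`, `text_1_name`, and `text_2_name`, and that the extra field is called `meta`; the actual names should follow whatever the current seacrowd schema defines.

```python
import datasets

# Sketch only: the usual text2text features plus a free-form meta field,
# similar in spirit to how the qa schema carries extra information.
features = datasets.Features(
    {
        "id": datasets.Value("string"),
        "text_1": datasets.Value("string"),  # article text (*.source)
        "text_2": datasets.Value("string"),  # summary (*.target)
        "text_1_name": datasets.Value("string"),
        "text_2_name": datasets.Value("string"),
        "meta": {
            "image_urls": datasets.Sequence(datasets.Value("string")),  # from *.image
            "image_tags": datasets.Sequence(datasets.Value("string")),  # from *.tag
        },
    }
)
```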
> For the `tag`, do you mind giving a few examples to confirm our understanding? I think this one could be put in `labels` if it's quite informative and has a 1:1 mapping to the `image`.
Here is a `*.tag` data example:

```
0_afeb3192e879abbbcc452781cc10cd4acf4acf77_0 6_afeb3192e879abbbcc452781cc10cd4acf4acf77_6
13_b10a0cbef8d8dbcb6de36058b6d9148f4a43a8c3_0 19_b10a0cbef8d8dbcb6de36058b6d9148f4a43a8c3_6
26_f551d40ba73db3615350c9db952ec4d4cde4d246_0 27_f551d40ba73db3615350c9db952ec4d4cde4d246_1 28_f551d40ba73db3615350c9db952ec4d4cde4d246_2 34_f551d40ba73db3615350c9db952ec4d4cde4d246_8
...
```
It maps 1:1 to the `image_url`s in the corresponding `*.image` lines (example below, where each line starts with the article url):

```
https://www.bbc.com/news/uk-england-coventry-warwickshire-11714685 https://ichef.bbci.co.uk/news/304/mcs/media/images/49852000/jpg/_49852130_elecbus_091110_oov_0628bm-001.jpg https://ichef.bbci.co.uk/news/385/cpsprodpb/1101A/production/_123185696_gettyimages-1238276984.jpg
https://www.bbc.com/news/uk-england-beds-bucks-herts-11309150 https://ichef.bbci.co.uk/news/304/mcs/media/images/49106000/jpg/_49106357_2.jpg https://ichef.bbci.co.uk/news/385/cpsprodpb/1101A/production/_123185696_gettyimages-1238276984.jpg
https://www.bbc.com/news/magazine-24338387 https://ichef.bbci.co.uk/news/304/mcs/media/images/70200000/jpg/_70200876_ragout3_304.jpg https://ichef.bbci.co.uk/news/304/mcs/media/images/70217000/jpg/_70217221_cherryblossom.jpg https://ichef.bbci.co.uk/news/304/mcs/media/images/70217000/jpg/_70217224_gilead.jpg https://ichef.bbci.co.uk/news/385/cpsprodpb/1101A/production/_123185696_gettyimages-1238276984.jpg
...
```
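To make the pairing explicit, here is a rough parsing sketch (assuming the files are line-aligned, that each `*.image` line is the article url followed by its image urls, and using per-split file names like `train.image`; this is not the final dataloader logic):

```python
# Pair each image_url with its tag, assuming line i of *.image matches line i of *.tag.
def parse_pair(image_line: str, tag_line: str):
    parts = image_line.split()
    article_url, image_urls = parts[0], parts[1:]
    tags = tag_line.split()
    assert len(image_urls) == len(tags), "expected a 1:1 mapping of tags to image_urls"
    return article_url, list(zip(image_urls, tags))

with open("train.image", encoding="utf-8") as f_img, open("train.tag", encoding="utf-8") as f_tag:
    for image_line, tag_line in zip(f_img, f_tag):
        article_url, images_with_tags = parse_pair(image_line, tag_line)
```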
> Thinking back to the main task, which is summarization, we can think of the images as additional data here. Wdyt? @sabilmakbar @holylovenia
Sure, I agree that in this case `t2t` with `meta` is more appropriate.
> For discussion, I think it is a good idea to generalize `metadata` to all schemas.
I agree with you. We would have to change the previous dataloaders to assign an empty `dict` to the `meta` variable, though.
What do you think, @sabilmakbar @akhdanfadh?
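If we go that route, the change to existing t2t dataloaders would presumably be as small as adding an empty `meta` to every yielded example, along these lines (a sketch with assumed field names, not the actual schema code):

```python
def to_seacrowd_t2t(idx: int, source_text: str, summary_text: str) -> dict:
    # Illustrative only: an existing t2t example simply gains an empty meta field.
    return {
        "id": str(idx),
        "text_1": source_text,
        "text_2": summary_text,
        "text_1_name": "document",
        "text_2_name": "summary",
        "meta": {},
    }
```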
> We would have to change the previous dataloaders to assign an empty `dict` to the `meta` variable, though.
It can be for future work IMO.
@holylovenia
The dataset turns out to be inconsistent. Using the train split as an example, we have `train.image`, `train.target`, `train.tag`, and `train.source` files. These files are meant to be read line by line, with each line corresponding to one instance, BUT the line counts do not match.

For the Indonesian subset I got: image=36163, source=36161, target=36161, tag=36163, while for the Vietnamese subset I got: image=18816, source=18811, target=18811, tag=18816.
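The mismatch can be reproduced with a simple line count over the four split files, e.g.:

```python
# Count lines per split file to check whether the four files are aligned.
def count_lines(path: str) -> int:
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

for ext in ("image", "source", "target", "tag"):
    print(ext, count_lines(f"train.{ext}"))
```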
Would it be possible to identify which data instances' attributes are skipped, @akhdanfadh? If it is, let's just skip those corresponding data instances so all the loaded variables are consistent.
> Would it be possible to identify which data instances' attributes are skipped?
@holylovenia There is no ID on each line in any of the files, so the short answer is no. Unless we want to scrape every article URL given, check maybe the first few words, and match them with the given source text, and wow, I'm not sure we need to do that.
Dataloader name:
mm_sum/mm_sum.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?mm_sum