SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for CLIRMatrix #426

Closed SamuelCahyawijaya closed 5 months ago

SamuelCahyawijaya commented 9 months ago

Dataloader name: clir_matrix/clir_matrix.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?clir_matrix

| Dataset | clir_matrix |
|---|---|
| Description | CLIRMatrix is a massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval extracted automatically from Wikipedia. CLIRMatrix comprises (1) BI-139, a bilingual dataset of queries in one language matched with relevant documents in another language for 139x138=19,182 language pairs, and (2) MULTI-8, a multilingual dataset of queries and documents jointly aligned in 8 different languages. |
| Subsets | - |
| Languages | tgl, ilo, min, jav, sun, ceb, vie, tha |
| Tasks | Text Retrieval |
| License | Unknown (unknown) |
| Homepage | https://github.com/ssun32/CLIRMatrix |
| HF URL | - |
| Paper URL | https://aclanthology.org/2020.emnlp-main.340/ |

fhudi commented 9 months ago

self-assign

github-actions[bot] commented 8 months ago

Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

fhudi commented 8 months ago

To whom it may concern: regarding the github-actions reminder above, I apologise for the delay and any inconvenience caused. I am still working on this issue, so please give me some more time. Regards.

akhdanfadh commented 7 months ago

@fhudi may we know if you are still working on this issue? It has already been one month since your last update.

fhudi commented 7 months ago

Hi @akhdanfadh, thanks for the reminder.

Some files in the raw dataset turned out to be empty. I was in the process of downloading every combination of the 9 supported languages, to check and then ask the author for clarification, but somehow forgot about it halfway.


Hi @ssun32, I tried to create a dataloader for your dataset, but it seems some files are empty. Could you please help check BI-139 for all queries in Indonesian (id), i.e. id → *? 🙏

image

Also, regarding the license: it appears to be unknown, but the footnote on your CLIRMatrix site says cc-by-4.0. Which one is correct? 🙏

ssun32 commented 7 months ago

@fhudi It turns out there is zero overlap of the id-xx examples with the examples in the other language directions, probably due to incomplete Wikidata entries for ID when I created the dataset a few years ago. I recommend throwing away the language directions with empty files. Thanks for spotting the issue!
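
The recommendation above (drop language directions whose files are empty) could be sketched roughly as follows. This is only an illustration: the function name, the mapping of language pairs to local paths, and the assumption that "empty" means zero bytes on disk are all hypothetical. A compressed file that contains no records would have to be opened and inspected instead.

```python
from pathlib import Path


def non_empty_directions(files_by_direction):
    """Keep only the language directions whose data file exists and is non-empty.

    `files_by_direction` maps a (source, target) language pair to a local
    file path. The mapping and the file layout are hypothetical; "empty"
    is assumed to mean a size of zero bytes.
    """
    kept = {}
    for direction, path in files_by_direction.items():
        p = Path(path)
        # Skip missing files and files with no bytes at all.
        if p.is_file() and p.stat().st_size > 0:
            kept[direction] = p
    return kept
```

A dataloader could then build its configurations only from the keys of the returned mapping, so that directions such as id → * never appear as loadable subsets.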

fhudi commented 7 months ago

Thanks @ssun32. What about the license?


@SamuelCahyawijaya, need help 🙏 Shall we just remove support for the language ind?

holylovenia commented 7 months ago

@fhudi Removed ind in both issue ticket and datasheet.

fhudi commented 7 months ago

@holylovenia @SamuelCahyawijaya


The dataset's task is TEXT_RETRIEVAL, so the seacrowd schema for this dataset is PAIRS, as noted in constants.py.

However, this schema seems incorrect, because each triplet contains a discrete numerical relevance score, defined as follows: image

Although PAIRS_SCORE seems more fitting, the dataset is formatted for the IR (Information Retrieval) task, where a single triplet holds multiple document IDs and their relevance scores. Note that PAIRS_MULTI is categorical and hence unfit. image
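
To make the mismatch concrete, here is a minimal sketch of what flattening one such record into scored pairs might look like. The field names (`query_id`, `query`, `results`) are assumptions for illustration, not the confirmed layout of the raw files.

```python
def flatten_record(record):
    """Expand one query record into one row per (query, document) pair.

    Assumed (hypothetical) record layout:
        {"query_id": ..., "query": ..., "results": [[doc_id, score], ...]}
    """
    rows = []
    for doc_id, score in record["results"]:
        rows.append(
            {
                "query_id": record["query_id"],
                "query": record["query"],
                "doc_id": doc_id,      # an ID, not the document text
                "score": int(score),   # discrete relevance score
            }
        )
    return rows
```

Each resulting row pairs a query text with a document ID rather than a document text, which is why a text-pair schema with a score does not fit without also loading the documents themselves.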


I think we are not going to support the IR task format, right? If we did, the dataloader would have to load the full document texts instead of IDs, which would be problematic.

One immediate solution I can think of, without many changes to the shared classes, is to represent the relevance score as categorical.

But WDYT?
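
The categorical workaround proposed above could look roughly like the sketch below, which maps each discrete score to a string class label. The helper names are hypothetical and not part of the shared seacrowd classes. One trade-off is that string labels lose the numeric ordering of the scores unless the consumer converts them back.

```python
def score_to_label(score):
    """Turn a discrete integer relevance score into a string class label."""
    # Reject bools and non-integers so that only discrete scores are accepted.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError(f"expected an integer relevance score, got {score!r}")
    return str(score)


def label_set(scores):
    """Collect the distinct labels seen in the data, ordered by numeric value."""
    return sorted({score_to_label(s) for s in scores}, key=int)
```

For example, `label_set([3, 0, 3, 1])` yields the label names `["0", "1", "3"]`, which could then be declared as the categories of a categorical field.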

holylovenia commented 7 months ago

Hi @fhudi, I agree with you. Let's do source-only for this dataloader. 👍