SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for IndoWiki #347

Closed SamuelCahyawijaya closed 6 months ago

SamuelCahyawijaya commented 8 months ago

Dataloader name: indowiki/indowiki.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?indowiki

Dataset indowiki
Description IndoWiki is a knowledge-graph dataset taken from WikiData and aligned with Wikipedia Bahasa Indonesia as its corpus.
Subsets -
Languages ind
Tasks Knowledge Base
License MIT (mit)
Homepage https://github.com/IgoRamli/IndoWiki
HF URL -
Paper URL https://ieeexplore.ieee.org/document/9924844
zwenyu commented 7 months ago

self-assign

sabilmakbar commented 7 months ago

Hi @holylovenia @SamuelCahyawijaya. This dataset only contains entity pairs and their relations, without any passage information (hence it can't be put under the RELATION_EXTRACTION task). Do you think we should omit the SEACrowd schema implementation, or create another task for this (likely using the SEACrowd KB schema)?

holylovenia commented 6 months ago

> Hi @holylovenia @SamuelCahyawijaya. This dataset only contains entity pairs and their relations, without any passage information (hence it can't be put under the RELATION_EXTRACTION task). Do you think we should omit the SEACrowd schema implementation, or create another task for this (likely using the SEACrowd KB schema)?

Will a task using the pairs schema be suitable, @zwenyu @sabilmakbar?

sabilmakbar commented 6 months ago

@holylovenia In a typical KB format, the data comes as triplets of (Subject/Entity 1, Predicate/Relation, Object/Entity 2). The difficulty with using the pairs schema is that the predicate values can either be unknown -- there could be infinitely many possible predicates -- or there are simply too many classes to define in a ClassLabel's names.

FYI, this dataset has 935 possible relation values, which would be tedious (and practically impossible) to write out by hand without iterating over all the values in the dataset.

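For context, a distinct-relation count like this can be reproduced with a short script along the following lines. This is a hypothetical sketch: the file name and the tab-separated (head, relation, tail) layout are assumptions, not the confirmed IndoWiki format.

```python
# Hypothetical sketch: count the distinct relation (predicate) values in a
# knowledge-graph split stored as tab-separated (head, relation, tail) lines.
# "train.txt" and the column layout are assumptions, not the confirmed
# IndoWiki format.
from collections import Counter

relation_counts = Counter()
with open("train.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        _head, relation, _tail = parts
        relation_counts[relation] += 1

print(f"{len(relation_counts)} distinct relation values")
print(relation_counts.most_common(10))
```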

If we really want to make a SEACrowd schema out of it (which I would prefer not to, because the task itself is not particularly useful in the NLP world), I suggest creating a triplet-based schema similar to the pairs schema, just with all columns as strings (and possibly storing both the values and their IDs, if the source dataset has them); see the sketch below.
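A hedged sketch of such a string-only triplet schema, expressed with the HuggingFace `datasets` Features API; the column names here are illustrative, not the finalized SEACrowd schema:

```python
# Sketch of a string-based triplet schema. Keeping every column as a plain
# string avoids enumerating all 935 relation values in a ClassLabel.
import datasets

triplet_features = datasets.Features(
    {
        "id": datasets.Value("string"),
        "head": datasets.Value("string"),      # Subject / Entity 1 (value or ID)
        "relation": datasets.Value("string"),  # Predicate, kept as free text
        "tail": datasets.Value("string"),      # Object / Entity 2 (value or ID)
    }
)
```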

zwenyu commented 6 months ago

@holylovenia @sabilmakbar I've added a triplets schema and updated the PR. Can you check if it's ok?

sabilmakbar commented 6 months ago

> @holylovenia @sabilmakbar I've added a triplets schema and updated the PR. Can you check if it's ok?

I've checked it, and it looks okay (except that the config PR should be separated from the dataloader one).

But one thing that worries me is whether we really need this SEACrowd triplet schema for a KB dataset that is not widely used in LLM/LMM development and evaluation.

holylovenia commented 6 months ago

Hello @sabilmakbar @zwenyu, apparently this dataloader utilizes a niche schema, so I think implementing only the source schema is enough. No need for the seacrowd schema.
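For reference, a source-schema-only loader expressed in plain HuggingFace `datasets` terms could look like the minimal sketch below. The class name, config name, file name, and the tab-separated (head, relation, tail) layout are assumptions for illustration; the actual SEACrowd dataloader template adds its own config class and constants.

```python
# Minimal sketch of a source-schema-only loader using the plain HuggingFace
# `datasets` builder API. All names and file layouts here are assumptions.
import datasets


class IndoWiki(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIGS = [
        datasets.BuilderConfig(name="indowiki_source", version="1.0.0"),
    ]

    def _info(self):
        return datasets.DatasetInfo(
            description="IndoWiki knowledge-graph triples (source schema only).",
            features=datasets.Features(
                {
                    "head": datasets.Value("string"),
                    "relation": datasets.Value("string"),
                    "tail": datasets.Value("string"),
                }
            ),
        )

    def _split_generators(self, dl_manager):
        # Hypothetical local path; a real loader would download the splits
        # from the IndoWiki repository.
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"filepath": "train.txt"},
            ),
        ]

    def _generate_examples(self, filepath):
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                head, relation, tail = line.rstrip("\n").split("\t")
                yield idx, {"head": head, "relation": relation, "tail": tail}
```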

zwenyu commented 6 months ago

@holylovenia @sabilmakbar Noted. I've reverted the changes and removed the seacrowd schema.