IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.

Apache License 2.0


Create dataset loader for Unimorph ID #44

Closed. SamuelCahyawijaya closed this 1 year ago

SamuelCahyawijaya commented 2 years ago

https://indonlp.github.io/nusa-catalogue/card.html?unimorph_id

fhudi commented 2 years ago

self-assign

bryanwilie commented 2 years ago

Hi @fhudi, are you still working on this? I will assume inactivity if there's no reply and will free the assignees. Thanks!

fhudi commented 2 years ago

Hi @bryanwilie, thanks for asking. I discussed this with @gentaiscool and @afaji a long time ago, and based on that discussion it seems a new schema might be required. I thought there would be similar situations for other datasets and was expecting a flow for proposing new schemas to be released. But that does not seem to be the case, as your comment suggests otherwise. Is there any way to propose a new schema, or should I just create a new schema alongside?

bryanwilie commented 2 years ago

Noted, @fhudi. Looping in @holylovenia since this is related to proposing a new schema.

Thank you for joining us by the way!

holylovenia commented 2 years ago

Hello @fhudi, thank you for waiting. For the nusantara schema, could you please use the t2t schema (and Tasks.PARAPHRASING) with the form as the text1 and the lemma as text2? For the source schema, you can implement it according to the original dataset structure, so the features will be: lemma: string, form: string, tag: [string]. Please let me know if you have any questions.
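The two schemas described above can be sketched roughly as follows. This is only an illustration in plain Python, not the actual nusa-crowd loader: the `SourceRecord`, `parse_line`, and `to_t2t` names are hypothetical, the input is assumed to be UniMorph-style TSV (lemma, form, then tags joined by `;`), and the t2t key names (`text_1`, `text_2`, and so on) are assumed rather than confirmed in this thread.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SourceRecord:
    """Source-schema row: lemma: string, form: string, tag: [string]."""
    lemma: str
    form: str
    tag: List[str]


def parse_line(line: str) -> SourceRecord:
    # Assumption: one entry per line, as lemma <TAB> form <TAB> tags joined by ';'.
    lemma, form, tags = line.rstrip("\n").split("\t")
    return SourceRecord(lemma=lemma, form=form, tag=tags.split(";"))


def to_t2t(idx: int, rec: SourceRecord) -> Dict[str, str]:
    # Form goes to text 1 and lemma to text 2, as suggested above.
    # The key names are assumed; the tag list has no slot here and is dropped.
    return {
        "id": str(idx),
        "text_1": rec.form,
        "text_2": rec.lemma,
        "text_1_name": "form",
        "text_2_name": "lemma",
    }


record = parse_line("abdi\tmengabdi\tV;ACT")
example = to_t2t(0, record)
```

Note that `to_t2t` has nowhere to put `record.tag`, so the inflection information only survives in the source schema.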

fhudi commented 2 years ago

Hi @holylovenia, thanks for the reply.

Sorry but I don't quite get it, could you please elaborate more 🙏

Firstly, the t2t schema does not have a tag field for the crucial inflection element, CMIIW.
Secondly, the tag is a list of labels. CMIIW, but is nusantara not going to support this?

Let's take an example as follows.

Following the paraphrasing task, in this particular example the same input text abdi has two different outputs, [abdinya, mengabdi]. Is this fine?

I believe what you described is specifically the Morphological Analysis task minus the inflection part, which turns it into a Paraphrasing task as a result.

And what about the Morphological Inflection task, i.e.: (in) abdi ['V', 'ACT'] → (out) mengabdi? Are we not going to support these morphological tasks in Nusantara?
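To make the concern concrete, here is a minimal sketch in plain Python (not nusa-crowd code). The `rows`, `forms_by_lemma`, and `inflect` names are hypothetical, and the tag shown for abdinya is a placeholder, since only the tags for mengabdi appear in this thread.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (lemma, form, tag) rows. The tag for "abdinya" is a hypothetical
# placeholder; only ['V', 'ACT'] for "mengabdi" comes from this thread.
rows: List[Tuple[str, str, List[str]]] = [
    ("abdi", "mengabdi", ["V", "ACT"]),
    ("abdi", "abdinya", ["PLACEHOLDER_TAG"]),
]

# Without tags, one lemma maps to several forms, so the target is ambiguous.
forms_by_lemma: Dict[str, List[str]] = defaultdict(list)
for lemma, form, _ in rows:
    forms_by_lemma[lemma].append(form)

# With tags as part of the input, each (lemma, tags) pair selects one form.
form_by_lemma_and_tags: Dict[Tuple[str, Tuple[str, ...]], str] = {
    (lemma, tuple(tag)): form for lemma, form, tag in rows
}


def inflect(lemma: str, tags: List[str]) -> str:
    """Morphological inflection as a lookup: (lemma, tags) -> form."""
    return form_by_lemma_and_tags[(lemma, tuple(tags))]
```

For instance, `inflect("abdi", ["V", "ACT"])` picks out mengabdi, whereas `forms_by_lemma["abdi"]` still holds both forms.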

holylovenia commented 2 years ago

Hi @fhudi, thank you for waiting and explaining. What you said is right: it is quite inaccurate to frame morphological inflection as a paraphrasing task. However, so far there hasn't been a demand for this schema structure aside from this dataloader, so we have decided to leave the nusantara schema out of it for now. Please implement the source schema only. Thanks again! :smile:

fhudi commented 2 years ago

@holylovenia Noted and thanks, will do so 😄 @afaji FYI, morphology-related tasks won't be implemented for now 🙏

muhsatrio commented 1 year ago

Hi kak @fhudi, sorry, I reopened this because I missed one thing. Can you change the location of the unimorph_id.py file to nusacrowd/nusa_datasets/unimorph_id/unimorph_id.py? The other datasets have been moved to that location too. You can raise another PR to fix it. Thank you!

Closed again since this has been resolved in https://github.com/IndoNLP/nusa-crowd/commit/3df7f5b8e89112cc17d4a7466441ac41e9b8fa87 by @holylovenia