Closes #46 | Indonesian WSD data loader

muhsatrio commented 1 year ago

Closes #46

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script nusacrowd/nusa_datasets/indonesian_wsd/indonesian_wsd.py (please use only lowercase and underscore for dataset naming).
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _NUSANTARA_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one NusantaraConfig for the source schema and one for a nusantara schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_nusantara --path=nusacrowd/nusa_datasets/indonesian_wsd/indonesian_wsd.py.
[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

SamuelCahyawijaya commented 1 year ago

/test indonesian_wsd

github-actions[bot] commented 1 year ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3116253872

SamuelCahyawijaya commented 1 year ago

/test dataset=indonesian_wsd

github-actions[bot] commented 1 year ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3116269726

muhsatrio commented 1 year ago

/test dataset=indonesian_wsd

github-actions[bot] commented 1 year ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3120028826

bryanwilie commented 1 year ago

Hi @muhsatrio, thank you for the revision!

If you check this line here, the source label atas is actually concatenated with the text itself.

Just a suggestion: would it be better to put the source label atas in the text_2_name of the t2t schema?

And one more thing, could the __init__ file be removed from the commits too?

Thanks again!

muhsatrio commented 1 year ago

Hi @bryanwilie, just updated based on your review, please recheck again, thank you!

muhsatrio commented 1 year ago

/test dataset=indonesian_wsd

github-actions[bot] commented 1 year ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3121632292

SamuelCahyawijaya commented 1 year ago

Hello, I think this dataset is not that simple tho. According to the paper, it should be a classification task given two sentences from English and Indonesian. Perhaps @rmahendra could help us to describe the process that needs to be done to get the classification dataset. Thank you!

SamuelCahyawijaya commented 1 year ago

Hello @muhsatrio, I discussed this with Pak Rahmad Mahendra, and his suggestion is as follows:

As far as I know (at least for non-deep-learning experiments), the WSD task usually trains one model per word.

Input to the classifier, given the model: sentence
Output: sense_id

For example, the released data has 6 words. Build 6 binary classification models (there are 2 sense ids per word).

So, I think classification for predicting sense_id would be more appropriate here, although I feel that using sense_id as the label is not ideal either. Do you have any other suggestions on this? cc @bryanwilie @muhsatrio
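
To make the per-word setup above concrete, here is a minimal sketch, not the dataloader itself. It assumes each record carries a target word, a sentence, and a sense_id; the field names, the toy sentences, the second word "bisa", and the sense ids such as atas_1 are all hypothetical and are not taken from the released data. A simple bag-of-words count stands in for whatever classifier would actually be used.

```python
from collections import Counter

# Hypothetical toy records: field names, sentences, and sense ids are
# illustrative assumptions, not values from the released dataset.
RECORDS = [
    {"word": "atas", "sense_id": "atas_1", "sentence": "buku itu ada di atas meja"},
    {"word": "atas", "sense_id": "atas_1", "sentence": "kucing tidur di atas kursi"},
    {"word": "atas", "sense_id": "atas_2", "sentence": "terima kasih atas bantuan anda"},
    {"word": "atas", "sense_id": "atas_2", "sentence": "dia meminta maaf atas kesalahan itu"},
    {"word": "bisa", "sense_id": "bisa_1", "sentence": "saya bisa berenang"},
    {"word": "bisa", "sense_id": "bisa_2", "sentence": "bisa ular itu berbahaya"},
]


def train_per_word(records):
    """Build one model per target word: word -> sense_id -> token counts."""
    models = {}
    for record in records:
        senses = models.setdefault(record["word"], {})
        counts = senses.setdefault(record["sense_id"], Counter())
        counts.update(record["sentence"].lower().split())
    return models


def predict(models, word, sentence):
    """Pick the sense of `word` whose training tokens overlap most with `sentence`."""
    if word not in models:
        raise KeyError(f"no model trained for word: {word}")
    tokens = sentence.lower().split()
    senses = models[word]
    return max(senses, key=lambda sense: sum(senses[sense][t] for t in tokens))


models = train_per_word(RECORDS)
print(predict(models, "atas", "tas itu ada di atas lemari"))
print(predict(models, "bisa", "ular itu punya bisa"))
```

Each word gets its own model, so the label space for any single prediction is just that word's two sense ids, which matches the "6 words, 6 binary classifiers" description.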

muhsatrio commented 1 year ago

/test dataset=indonesian_wsd

muhsatrio commented 1 year ago

Hi kak @SamuelCahyawijaya, I have committed some changes. Based on your earlier suggestions, I first tried to look up the word value for each sense_id on the Bahasa WordNet website (as far as I know it is https://bahasa.cs.ui.ac.id/iwn/wordnet.php), but I only got a blank page after filling in the input word and clicking the submit button. I think I will go with the second option and use sense_id as the label, wdyt?

SamuelCahyawijaya commented 1 year ago

Hi @muhsatrio, thank you for the update and sorry for the late reply. Well, yeah, I think in that case, that is the best we can do for now. Let's do the second option then!!

muhsatrio commented 1 year ago

The second option has actually already been done, kak @SamuelCahyawijaya, please recheck!
