IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0
261 stars, 61 forks

Closes #46 | Indonesian WSD data loader #291

Closed muhsatrio closed 1 year ago

muhsatrio commented 1 year ago

Closes #46

SamuelCahyawijaya commented 1 year ago

/test indonesian_wsd

github-actions[bot] commented 1 year ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3116253872

SamuelCahyawijaya commented 1 year ago

/test dataset=indonesian_wsd

github-actions[bot] commented 1 year ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3116269726

muhsatrio commented 1 year ago

/test dataset=indonesian_wsd

github-actions[bot] commented 1 year ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3120028826

bryanwilie commented 1 year ago

Hi @muhsatrio, thank you for the revision!

If you check this line here, the source label atas is actually concatenated with the text itself.

Just a suggestion: would it be better to put the source label atas in the text_2_name field of the t2t schema?

And one more thing, could the __init__ file be removed from the commits too?

Thanks again!

muhsatrio commented 1 year ago

Hi @bryanwilie, just updated based on your review, please recheck again, thank you!

muhsatrio commented 1 year ago

/test dataset=indonesian_wsd

github-actions[bot] commented 1 year ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3121632292

SamuelCahyawijaya commented 1 year ago

Hello, I think this dataset is not that simple tho. According to the paper, it should be a classification task given two sentences from English and Indonesian. Perhaps @rmahendra could help us to describe the process that needs to be done to get the classification dataset. Thank you!

SamuelCahyawijaya commented 1 year ago

Hello @muhsatrio, I discussed this with Pak Rahmad Mahendra, and you can find his suggestion below:

As far as I know (at least for non-deep-learning experiments), the WSD task usually trains one model per word.

Input to the classifier (given the model): sentence
Output: sense_id

For example, the released data has 6 words. Build 6 binary classification models (there are 2 sense ids per word).

So, I think classification for predicting sense_id would be more appropriate here, although I feel that using sense_id as the label is not ideal either. Do you have any other suggestions on this? cc @bryanwilie @muhsatrio
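
To make the per-word setup above concrete, here is a minimal sketch, not the loader or any code from the paper. It assumes each example is a (target word, sentence, sense_id) triple and trains one small bag-of-words Naive Bayes model per word. The toy sentences, the second word "bisa", the sense_id strings, and the names NaiveBayesSense and train_per_word are all hypothetical and only there for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy examples: (target_word, sentence, sense_id).
# These are NOT taken from the released dataset; the sense ids are made up.
TOY_DATA = [
    ("atas", "buku itu ada di atas meja", "atas_01"),
    ("atas", "kucing tidur di atas kursi", "atas_01"),
    ("atas", "terima kasih atas bantuan anda", "atas_02"),
    ("atas", "kami mohon maaf atas kesalahan ini", "atas_02"),
    ("bisa", "saya bisa datang besok", "bisa_01"),
    ("bisa", "dia bisa berenang cepat", "bisa_01"),
    ("bisa", "bisa ular itu sangat berbahaya", "bisa_02"),
    ("bisa", "racun dan bisa ular mematikan", "bisa_02"),
]


class NaiveBayesSense:
    """Multinomial Naive Bayes over whitespace tokens, with add-one smoothing."""

    def fit(self, sentences, sense_ids):
        self.sense_counts = Counter(sense_ids)
        self.word_counts = defaultdict(Counter)
        for sentence, sense in zip(sentences, sense_ids):
            self.word_counts[sense].update(sentence.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, sentence):
        tokens = sentence.lower().split()
        total = sum(self.sense_counts.values())
        best_sense, best_score = None, -math.inf
        for sense, count in self.sense_counts.items():
            # Log prior plus smoothed log likelihood of every token.
            score = math.log(count / total)
            denom = sum(self.word_counts[sense].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[sense][tok] + 1) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense


def train_per_word(examples):
    """Group examples by target word and fit one classifier per word."""
    grouped = defaultdict(lambda: ([], []))
    for word, sentence, sense in examples:
        grouped[word][0].append(sentence)
        grouped[word][1].append(sense)
    return {w: NaiveBayesSense().fit(s, y) for w, (s, y) in grouped.items()}


models = train_per_word(TOY_DATA)
print(models["atas"].predict("gelas ada di atas meja"))
print(models["bisa"].predict("bisa ular kobra berbahaya"))
```

With the real data this would give six models, one per word, each choosing between that word's two sense ids. For the loader itself, the same idea just means exposing the sentence as the text and sense_id as the label.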

muhsatrio commented 1 year ago

/test dataset=indonesian_wsd

muhsatrio commented 1 year ago

Hi Kak @SamuelCahyawijaya, I have committed some changes. Based on your earlier suggestions, I first tried to look up the word for each sense_id on the Bahasa WordNet website (as far as I know it is https://bahasa.cs.ui.ac.id/iwn/wordnet.php), but I just get a blank page after filling in the word and clicking the submit button. I think I will go with the second option and use sense_id as the label, wdyt?

SamuelCahyawijaya commented 1 year ago

Hi @muhsatrio, thank you for the update and sorry for the late reply. Well, yeah, I think in that case, that is the best we can do for now. Let's do the second option then!!

muhsatrio commented 1 year ago

The second option is actually already done, Kak @SamuelCahyawijaya, so please recheck when you can!