Closed muhsatrio closed 2 years ago
Hi @muhsatrio, the dataset works well. One more thing: could you also add the citation for the IndoNLG benchmark, so that it is consistent with the benchmark's other datasets?
The BibTeX entry is as follows:
@inproceedings{cahyawijaya-etal-2021-indonlg,
    title = "{I}ndo{NLG}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Generation",
    author = "Cahyawijaya, Samuel and Winata, Genta Indra and Wilie, Bryan and Vincentio, Karissa and Li, Xiaohong and Kuncoro, Adhiguna and Ruder, Sebastian and Lim, Zhi Yuan and Bahar, Syafri and Khodra, Masayu and Purwarianti, Ayu and Fung, Pascale",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.699",
    doi = "10.18653/v1/2021.emnlp-main.699",
    pages = "8875--8898"
}
Thank you!
Hi @SamuelCahyawijaya, I have made the adjustment for this comment. Could you please check again? Thank you!
Hi @muhsatrio, thanks for contributing. I found a minor problem in the dataset: the first sample of a dialogue should contain only a single user utterance as context, without any system utterance. For example, the first `t2t` training sample currently is:
{'id': '0_0', 'text_1': 'U: Hai apa kabar ? Saya bersiap-siap untuk melakukan cheetah mengejar untuk tetap bugar. | S: Anda pasti sangat cepat. berburu adalah salah satu hobi favorit saya. | U: saya ! untuk hobi saya, saya suka melakukan pengalengan atau sedikit merengek.', 'text_2': 'saya juga merombak rumah ketika saya tidak berburu busur.', 'text_1_name': 'Saya suka merombak rumah. | saya suka pergi berburu. | saya suka menembak busur. | liburan favorit saya adalah halloween.', 'text_2_name': 'response'}
This should be the second sample for the given dialogue, and the first one should be:
{'id': '0_0', 'text_1': 'U: Hai apa kabar ? Saya bersiap-siap untuk melakukan cheetah mengejar untuk tetap bugar.', 'text_2': 'Anda pasti sangat cepat. berburu adalah salah satu hobi favorit saya.', 'text_1_name': 'Saya suka merombak rumah. | saya suka pergi berburu. | saya suka menembak busur. | liburan favorit saya adalah halloween.', 'text_2_name': 'response'}
This is important for a dialogue system since this represents the very first turn when the user starts to interact with the system. Could you please update the dataset accordingly? Thank you!
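A minimal sketch of the turn construction described above, assuming each dialogue is available as a list of `[user_utterance, system_utterance]` pairs. The function name `build_t2t_samples`, the pair layout, and the `<dialogue>_<turn>` id scheme are illustrative assumptions, not the actual dataloader code; the output fields follow the samples quoted in this comment.

```python
from typing import Dict, Iterator, List


def build_t2t_samples(
    dialogue_id: int, persona: List[str], turns: List[List[str]]
) -> Iterator[Dict[str, str]]:
    """Yield one sample per system response (hypothetical helper).

    The context (`text_1`) holds every utterance before the response,
    so the first sample contains exactly one user utterance.
    """
    history: List[str] = []
    for turn_idx, (user_utt, system_utt) in enumerate(turns):
        # Add the user utterance first, then emit the sample.
        history.append(f"U: {user_utt}")
        yield {
            "id": f"{dialogue_id}_{turn_idx}",  # assumed id scheme
            "text_1": " | ".join(history),
            "text_2": system_utt,
            "text_1_name": " | ".join(persona),
            "text_2_name": "response",
        }
        # The system response only becomes context for later turns.
        history.append(f"S: {system_utt}")


persona = [
    "Saya suka merombak rumah.",
    "saya suka pergi berburu.",
    "saya suka menembak busur.",
    "liburan favorit saya adalah halloween.",
]
turns = [
    [
        "Hai apa kabar ? Saya bersiap-siap untuk melakukan cheetah mengejar untuk tetap bugar.",
        "Anda pasti sangat cepat. berburu adalah salah satu hobi favorit saya.",
    ],
    [
        "saya ! untuk hobi saya, saya suka melakukan pengalengan atau sedikit merengek.",
        "saya juga merombak rumah ketika saya tidak berburu busur.",
    ],
]
samples = list(build_t2t_samples(0, persona, turns))
```

With this ordering, `samples[0]` has a single `U:` utterance in `text_1`, and `samples[1]` carries the full `U: ... | S: ... | U: ...` history.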
Hi @SamuelCahyawijaya, sure! I have just updated it based on your review. Could you please check again? Thank you!
Closes #36
Checkbox
- `nusantara/nusa_datasets/xpersona_id/xpersona_id.py` (please use only lowercase and underscore for dataset naming).
- `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_NUSANTARA_VERSION` variables.
- `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- `BUILDER_CONFIGS` class attribute is a list with at least one `NusantaraConfig` for the source schema and one for a nusantara schema.
- `datasets.load_dataset` function.
- `python -m tests.test_nusantara --path=nusantara/nusa_datasets/xpersona_id/xpersona_id.py`.