Closes #36 | feat(XPersona ID): add dataloader

IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.

Apache License 2.0

Closes #36 | feat(XPersona ID): add dataloader #201

Closed muhsatrio closed 2 years ago

muhsatrio commented 2 years ago

Closes #36

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script nusantara/nusa_datasets/xpersona_id/xpersona_id.py (please use only lowercase and underscore for dataset naming).
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _NUSANTARA_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one NusantaraConfig for the source schema and one for a nusantara schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_nusantara --path=nusantara/nusa_datasets/xpersona_id/xpersona_id.py.

muhsatrio commented 2 years ago

Hi @muhsatrio, the dataset works well. One more thing: could you also add the citation for the IndoNLG benchmark, so that it is consistent with the other datasets from that benchmark?

The BibTeX entry is as follows:

@inproceedings{cahyawijaya-etal-2021-indonlg,
    title = "{I}ndo{NLG}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Generation",
    author = "Cahyawijaya, Samuel  and
      Winata, Genta Indra  and
      Wilie, Bryan  and
      Vincentio, Karissa  and
      Li, Xiaohong  and
      Kuncoro, Adhiguna  and
      Ruder, Sebastian  and
      Lim, Zhi Yuan  and
      Bahar, Syafri  and
      Khodra, Masayu  and
      Purwarianti, Ayu  and
      Fung, Pascale",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.699",
    doi = "10.18653/v1/2021.emnlp-main.699",
    pages = "8875--8898"
}

Thank you!

Hi @SamuelCahyawijaya, I have made the adjustment for this comment. Kindly check again. Thank you!

muhsatrio commented 2 years ago

Hi @muhsatrio, thanks for contributing. I found a minor problem in the dataset. I think the first turn of a dialogue should contain only one user utterance, with no system utterance in its history. For example, the first t2t training sample is currently:
{'id': '0_0', 'text_1': 'U: Hai apa kabar ? Saya bersiap-siap untuk melakukan cheetah mengejar untuk tetap bugar. | S: Anda pasti sangat cepat. berburu adalah salah satu hobi favorit saya. | U: saya ! untuk hobi saya, saya suka melakukan pengalengan atau sedikit merengek.', 'text_2': 'saya juga merombak rumah ketika saya tidak berburu busur.', 'text_1_name': 'Saya suka merombak rumah. | saya suka pergi berburu. | saya suka menembak busur. | liburan favorit saya adalah halloween.', 'text_2_name': 'response'}
This should be the second sample for the given dialogue, and the first one should be:
{'id': '0_0', 'text_1': 'U: Hai apa kabar ? Saya bersiap-siap untuk melakukan cheetah mengejar untuk tetap bugar.', 'text_2': 'Anda pasti sangat cepat. berburu adalah salah satu hobi favorit saya.', 'text_1_name': 'Saya suka merombak rumah. | saya suka pergi berburu. | saya suka menembak busur. | liburan favorit saya adalah halloween.', 'text_2_name': 'response'}
This is important for a dialogue system since this represents the very first turn when the user starts to interact with the system. Could you please update the dataset accordingly? Thank you!
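
The requested behaviour can be sketched as below. This is only an illustration, not the code in the PR: the helper name `build_t2t_samples`, the input shape (a list of `(user, system)` utterance pairs plus a list of persona sentences), and the `{dialogue_idx}_{turn_idx}` id scheme are all assumptions. The point is that the user utterance is appended to the history before a sample is emitted and the system reply only afterwards, so the first sample holds exactly one user utterance.

```python
def build_t2t_samples(dialogue_idx, persona, turns):
    """Yield one text-to-text sample per (user, system) turn.

    Hypothetical helper: `turns` is assumed to be a list of
    (user_utterance, system_utterance) pairs, `persona` a list of strings.
    """
    history = []
    for turn_idx, (user_utt, system_utt) in enumerate(turns):
        # Add the user utterance first, so the very first sample
        # contains a single "U: ..." entry and no system utterance.
        history.append(f"U: {user_utt}")
        yield {
            "id": f"{dialogue_idx}_{turn_idx}",  # assumed id scheme
            "text_1": " | ".join(history),
            "text_2": system_utt,
            "text_1_name": " | ".join(persona),
            "text_2_name": "response",
        }
        # The system reply enters the history only for later turns.
        history.append(f"S: {system_utt}")


# Shortened, made-up inputs in the style of the samples quoted above.
persona = ["saya suka pergi berburu."]
turns = [
    ("Hai apa kabar ?", "Anda pasti sangat cepat."),
    ("saya !", "saya juga merombak rumah."),
]
samples = list(build_t2t_samples(0, persona, turns))
```

With these inputs, `samples[0]["text_1"]` holds only the opening user utterance, and `samples[1]["text_1"]` holds the user, system, user history joined with ` | `.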

Hi @SamuelCahyawijaya, sure! I have just updated it based on your review. Kindly check again, thank you!