IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0

Create dataset loader for XPersona Id #36

Closed SamuelCahyawijaya closed 2 years ago

SamuelCahyawijaya commented 2 years ago

https://indonlp.github.io/nusa-catalogue/card.html?xpersona_id

muhsatrio commented 2 years ago

self-assign

SamuelCahyawijaya commented 2 years ago

I think this one requires some formatting before it can fit the current schema. In general, we can follow the nusantara_t2t schema and add one example for each system-turn utterance, with text_1 holding the formatted dialogue history, text_2 the response sentence, text_1_name the persona, and text_2_name just the string "response".

The id can be "{dialogueid}{dialogue turn}". If no dialogue id is provided, just enumerate the data. For the dialogue_turn, we can enumerate the system utterances, with the first system utterance corresponding to 0.

The format of the text_1 could be something like: U: <user_utterance> | S: <system_utterance> | U: <user_utterance>
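To make the mapping concrete, here is a rough sketch of how one dialogue could be turned into nusantara_t2t-style records. It is only an illustration under several assumptions: the function name `dialogue_to_t2t` is hypothetical, each dialogue is assumed to arrive as a list of (user utterance, system utterance) pairs plus a list of persona sentences, the persona sentences are joined with a space, and the id uses an underscore between the dialogue id and the turn index to keep the two parts unambiguous.

```python
from typing import Dict, Iterator, List, Tuple


def dialogue_to_t2t(
    dialogue_id: str,
    persona: List[str],
    turns: List[Tuple[str, str]],
) -> Iterator[Dict[str, str]]:
    """Yield one text-to-text record per system utterance.

    Hypothetical helper: the input layout (persona sentences plus
    (user, system) pairs) is an assumption, not the confirmed source format.
    """
    history: List[str] = []
    for turn_idx, (user_utt, system_utt) in enumerate(turns):
        # The history for this record ends with the current user utterance.
        history.append(f"U: {user_utt}")
        yield {
            # Assumed separator "_" between dialogue id and turn index;
            # the first system utterance gets turn index 0.
            "id": f"{dialogue_id}_{turn_idx}",
            "text_1": " | ".join(history),
            "text_2": system_utt,
            # Assumption: persona sentences are joined into a single string.
            "text_1_name": " ".join(persona),
            "text_2_name": "response",
        }
        # The system reply becomes part of the history for later turns.
        history.append(f"S: {system_utt}")
```

If no dialogue id is available, the caller could pass the enumeration index of the dialogue as `dialogue_id` instead.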

muhsatrio commented 2 years ago

Okay, got it @SamuelCahyawijaya, thank you! For the source schema, do you have any suggestions on how I should implement it?