self-assign

SamuelCahyawijaya commented 2 years ago

I think this one needs some reformatting before it fits the current schema. In general, we can follow the nusantara_t2t schema and add one example per system utterance, with text_1 as the formatted dialogue history, text_2 as the response sentence, text_1_name as the persona, and text_2_name as just the string "response".

The id can be "{dialogueid}{dialogue turn}". If no dialogue id is provided, just enumerate the data. For dialogue_turn, enumerate the system utterances, with the first system utterance corresponding to 0.

The format of the text_1 could be something like: U: <user_utterance> | S: <system_utterance> | U: <user_utterance>
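
A minimal sketch of the conversion described above, not the actual loader. It assumes each dialogue arrives as a list of (user_utterance, system_utterance) pairs and the persona as a list of sentences; the function name `dialogue_to_t2t`, the space-joined persona, and the toy data are all hypothetical and only there for illustration.

```python
def dialogue_to_t2t(dialogue, persona, dialogue_id):
    """Turn one dialogue into one t2t-style example per system utterance.

    Assumptions (not confirmed by the dataset): `dialogue` is a list of
    (user_utterance, system_utterance) pairs, `persona` is a list of
    sentences, and `dialogue_id` is either the provided id or an
    enumeration index when the data has no id.
    """
    examples = []
    history = []
    for turn, (user_utt, system_utt) in enumerate(dialogue):
        # The history for this turn ends with the current user utterance.
        history.append(f"U: {user_utt}")
        examples.append(
            {
                # Concatenated exactly as "{dialogueid}{dialogue turn}";
                # the first system utterance gets turn 0.
                "id": f"{dialogue_id}{turn}",
                "text_1": " | ".join(history),
                "text_2": system_utt,
                # Joining persona sentences with a space is an assumption.
                "text_1_name": " ".join(persona),
                "text_2_name": "response",
            }
        )
        # The system reply becomes part of the history for later turns.
        history.append(f"S: {system_utt}")
    return examples


# Toy data, made up for illustration only.
persona = ["i like dogs.", "i am a teacher."]
dialogue = [
    ("hi, how are you?", "good, just walked my dog."),
    ("what do you do?", "i teach math."),
]

# No dialogue id is provided here, so an enumeration index (0) is used.
examples = dialogue_to_t2t(dialogue, persona, dialogue_id=0)
for ex in examples:
    print(ex["id"], ex["text_1"], "->", ex["text_2"])
```

Each system utterance yields one example, and its text_1 holds everything said before it in the U: ... | S: ... | U: ... format.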

muhsatrio commented 2 years ago

Okay, got it @SamuelCahyawijaya, thank you! For the source schema, do you have any suggestions on how I should implement it?

IndoNLP / nusa-crowd

Create dataset loader for XPersona Id #36
