HLTCHKUST / PAML

Personalizing Dialogue Agents via Meta-Learning

Question about dataset #6

Closed · soroushjavdan closed this issue 5 years ago

soroushjavdan commented 5 years ago

Hello, I downloaded the ConvAI2 dataset from ParlAI, and the size of train_self_original.txt was roughly twice that of the file in this repository. Why do the sizes differ, and which dataset should be used for training? I would also like to know how the *_persona_map files were created.

Thanks in advance

zlinao commented 5 years ago

Hi, our dataset was downloaded from the original Persona-Chat paper; the dataset from the ConvAI2 competition may be bigger. For the persona_map, we used K-means to cluster the persona descriptions into 1150 personas, since the original dataset has some noise in the persona descriptions. See https://github.com/HLTCHKUST/PAML/blob/master/data/ConvAI2/data_analysis.ipynb for more details.
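For anyone curious what the clustering step looks like in practice, here is a minimal sketch (not the authors' exact script, which lives in the linked notebook): it assumes persona descriptions are vectorized with TF-IDF and grouped with scikit-learn's KMeans, and the `persona_sentences` list is a hypothetical placeholder for the real descriptions.

```python
# Minimal sketch of clustering persona descriptions into ~1150 groups.
# Assumptions: TF-IDF features and scikit-learn's KMeans; the actual
# notebook (data/ConvAI2/data_analysis.ipynb) may use different features
# or preprocessing.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical placeholder; in practice this would hold every persona
# description sentence extracted from the dataset.
persona_sentences = [
    "i like to remodel homes",
    "i like to go hunting",
    "my favorite holiday is halloween",
    # ...
]

# Vectorize each persona sentence.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(persona_sentences)

# Cluster into 1150 personas (capped here so the toy example still runs).
n_clusters = min(1150, len(persona_sentences))
kmeans = KMeans(n_clusters=n_clusters, random_state=0)
cluster_ids = kmeans.fit_predict(features)

# Map each raw persona sentence to its cluster id, analogous in spirit
# to the *_persona_map files in the repository.
persona_map = dict(zip(persona_sentences, cluster_ids.tolist()))
print(persona_map)
```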

soroushjavdan commented 5 years ago

@zlinao Thanks for your answer.