chunk.pth in Reddit dataset

facebookresearch / EmpatheticDialogues

Dialogue model that produces empathetic responses when trained on the EmpatheticDialogues dataset.

chunk.pth in Reddit dataset #16

Closed by JiunHaoJhan 4 years ago

JiunHaoJhan commented 4 years ago

Hi, how do I convert the raw Reddit data into the chunk.pth file that reddit.py loads? I have downloaded the Reddit dataset, but I don't know how to process the raw data so that it works with the RedditDataset class in reddit.py.

I have checked the issue, but I still cannot work out what format the required file should have.

EricMichaelSmith commented 4 years ago

Hi there! Take a look at https://github.com/facebookresearch/EmpatheticDialogues/blob/master/empchat/datasets/reddit.py#L18 and see how data is structured once it's loaded. You'll need to create the 'w', 'cstart', and 'cend' keys, which hold the concatenated word tokens of all sentences, the start indices of all sentences, and the end indices of all sentences, respectively.
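
For illustration, here is a minimal sketch of how such a structure could be built. It is not the repository's preprocessing code, and it rests on several assumptions: that tokens are stored as integer ids from a vocabulary (the word_to_id mapping and unk_id below are hypothetical), that each end index is exclusive, and that the values should be saved as long tensors. RedditDataset may also expect keys beyond the three named above, so check reddit.py before relying on this.

```python
def build_chunk(sentences, word_to_id, unk_id=0):
    """Flatten tokenized sentences into 'w', 'cstart', and 'cend' lists."""
    words, starts, ends = [], [], []
    for tokens in sentences:
        # Start index of this sentence within the flat token list.
        starts.append(len(words))
        # Map each token to its id; unknown tokens fall back to unk_id.
        words.extend(word_to_id.get(tok, unk_id) for tok in tokens)
        # ASSUMPTION: the end index is exclusive (one past the last token).
        ends.append(len(words))
    return {"w": words, "cstart": starts, "cend": ends}


def save_chunk(chunk, path):
    """Save the chunk with PyTorch (imported here so the rest runs without it)."""
    import torch

    # ASSUMPTION: values are stored as 1-D long tensors.
    tensors = {k: torch.tensor(v, dtype=torch.long) for k, v in chunk.items()}
    torch.save(tensors, path)


# Hypothetical toy vocabulary and already-tokenized sentences.
word_to_id = {"<unk>": 0, "hello": 1, "there": 2, "how": 3, "are": 4, "you": 5}
sentences = [["hello", "there"], ["how", "are", "you"]]

chunk = build_chunk(sentences, word_to_id)
# Sentence i can be recovered as chunk["w"][chunk["cstart"][i]:chunk["cend"][i]].
```

Calling save_chunk(chunk, "chunk0.pth") would then write the file; the file name here is only an example, so match whatever path reddit.py actually constructs.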

JiunHaoJhan commented 4 years ago

Thanks for your reply.