facebookresearch / EmpatheticDialogues

Dialogue model that produces empathetic responses when trained on the EmpatheticDialogues dataset.

Files in the Reddit_data_folder #14

Closed sravankumarlalam closed 5 years ago

sravankumarlalam commented 5 years ago

Hi, may I know which files should be present in the REDDIT_DATA_FOLDER and what the formats of those files are? That would make it easier for me to convert the raw Reddit dataset into the files needed for pre-training the model.

EricMichaelSmith commented 5 years ago

Hi there,

See https://github.com/facebookresearch/EmpatheticDialogues/blob/master/empchat/datasets/reddit.py#L18 for the format of the data in this folder: it should consist of a series of numbered, chunked PyTorch .pth files that contain the keys ('w', 'cstart', etc.) indicated in that function.
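If it helps, here's a rough sketch of how one such chunk might be written out. Only the 'w' and 'cstart' keys come from the linked code; the filename pattern, the tensor semantics, and the save_reddit_chunk helper itself are illustrative rather than exact, so check RedditDataset for the authoritative layout:

```python
import torch

def save_reddit_chunk(token_ids, comment_starts, chunk_id, folder):
    """Hypothetical writer for one Reddit data chunk.

    token_ids: flat list of word-dictionary indices for all comments
    comment_starts: offset of each comment's first token within token_ids
    """
    chunk = {
        "w": torch.LongTensor(token_ids),            # flattened token indices
        "cstart": torch.LongTensor(comment_starts),  # per-comment start offsets
        # ... plus the other keys read in RedditDataset (see reddit.py#L18)
    }
    # filename pattern is an assumption; match whatever loader.py expects
    torch.save(chunk, f"{folder}/chunk{chunk_id}.pth")
```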

sravankumarlalam commented 5 years ago

Thanks for your reply. I have a few questions:

  1. In https://github.com/facebookresearch/EmpatheticDialogues/blob/f6352667bb1547ebeac68cd07932597a49f0167d/retrieval_train.py#L216, train_data is taken from only one chunked PyTorch .pth file, i.e., you are passing epoch_id % 999 as the argument to https://github.com/facebookresearch/EmpatheticDialogues/blob/f6352667bb1547ebeac68cd07932597a49f0167d/empchat/datasets/loader.py#L154 ... But isn't it supposed to take all of the chunked .pth files for one epoch? Or am I wrong?
  2. How many chunked .pth files do you have for the 1.7B Reddit comments?
EricMichaelSmith commented 5 years ago
  1. We have this set up so that 1 epoch == 1 chunked .pth file.
  2. There are 1000 chunked .pth files: files 0 through 998 are used for training and file 999 is used for validation (see line 190 of loader.py; sketched below).
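For reference, here's a minimal sketch of that chunk-selection scheme (the real logic lives in retrieval_train.py and loader.py; this is only illustrative):

```python
N_CHUNKS = 1000  # chunks 0..999

def chunk_for(epoch_id: int, valid: bool = False) -> int:
    """Pick which chunked .pth file to load for a given epoch."""
    if valid:
        return 999             # chunk 999 is held out for validation
    return epoch_id % 999      # training cycles through chunks 0..998

assert chunk_for(0) == 0
assert chunk_for(999) == 0     # wraps back to chunk 0 after 999 epochs
assert chunk_for(5, valid=True) == 999
```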
sravankumarlalam commented 5 years ago

Thanks for the clarification. I still don't understand what the keys in the RedditDataset class represent. Could I have a sample of the Reddit dataset to understand the format, or could you just tell me what __getitem__ in that class returns, i.e., what context and pos represent? Also, how did you generate the word_dictionary file in the REDDIT_DATA_FOLDER? Does it contain all the unique tokens from the Reddit corpus?

Thanks

EricMichaelSmith commented 5 years ago

Hi! I can't give you a sample of the Reddit dataset, but I can help clarify things. context and pos are both Tensors that contain tokenized text: context encodes the text of the context, which goes into the context encoder, and pos encodes the text of the candidate that gets passed into the candidate encoder. Hopefully this makes things clearer?
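As a purely illustrative sketch (the tokenize_to_ids helper and the whitespace tokenization below are assumptions, not the repo's actual preprocessing), a __getitem__-style pair might look like this:

```python
import torch

def example_item(dictionary, context_text, candidate_text):
    """Build an illustrative (context, pos) pair of token-ID tensors."""
    def tokenize_to_ids(text):
        # assumed whitespace tokenization; the real tokenizer may differ
        return torch.LongTensor([dictionary[w] for w in text.split()])

    context = tokenize_to_ids(context_text)    # fed to the context encoder
    pos = tokenize_to_ids(candidate_text)      # fed to the candidate encoder
    return context, pos
```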

Also, yes, word_dictionary contains all Reddit tokens. It has the following keys:
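As a rough illustration of how a dictionary like that could be built and saved, assuming hypothetical key names ('words', 'iwords', and 'wordcounts' are placeholders, not necessarily the actual keys):

```python
import torch
from collections import Counter

def build_word_dictionary(comments, out_path):
    """Collect every unique token in the corpus into a saved dictionary."""
    counts = Counter(tok for c in comments for tok in c.split())
    words = sorted(counts)                             # one entry per unique token
    dictionary = {
        "words": {w: i for i, w in enumerate(words)},  # token -> index
        "iwords": words,                               # index -> token
        "wordcounts": dict(counts),                    # token frequencies
    }
    torch.save(dictionary, out_path)
```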

sravankumarlalam commented 5 years ago

Thanks a lot. It's really helpful!

EricMichaelSmith commented 5 years ago

Sure thing - happy to help!