Closed sravankumarlalam closed 5 years ago
Hi there,
See https://github.com/facebookresearch/EmpatheticDialogues/blob/master/empchat/datasets/reddit.py#L18 for the format of the data in this folder: it should consist of a series of numbered, chunked PyTorch .pth files that contain the keys ('w', 'cstart', etc.) indicated in that function.
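One way to check which keys a chunk file contains is to load it and list what each key holds. The following is only a minimal sketch: the file name toy_chunk.pth, the helper describe_chunk, and the tensor values are made up for illustration, and only the key names 'w' and 'cstart' come from this thread.

```python
import os
import tempfile

import torch


def describe_chunk(path):
    """Load one chunk file and summarize what each key holds."""
    data = torch.load(path)
    summary = {}
    for key, value in data.items():
        if torch.is_tensor(value):
            summary[key] = f"Tensor{tuple(value.shape)} {value.dtype}"
        else:
            summary[key] = type(value).__name__
    return summary


# Toy stand-in for a real chunk. Only the two key names appear in the thread;
# the values and their meaning here are assumptions.
toy_chunk = {
    "w": torch.tensor([4, 8, 15, 16, 23, 42]),
    "cstart": torch.tensor([0, 3]),
}

with tempfile.TemporaryDirectory() as folder:
    toy_path = os.path.join(folder, "toy_chunk.pth")  # hypothetical file name
    torch.save(toy_chunk, toy_path)
    chunk_summary = describe_chunk(toy_path)

for key, description in chunk_summary.items():
    print(key, "->", description)
```

Pointing describe_chunk at a real chunk file should show every key that the linked function expects.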
Thanks for your reply. I have a few questions.
Thanks for the clarification. I still don't understand what the keys in the RedditDataset class represent. Could I have a sample of the Reddit dataset to understand the format, or could you tell me what __getitem__ in the class returns, i.e. what context and pos represent? Also, how did you generate the word_dictionary file in the REDDIT_DATA_FOLDER? Does it contain all the unique tokens from the Reddit corpus?
Thanks.
Hi! I can't give you a sample of the Reddit dataset, but I can help clarify things. context and pos are both Tensors that contain tokenized text: context encodes the text of the context, which goes into the context encoder, and pos encodes the text of the candidate that gets passed into the candidate encoder. Hopefully this makes things clearer?
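As a rough illustration of what "tokenized text" means for context and pos, the sketch below maps two strings to lists of word indices. Everything in it is an assumption made for illustration: the toy vocabulary, the whitespace tokenizer, the "<unk>" fallback, and the helper name encode. Plain lists stand in for the Tensors the real dataset returns.

```python
def encode(text, word_to_idx, unk_idx):
    """Map whitespace-separated, lowercased tokens to integer indices."""
    return [word_to_idx.get(token, unk_idx) for token in text.lower().split()]


# Toy vocabulary (hypothetical); a real one would come from word_dictionary.
toy_words = {
    "<unk>": 0,
    "i": 1,
    "lost": 2,
    "my": 3,
    "keys": 4,
    "that": 5,
    "sounds": 6,
    "stressful": 7,
}

# context: the text fed to the context encoder.
context = encode("I lost my keys", toy_words, unk_idx=0)
# pos: the candidate response fed to the candidate encoder.
pos = encode("That sounds stressful", toy_words, unk_idx=0)

print("context:", context)
print("pos:", pos)
```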
Also, yes, word_dictionary contains all Reddit tokens. It has the following keys:
words: dict with words as keys and word idxs as values
iwords: list of words in order (where the index of each word is given by words, above)
wordcounts: 1D Tensor indexed the same way as iwords, where each value is the frequency of that word in the corpus
Thanks a lot. It is really helpful.
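For anyone building their own word_dictionary, the three keys described above could be produced along these lines. This is a sketch under stated assumptions, not the script used for the original data: the helper name build_word_dictionary, the frequency-based ordering, and the absence of special tokens are all guesses, and a plain list stands in for the 1D Tensor to keep the example dependency-free.

```python
from collections import Counter


def build_word_dictionary(tokenized_sentences):
    """Build the words / iwords / wordcounts layout from tokenized text."""
    counts = Counter(
        token for sentence in tokenized_sentences for token in sentence
    )
    # Assumption: order words by descending frequency.
    iwords = [word for word, _ in counts.most_common()]
    # words maps each word to its position in iwords.
    words = {word: idx for idx, word in enumerate(iwords)}
    # wordcounts is indexed the same way as iwords.
    # The thread describes a 1D Tensor; a list is used here as a stand-in.
    wordcounts = [counts[word] for word in iwords]
    return {"words": words, "iwords": iwords, "wordcounts": wordcounts}


toy_corpus = [["hello", "world"], ["hello", "there"]]
word_dictionary = build_word_dictionary(toy_corpus)

for idx, word in enumerate(word_dictionary["iwords"]):
    print(idx, word, word_dictionary["wordcounts"][idx])
```

With PyTorch installed, wrapping the counts in torch.tensor(wordcounts) and saving the dict with torch.save would bring it closer to the described format.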
Sure thing - happy to help!
Hi, may I know what files should be present in the REDDIT_DATA_FOLDER and what the formats of those files are? That would make it easier for me to convert the raw Reddit dataset into the files required for pre-training the model.