Open amillert opened 4 years ago
@Namibillow, @anareyegen, @emrecanbaz, let's finally decide together what type of data we need in the corpus (for this task we're generating n-grams). Options to consider:
My replies to your questions:
Do we care about case sensitivity?
We can just lowercase everything.

Do we want to generate "dialogs"? If not, we may remove sentences that contain quotation marks, etc. from the raw data.
Hmm, most of the stories seem to contain conversations. I think we can keep them as they are.

Is punctuation relevant to us? Do we treat punctuation marks as tokens then?
I say keep them, since punctuation marks are relevant and we would like to generate text with punctuation. But we should be careful not to split abbreviations such as "Mr." or hyphenated phrases such as "mother-in-law"; each of these should be treated as a single token.

Are we fine with manually preprocessing some documents, such as removing chapter names or some descriptions? It will simplify the code at least a bit.
It depends on what kind of manual preprocessing. If it means removing metadata and descriptions for the training set, then I say why not, as long as it's doable.
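To make the lowercasing and punctuation points concrete, here is a rough sketch of what the tokenization could look like. It is only an illustration, not something that exists in the repo: the names `ABBREVIATIONS`, `TOKEN_PATTERN`, `tokenize` and `ngrams` are made up, and the abbreviation list is a placeholder that we would need to extend for our corpus.

```python
import re

# Hypothetical abbreviation list (an assumption); extend it to match the corpus.
ABBREVIATIONS = ("mr.", "mrs.", "ms.", "dr.", "e.g.", "i.e.")

# Alternatives are tried in order at each position:
#   1. a known abbreviation, kept together with its trailing period
#   2. a word, optionally joined by hyphens or apostrophes ("mother-in-law")
#   3. any other single non-space character, i.e. a punctuation token
TOKEN_PATTERN = re.compile(
    "|".join(re.escape(abbr) for abbr in ABBREVIATIONS)
    + r"|\w+(?:[-']\w+)*"
    + r"|[^\w\s]",
    re.IGNORECASE,
)


def tokenize(text, lowercase=True):
    """Split text into word and punctuation tokens."""
    if lowercase:
        text = text.lower()
    return TOKEN_PATTERN.findall(text)


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize('Mr. Smith met his mother-in-law, "Hello!"')
bigrams = ngrams(tokens, 2)
```

With this pattern the quotation marks survive as separate tokens, so dialogs stay in the data. If we later decide to drop them, filtering out the `"` token would be enough.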
PyTorch Dataset's perspective,