Preprocess Data - Githubissues

amillert commented 4 years ago

Ensure the data is in the proper format from the PyTorch Dataset's perspective.
Ensure the data is correct with respect to the learning process.

amillert commented 4 years ago

@Namibillow, @anareyegen, @emrecanbaz, let's finally decide together what type of data we need in the corpus (for this task we're generating n-grams). Options to consider:

Do we care about case sensitivity?
Do we want to generate "dialogs"? If not, we may remove sentences from raw data that contain quotation marks, etc.
Is punctuation relevant to us? If so, do we treat punctuation marks as tokens?
Are we fine with manually preprocessing some documents, such as removing chapter names or some descriptions? It will simplify the code at least a bit.

Namibillow commented 4 years ago

My replies to your questions:

Do we care about case sensitivity?
We can just lowercase everything.

Do we want to generate "dialogs"? If not, we may remove sentences from raw data that contain quotation marks, etc.
Hmm, most of the stories seem to contain conversations. I think we can keep them as they are.

Is punctuation relevant to us? If so, do we treat punctuation marks as tokens?
I say keep them, since punctuation marks are relevant and we would like to generate text with punctuation. But we should be careful not to split abbreviations like "Mr." or hyphenated phrases like "mother-in-law", and rather treat each as a single token.

Are we fine with manually preprocessing some documents, such as removing chapter names or some descriptions? It will simplify the code at least a bit.
It depends on what kind of manual preprocessing. If it means removing metadata and descriptions from the training set, then I say why not, as long as it's doable.
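
To make the choices above concrete, here is a rough sketch of what such a tokenizer could look like: lowercase everything, keep punctuation marks as separate tokens, and keep abbreviations like "Mr." and hyphenated words like "mother-in-law" whole. The abbreviation list and the names `ABBREVIATIONS`, `tokenize`, and `ngrams` are assumptions for illustration only, not code from the repository.

```python
import re

# Hypothetical list of abbreviations to keep as single tokens; extend as needed.
ABBREVIATIONS = ("mr.", "mrs.", "ms.", "dr.", "st.")

# Alternatives are tried in order: known abbreviations first, then words
# (allowing inner hyphens or apostrophes), then any single punctuation mark.
TOKEN_RE = re.compile(
    "|".join(re.escape(a) for a in ABBREVIATIONS)
    + r"|\w+(?:[-']\w+)*"
    + r"|[^\w\s]"
)


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    tokens = tokenize('"Hello," said Mr. Smith to his mother-in-law.')
    print(tokens)
    print(ngrams(tokens, 3)[:2])
```

Quotation marks come out as ordinary tokens here, so dialog stays in the corpus. A sentence that happens to end in a listed abbreviation would keep its final period attached to that token, which is probably acceptable for n-gram generation.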

amillert / pic2story

Preprocess Data #9