css-research / hw02


Question about tokenization for training set and validation set #2

Open sanittawan opened 5 years ago

sanittawan commented 5 years ago

Hi Professor Soltoff,

I have another question about tokenization. Since we were given the train and validation sets separately, I am a bit confused about whether we have to (1) tokenize them separately or (2) pool them together, tokenize, and then split. What I am most curious about is this: since we are looking at the 10,000 most common words, if we tokenize the sets separately, won't the tokenizer find the 10,000 most common words in the train and validation sets separately? Would tokenizing the two files separately give a different result from joining the train and validation sets together and then tokenizing?

Specifically, the example in section 6.1.3 of the book on the IMDB data set (listing 6.19 in the Python version, or here) shows how they tokenize one big training dataset before splitting it into a training set and a validation set.

Thank you in advance!

bensoltoff commented 5 years ago

To avoid data leakage, you should fit the tokenizer on the training set only. Basically, use the keras tokenizer function to construct your dictionary of most frequent terms from the training set. Then you can use that fitted tokenizer to process all three data sets. This ensures consistency (i.e. no unexpected words in the validation set) without using the validation or test sets to determine preprocessing.

In practice, there should be very little difference across the three datasets since they are random samples and each is sufficiently large.
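For reference, a minimal sketch of that workflow with the Keras tokenizer might look like the following. The variable names (train_texts, val_texts, test_texts) and the num_words/maxlen values are illustrative assumptions, not part of the assignment:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical lists of raw text documents for each split
train_texts = ["the movie was great", "terrible plot and acting"]
val_texts = ["surprisingly good film"]
test_texts = ["not worth watching"]

# Fit the tokenizer on the training texts only,
# keeping the 10,000 most frequent words
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(train_texts)

# Reuse the same fitted tokenizer on all three splits; words that never
# appeared in the training vocabulary are simply dropped
x_train = pad_sequences(tokenizer.texts_to_sequences(train_texts), maxlen=100)
x_val = pad_sequences(tokenizer.texts_to_sequences(val_texts), maxlen=100)
x_test = pad_sequences(tokenizer.texts_to_sequences(test_texts), maxlen=100)
```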

sanittawan commented 5 years ago

Got it. Thank you!