Error when a new token appears in the valid/test dataset #8

First of all, thanks for such a nice paper and codebase! I'm trying to train a text generation model on my own dataset. The tokenize function in data.py https://github.com/Smerity/sha-rnn/blob/218d748022dbcf32d50bbbb4d151a9b6de3f8bba/data.py#L34 uses split() to tokenize each sentence in the training set and adds an id to the dictionary for every new token. In the valid/test sets, however, new tokens are neither added to the dictionary nor mapped to an unknown token, so the following error pops up.

Producing dataset...
Traceback (most recent call last):
  File "main.py", line 121, in <module>
    corpus = data.Corpus(args.data)
  File "/home/haha/sha-rnn/data.py", line 31, in __init__
    self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
  File "/home/haha/sha-rnn/data.py", line 60, in tokenize
    ids[token] = self.dictionary.word2idx[word]
KeyError: 'bower_components'
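
One possible workaround is to reserve an unknown-token id and fall back to it for any word that was not seen during training. The snippet below is only a minimal sketch of that idea, not the repository's code: the `Dictionary` class, the `lookup` method, the `grow_vocab` flag, and the `<unk>` string are all hypothetical names chosen for illustration, and it returns plain Python lists where the real data.py may build tensors.

```python
class Dictionary:
    """Hypothetical stand-in for the vocabulary class in data.py."""

    def __init__(self, unk_token="<unk>"):
        self.word2idx = {}
        self.idx2word = []
        self.unk_token = unk_token
        # Reserve an id up front so unseen tokens always have a target.
        self.add_word(unk_token)

    def add_word(self, word):
        # Assign the next free id to a word the first time it is seen.
        if word not in self.word2idx:
            self.word2idx[word] = len(self.idx2word)
            self.idx2word.append(word)
        return self.word2idx[word]

    def lookup(self, word):
        # Fall back to the <unk> id instead of raising KeyError.
        return self.word2idx.get(word, self.word2idx[self.unk_token])


def tokenize(lines, dictionary, grow_vocab):
    """Map whitespace-split tokens to ids.

    grow_vocab=True  adds unseen tokens to the dictionary (training split).
    grow_vocab=False maps unseen tokens to <unk> (valid/test splits).
    """
    ids = []
    for line in lines:
        for word in line.split() + ["<eos>"]:
            if grow_vocab:
                ids.append(dictionary.add_word(word))
            else:
                ids.append(dictionary.lookup(word))
    return ids


dictionary = Dictionary()
train_ids = tokenize(["npm install"], dictionary, grow_vocab=True)
valid_ids = tokenize(["npm install bower_components"], dictionary, grow_vocab=False)
print(train_ids, valid_ids)
```

With this fallback the valid/test splits load without a KeyError. One caveat: if `<unk>` never occurs in the training data, its embedding is never trained, so a common complement is to replace rare training tokens with `<unk>` as well.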

Would you recommend using a different tokenization method (such as WordPiece) here?

Thanks again~
