Smerity / sha-rnn

Single Headed Attention RNN - "Stop thinking with your head"

error happened when new token appears in the valid/test data set #8

Open carter54 opened 4 years ago

carter54 commented 4 years ago

Thanks first for such a nice paper and work! I'm trying to train a text generation model on my own dataset. The tokenize function in data.py https://github.com/Smerity/sha-rnn/blob/218d748022dbcf32d50bbbb4d151a9b6de3f8bba/data.py#L34 uses split() to tokenize sentences in the training set and adds each token's id to the dictionary. But new tokens in the valid/test set are neither added to the dictionary nor mapped to an unknown token, so the following error pops up.

Producing dataset...
Traceback (most recent call last):
  File "main.py", line 121, in <module>
    corpus = data.Corpus(args.data)
  File "/home/haha/sha-rnn/data.py", line 31, in __init__
    self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
  File "/home/haha/sha-rnn/data.py", line 60, in tokenize
    ids[token] = self.dictionary.word2idx[word]
KeyError: 'bower_components'

Would you recommend using a different tokenization method (like word-piece) here?

Thanks again~
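The failure mode above can be reproduced in isolation. This is a minimal sketch (with made-up toy text, not the repo's actual data or code): the vocabulary is built only from the training text, so a direct `word2idx[word]` lookup raises `KeyError` on any token that first appears in valid/test.

```python
# Build the vocab from training text only, as the tokenize() in data.py does.
train_text = "the cat sat"
valid_text = "the bower_components cat"  # 'bower_components' never seen in training

word2idx = {}
for word in train_text.split():
    if word not in word2idx:
        word2idx[word] = len(word2idx)

try:
    # Direct lookup with no fallback: fails on the first unseen token.
    ids = [word2idx[word] for word in valid_text.split()]
except KeyError as e:
    print("KeyError:", e)  # KeyError: 'bower_components'
```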

Smerity commented 4 years ago

The main issue is if a token occurs in validation or test without appearing in training then it's bad news for the model. The weights will be uninitialized at best.

Using wordpieces would likely be the best solution. You could also do what the Penn Treebank (PTB) did and add each of the words found in validation/test to the start or end of the training file. Not an optimal solution, but it is a solution at least. Alternatively, you could add an unknown token (<unk>) to the dataset.
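The &lt;unk&gt; option can be sketched like this (this is not the repo's code, just a hypothetical illustration): reserve an id for `<unk>` when building the vocabulary from training data, then map any out-of-vocabulary token in valid/test to that id instead of raising `KeyError`.

```python
UNK = "<unk>"

def build_vocab(text):
    # Reserve id 0 for the unknown token, then add training words in order.
    word2idx = {UNK: 0}
    for word in text.split():
        word2idx.setdefault(word, len(word2idx))
    return word2idx

def tokenize(text, word2idx):
    # .get() falls back to the <unk> id for any word not seen in training.
    return [word2idx.get(word, word2idx[UNK]) for word in text.split()]

vocab = build_vocab("the cat sat")
print(tokenize("the bower_components cat", vocab))  # [1, 0, 2]
```

Note that if `<unk>` never appears in the training text itself, its embedding still gets no gradient updates during training, which is part of why Smerity suggests also injecting it (or the unseen words) into the training data.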