Open · carter54 opened 4 years ago
Thanks first for such a nice paper and great work! I'm trying to train a text generation model on my own dataset. The tokenize function in data.py (https://github.com/Smerity/sha-rnn/blob/218d748022dbcf32d50bbbb4d151a9b6de3f8bba/data.py#L34) uses split() to tokenize the training data and adds each token's id to the dict. But for the valid/test data, new tokens are neither added to the dict nor tagged as an unknown token, so the lookup fails with an error.

Do you recommend using another tokenization method (like wordpiece) here?

Thanks again~
The main issue is that if a token occurs in validation or test without appearing in training, it's bad news for the model: its weights will be uninitialized at best.

Using wordpieces would likely be the best solution. You could also do what the Penn Treebank (PTB) did and add each of the words found in validation/test at the start or end of the training file; not an optimal solution, but a solution at least. You could also add an unknown token (<unk>) to the dataset and map any out-of-vocabulary word to it.
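For the <unk> route, here is a minimal sketch of how the tokenize method could be adapted, assuming an awd-lstm-lm-style Dictionary whose add_word returns the token id (the extend_vocab flag is a hypothetical addition; pass True for the training file only):

```python
import torch

def tokenize(self, path, extend_vocab=False):
    """Tokenize a text file, mapping out-of-vocabulary words
    to <unk> instead of failing on the word2idx lookup."""
    # Make sure <unk> always has an id, even if it never appears
    # in the training text. (Assumes add_word returns the id,
    # as in the awd-lstm-lm style Dictionary.)
    unk_id = self.dictionary.add_word('<unk>')
    ids = []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            for word in line.split() + ['<eos>']:
                if extend_vocab:
                    # Training: grow the vocabulary as before.
                    ids.append(self.dictionary.add_word(word))
                else:
                    # Valid/test: fall back to <unk> for unseen words.
                    ids.append(self.dictionary.word2idx.get(word, unk_id))
    return torch.LongTensor(ids)
```

This folds the original two passes (build vocab, then look up ids) into one and guarantees every word in valid/test maps to some id the model has a trained embedding for.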
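For the wordpiece route, a sketch using the Hugging Face tokenizers package (an external dependency, not part of this repo; the file name and vocab size are placeholders):

```python
# pip install tokenizers
from tokenizers import BertWordPieceTokenizer

# Train a wordpiece vocabulary on the training split only.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(files=['train.txt'], vocab_size=30000)

# Words never seen in training decompose into known subwords,
# so no token id is missing at evaluation time.
enc = tokenizer.encode('a word never seen in training')
print(enc.tokens)  # subword pieces, all in the trained vocab
print(enc.ids)
```

Because unseen words break down into subwords that are already in the vocabulary, the PTB-style hack of copying valid/test words into the training file becomes unnecessary.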