Closed armingh2000 closed 1 year ago
The problem can be fixed by changing only the predict function, so there is no need to change the tokenize function.
Have you looked at the refactored implementation? We do add space characters to the tokens too. I'm closing this as resolved; if you feel something is missing, feel free to re-open!
I think that when the tokenize function is tokenizing words, it should add the space character to the tokens too. Otherwise, the predict function will assume '' between words, and the predictions won't have spaces between them (which can be solved by changing the predict function's return to this line:
return ''.join([vocab.idx_to_token[i] + ' ' for i in outputs])
). I think the tokenized output should change like this:
[line.split() for line in lines] + [[' ']]
If I'm right, I can make a PR for both the tokenize and predict functions (although for predict I might have to change the function's inputs as well, so it can recognize whether it's a char-level or word-level RNN).
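To make the two options concrete, here is a minimal sketch of both fixes. It assumes a d2l-style `tokenize` function and a vocab exposing an `idx_to_token` list; the function names and the `predict_words` helper are illustrative, not the exact library code. Note that `' '.join(...)` avoids the trailing space that appending `' '` to every token would leave:

```python
def tokenize(lines, token='word'):
    """Split text lines into word or character tokens."""
    if token == 'word':
        # Option 1 (tokenize fix): append the space character as its own
        # token so the vocab contains ' ' for word-level models too.
        return [line.split() for line in lines] + [[' ']]
    elif token == 'char':
        return [list(line) for line in lines]
    raise ValueError('unknown token type: ' + token)

def predict_words(outputs, idx_to_token):
    # Option 2 (predict fix): keep tokenize unchanged and join the
    # predicted word tokens with spaces at output time instead.
    return ' '.join(idx_to_token[i] for i in outputs)

idx_to_token = ['the', 'time', 'machine']  # toy vocab for illustration
print(predict_words([0, 1, 2], idx_to_token))  # the time machine
```

Either fix alone solves the missing-space symptom; doing it in predict keeps the vocab free of a special `' '` token, while doing it in tokenize keeps predict uniform across char-level and word-level models.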