Closed armingh2000 closed 1 year ago
The problem can be fixed by changing only the predict function, so there is no need to change the tokenize function.
Have you looked at the refactored implementation? We do add space characters to the tokens too. I'm closing this as resolved; if you feel something is missing, feel free to re-open!
I think that when the tokenize function is tokenizing words, it should add the space character to the tokens too. Otherwise, the predict function will assume '' between words, and the predictions won't have spaces between them (which can be solved by changing the predict function's return to this line:
return ''.join([vocab.idx_to_token[i] + ' ' for i in outputs])
). I think the tokenized output should change like this:
[line.split() for line in lines] + [[' ']]
If I'm right, I can make a PR for both the tokenize and predict functions (although for predict I might have to change the function's inputs as well, so it can recognize whether it's a char-level or word-level RNN).
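To make the two options concrete, here is a minimal sketch of both fixes. It assumes a d2l-style `tokenize` function and a vocab exposing an `idx_to_token` list; the function names and the `predict_words` helper are illustrative, not the exact library code. Note that `' '.join(...)` avoids the trailing space that appending `' '` to every token would leave:

```python
def tokenize(lines, token='word'):
    """Split text lines into word or character tokens."""
    if token == 'word':
        # Option 1 (tokenize fix): append the space character as its own
        # token so the vocab contains ' ' for word-level models too.
        return [line.split() for line in lines] + [[' ']]
    elif token == 'char':
        return [list(line) for line in lines]
    raise ValueError('unknown token type: ' + token)

def predict_words(outputs, idx_to_token):
    # Option 2 (predict fix): keep tokenize unchanged and join the
    # predicted word tokens with spaces at output time instead.
    return ' '.join(idx_to_token[i] for i in outputs)

idx_to_token = ['the', 'time', 'machine']  # toy vocab for illustration
print(predict_words([0, 1, 2], idx_to_token))  # the time machine
```

Either fix alone solves the missing-space symptom; doing it in predict keeps the vocab free of a special `' '` token, while doing it in tokenize keeps predict uniform across char-level and word-level models.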