theodorehu95 closed this issue 4 years ago
The get_words function is actually called with the full vocabulary, i.e. with min_word_freq=0, to account for all possible words (lines 89 and 90 in train.py). No real word is treated as unknown in that call, so when word_id==0 the end of the URL sequence has already been reached and we can break the loop. This is done so that every possible word is delimited in word_x and the character-level sequence of each word is accounted for. Unknown words are defined separately, as those not in high_freq_words, in the ngram_id_x function (line 91 in train.py).
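To illustrate why break is safe here, a minimal sketch follows. It is not the repository's code: extract_words, id_to_word and the sample ids are hypothetical names and data, and it assumes that id 0 only ever appears as trailing padding once the full vocabulary is used.

```python
def extract_words(padded_ids, id_to_word):
    """Collect words from a zero-padded id sequence (hypothetical helper).

    Assumption: the vocabulary was built with min_word_freq=0, so every real
    word has a non-zero id and 0 can only be padding at the end of the URL.
    """
    words = []
    for word_id in padded_ids:
        if word_id == 0:
            # First 0 marks the end of the URL; everything after it is padding.
            break
        words.append(id_to_word[word_id])
    return words


# Hypothetical vocabulary and a URL padded to length 7.
id_to_word = {1: "http", 2: "example", 3: "com", 4: "login"}
url_ids = [1, 2, 3, 4, 0, 0, 0]

print(extract_words(url_ids, id_to_word))
```

Under this assumption, continue would return the same words, but it would keep iterating over the padding for no benefit. The two only differ if an id of 0 can appear in the middle of a sequence, which should not happen with the full vocabulary.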
utils.py line 120
When extracting the words, why does the loop break when the word_id is Unknown? Should it be "continue" instead, so that unknown words are skipped?