Antimalweb / URLNet

Code for the paper URLNet - Learning a URL Representation with Deep Learning for Malicious URL Detection
Apache License 2.0

utils.py get_words function #12

Closed by theodorehu95 4 years ago

theodorehu95 commented 5 years ago

utils.py line 120

When extracting the words, why does the loop break when word_id is unknown? Shouldn't it use "continue" instead, to skip unknown words?

henryhungle commented 5 years ago

The get_words function is actually called with the full vocabulary, i.e. with min_word_freq=0, so that every possible word is covered (lines 89 and 90 in train.py). Therefore, when word_id == 0, the URL sequence has already ended and we can break out of the loop. This is done so that all possible words are delimited in word_x and the character-level sequence of each word is accounted for. Unknown words are defined as those not in high_freq_words, in the ngram_id_x function (line 91 in train.py).
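To illustrate the point above, here is a minimal sketch and not the repository's actual code. It assumes that id 0 is used only for trailing padding (which holds when the vocabulary is built with min_word_freq=0, so no real word maps to 0). The names words_until_padding, pad_id, and the toy reverse_dict are hypothetical and exist only for this example.

```python
def words_until_padding(id_sequences, reverse_dict, pad_id=0):
    """Map padded id sequences back to words, stopping at the first pad_id.

    Assumption: pad_id never stands for a real word, so the first
    occurrence marks the end of the URL's word sequence.
    """
    result = []
    for ids in id_sequences:
        words = []
        for word_id in ids:
            if word_id == pad_id:
                # Everything after this point is padding, so stop early.
                break
            words.append(reverse_dict[word_id])
        result.append(words)
    return result


# Toy vocabulary and two URLs padded to length 5 (hypothetical data).
reverse_dict = {1: "www", 2: "example", 3: "com"}
padded = [[1, 2, 3, 0, 0], [2, 3, 0, 0, 0]]

recovered = words_until_padding(padded, reverse_dict)
print(recovered)
```

Under this assumption, "continue" would give the same words, since the zeros only appear at the end; "break" just avoids scanning the padding.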