Open LINBEIXL opened 8 months ago
This step is there to avoid data leakage. The embedding layer has a fixed size, so even if you include a new log event from the test set, its embedding cannot be learned during training. These new events are therefore all mapped to the unknown index. This of course causes an OOV (out-of-vocabulary) issue for a parser-based method.
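To make the mapping described above concrete, here is a minimal sketch of a vocabulary built from training events only. The `Vocab` class, the `<pad>`/`<unk>` tokens, and the event IDs are hypothetical names for illustration and are not taken from the repository's code.

```python
class Vocab:
    """Hypothetical vocabulary built from training log events only."""

    def __init__(self, train_events, pad_token="<pad>", unk_token="<unk>"):
        # Special tokens come first; the remaining entries are the unique training events.
        self.itos = [pad_token, unk_token] + sorted(set(train_events))
        self.stoi = {token: index for index, token in enumerate(self.itos)}
        self.unk_index = self.stoi[unk_token]

    def encode(self, events):
        # Any event not seen during training falls back to unk_index.
        return [self.stoi.get(event, self.unk_index) for event in events]


train_events = ["E1", "E2", "E3", "E2"]
test_events = ["E2", "E9", "E1", "E7"]  # E9 and E7 never appear in training

vocab = Vocab(train_events)
test_ids = vocab.encode(test_events)
print(test_ids)  # unseen events share the same index as <unk>
```

Because the vocabulary size is fixed before training, an embedding layer sized to `len(vocab.itos)` has no row for `E9` or `E7`; in this sketch they can only share the `<unk>` row.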
So if words in the test set are not recorded in the VOCAB, will they all be mapped to unk_index during testing?