Why was the WordVocab generated using only the training set data?

HelenGuohx / logbert

log anomaly detection via BERT

MIT License

240 stars, 102 forks

Why was the WordVocab generated using only the training set data? #43

Open LINBEIXL opened 8 months ago

LINBEIXL commented 8 months ago

If the words in the test set are not recorded in the vocab, will they all be mapped to unk_index during testing?

YifeiLin0226 commented 4 months ago

This step avoids data leakage. The embedding layer has a fixed size, so even if you added a new log event from the test set to the vocab, its embedding could not be learned during training. These new events are therefore all mapped to the unknown index. This does, of course, cause an out-of-vocabulary (OOV) issue for a parser-based method.
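
To make the mapping concrete, here is a minimal sketch of the idea. It is not logbert's actual WordVocab code: the names `build_vocab`, `encode`, and the `<unk>` token string are hypothetical, and the log keys `E1`, `E2`, `E3`, `E9` are made-up examples. The vocab is built from the training sequences only, and anything unseen at test time falls back to the unknown index.

```python
from collections import Counter

UNK_TOKEN = "<unk>"  # hypothetical name for the unknown token


def build_vocab(train_sequences, min_freq=1):
    """Build a token -> index mapping from training sequences only."""
    counts = Counter(tok for seq in train_sequences for tok in seq)
    vocab = {UNK_TOKEN: 0}  # index 0 is reserved for unseen tokens
    for tok, freq in sorted(counts.items()):
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def encode(sequence, vocab):
    """Map tokens to indices; tokens missing from the vocab become unk."""
    unk_index = vocab[UNK_TOKEN]
    return [vocab.get(tok, unk_index) for tok in sequence]


# Made-up log key sequences: E9 never appears in training.
train = [["E1", "E2", "E3"], ["E2", "E3", "E1"]]
test = [["E1", "E9", "E3"]]

vocab = build_vocab(train)
encoded_test = [encode(seq, vocab) for seq in test]
print(vocab)
print(encoded_test)
```

In this sketch the embedding table would have `len(vocab)` rows, fixed before training starts, so `E9` can only share the row that belongs to the unknown index.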