Closed: ACE07-Sev closed this issue 2 years ago
Hi @ACE07-Sev, the standard approach in NLP for covering unknown words is to add a special token (`<UNK>`) to your vocabulary, to which you assign all words that occur in your corpus/dataset fewer than a certain number of times, e.g. all words that occur fewer than 3 times. You train this token like any other token, and during testing you use it to represent any word that is not included in the vocabulary.
How can I define the `<UNK>` token?
You need to apply a pre-processing step to your data; you don't have to change your forward method. If you are using one of the readers that are not based on syntax, things are really easy, so let's see this case first. Write a script that does the following:

1. Count how many times each word occurs in your data.
2. Replace every word that occurs fewer than a certain number of times (e.g. 3) with the UNK token.
3. Train your model on the pre-processed data as usual.
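A minimal sketch of such a preprocessing script in plain Python (the list-of-token-lists corpus format and the threshold of 3 are my assumptions, not something the library prescribes):

```python
from collections import Counter

UNK = "<UNK>"
MIN_COUNT = 3  # assumption: words seen fewer than 3 times become unknown


def build_vocab(sentences, min_count=MIN_COUNT):
    """Count word occurrences and keep only frequent words, plus <UNK>."""
    counts = Counter(word for sent in sentences for word in sent)
    vocab = {word for word, c in counts.items() if c >= min_count}
    vocab.add(UNK)
    return vocab


def replace_rare(sentences, vocab):
    """Map every out-of-vocabulary word to the <UNK> token."""
    return [[w if w in vocab else UNK for w in sent] for sent in sentences]


# Toy corpus: "dog" and "ran" each occur once, so with min_count=2
# they are absorbed by <UNK>.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
vocab = build_vocab(corpus, min_count=2)
train = replace_rare(corpus, vocab)
# train == [['the', 'cat', 'sat'], ['the', '<UNK>', 'sat'], ['the', 'cat', '<UNK>']]
```

The same `replace_rare` function is reused at evaluation time, which is exactly the replacement step described below.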
For a syntax-based model (discocat, tree-reader) the process is the same, with the complication that you need more than one UNK token, one for each grammatical type. So in Step 1 above, you count how many times each word/type combination occurs in your data, and you create a UNK token specific to each type (one for nouns, one for transitive verbs, etc.).
During evaluation time, if a word is not included in your vocabulary, you replace it with the UNK token.
I've made the function; just a quick question: should I do the occurrence check on the entire dataset (train + validation + test) or just on the training set?
Hi, you count occurrences on all three datasets (train + validation + test) but you train only on the train set.
This will now be closed.
Considering that the token is necessary for never-before-seen entities, how can I implement it in the forward function so that the model can calculate probabilities for instances containing unknown symbols? Based on my understanding and guidance from one of the moderators, I think it's supposed to go in the forward function.
Could you kindly assist me in implementing this?