CQCL / lambeq

A high-level Python library for Quantum Natural Language Processing
https://cqcl.github.io/lambeq-docs
Apache License 2.0

<unk> token feature in the forward() function #16

Closed ACE07-Sev closed 2 years ago

ACE07-Sev commented 2 years ago

Considering the necessity of the <UNK> token for never-before-seen entities, how can I implement it in the forward function so that the model can calculate probabilities for instances that contain unknown symbols? Based on my understanding and guidance from one of the moderators, I think it's supposed to go in the forward function.

Could you kindly assist me in implementing this?

dimkart commented 2 years ago

Hi @ACE07-Sev, the standard approach in NLP for covering unknown words is to add a special token (<UNK>) to your vocabulary, to which you assign all words that occur in your corpus/dataset fewer than a certain number of times, e.g. all words that occur fewer than 3 times. You train this token like any other token, and during testing you use it to represent any word that is not included in the vocabulary.

ACE07-Sev commented 2 years ago

How can I define the token, code-wise? I'm not quite sure what variable type it is. Could you kindly assist me in coding it in the forward function? I know I have to check and assign based on model.symbols, which are all the already-seen symbols, but I'm not quite sure how to do that assignment.

dimkart commented 2 years ago

You need to apply a pre-processing step to your data; you don't have to change your forward method. If you are using one of the readers that are not based on syntax, things are really easy, so let's look at this case first. Write a script that does the following:

  1. Count how many times each word occurs in your corpus/dataset.
  2. Replace each word that occurs fewer than a threshold (e.g. 3) times with UNK, creating a new version of your dataset. (You might need some trial and error to find the right threshold for your data.)
  3. Use this new version to train your model as usual.
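The three steps above can be sketched in plain Python. This is a minimal illustration, not lambeq API; the whitespace tokenisation, the `MIN_COUNT` threshold, and the example corpus are all assumptions:

```python
from collections import Counter

UNK = "<UNK>"
MIN_COUNT = 3  # illustrative threshold; tune it for your data

def replace_rare_words(sentences, min_count=MIN_COUNT):
    """Replace words occurring fewer than `min_count` times with <UNK>.

    `sentences` is a list of tokenised sentences (lists of words).
    Returns the rewritten sentences and the resulting vocabulary.
    """
    # Step 1: count how often each word occurs in the corpus.
    counts = Counter(word for sent in sentences for word in sent)

    # Step 2: rewrite the corpus, mapping rare words to the UNK token.
    new_sentences = [
        [word if counts[word] >= min_count else UNK for word in sent]
        for sent in sentences
    ]

    # Step 3 happens outside this function: train on `new_sentences` as usual.
    vocab = {word for sent in new_sentences for word in sent}
    return new_sentences, vocab

corpus = [
    ["alice", "likes", "bob"],
    ["alice", "likes", "carol"],
    ["alice", "likes", "dave"],
]
new_corpus, vocab = replace_rare_words(corpus)
# "alice" and "likes" occur 3 times and are kept;
# "bob", "carol" and "dave" occur once each and become <UNK>.
```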

For a syntax-based model (discocat, tree reader) the process is the same, with the complication that you need more than one UNK token, one for each grammatical type. So in Step 1 above, you count how many times each word/type combination occurs in your data, and you create an UNK token specific to each type (one for nouns, one for transitive verbs, etc.).
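The type-aware variant might look as follows. Again a sketch only: the `(word, type)` pair representation, the `"n"`/`"tv"` type labels, and the `<UNK-...>` naming scheme are assumptions for illustration, not lambeq conventions:

```python
from collections import Counter

MIN_COUNT = 3  # illustrative threshold

def replace_rare_typed(sentences, min_count=MIN_COUNT):
    """Replace rare (word, type) pairs with a type-specific UNK token.

    `sentences` is a list of sentences, each a list of (word, type) pairs,
    e.g. [("alice", "n"), ("likes", "tv"), ("bob", "n")].
    """
    # Count each word/type combination, not each word alone.
    counts = Counter(pair for sent in sentences for pair in sent)

    # Rare pairs become a UNK token specific to their grammatical type,
    # e.g. <UNK-n> for nouns, <UNK-tv> for transitive verbs.
    return [
        [(word, t) if counts[(word, t)] >= min_count else (f"<UNK-{t}>", t)
         for (word, t) in sent]
        for sent in sentences
    ]

corpus = [
    [("alice", "n"), ("likes", "tv"), ("bob", "n")],
    [("alice", "n"), ("likes", "tv"), ("carol", "n")],
    [("alice", "n"), ("likes", "tv"), ("dave", "n")],
]
typed = replace_rare_typed(corpus)
# The rare nouns all map to the noun-specific token <UNK-n>.
```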

At evaluation time, if a word is not included in your vocabulary, you replace it with the UNK token.
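The evaluation-time replacement is a one-liner. A sketch, assuming a `vocab` set built during the preprocessing step above:

```python
def to_known(sentence, vocab, unk="<UNK>"):
    """Map out-of-vocabulary words to the UNK token before evaluation."""
    return [word if word in vocab else unk for word in sentence]

vocab = {"alice", "likes", "<UNK>"}
mapped = to_known(["alice", "likes", "eve"], vocab)
# "eve" was never seen during training, so it is mapped to <UNK>.
```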

ACE07-Sev commented 2 years ago

I've made the function, just a quick question. Should I do the occurrence check on the entire dataset (train + validation + test) or just on training?

dimkart commented 2 years ago

Hi, you count occurrences on all three datasets (train + validation + test) but you train only on the train set.

dimkart commented 2 years ago

This will now be closed.