Extend dataset to include Brown/Semcor Corpus - Githubissues

danielbis / word2sense

LSTM for creating contextualized word2vec vectors, trained to minimize the distance between synonyms on a WordNet-tagged corpus

Extend dataset to include Brown/Semcor Corpus #12

Open danielbis opened 5 years ago

danielbis commented 5 years ago

We want to be able to train the model on a larger corpus, so SemCor should be added to the training data.

Potential issues:

The OntoNotes dataset uses onto_sense, which has more coarse-grained sense definitions than WordNet. Because of that, extending the sense2id and sense --> related (sense2related) mappings needs to be done carefully.
The same WordNet sense may be linked to two different OntoNotes senses, and vice versa. In short, there is a many-to-many relationship.
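
To make the many-to-many relationship concrete, here is a minimal sketch that indexes sense links in both directions and flags WordNet senses that map to more than one OntoNotes sense. The sense tags, the `links` list, and the names `onto2wn`, `wn2onto`, and `ambiguous_wn` are all hypothetical; the real inventory format in this repo may differ.

```python
from collections import defaultdict

# Hypothetical (onto_sense, wordnet_sense) link pairs; these tags are made up
# for illustration and are not taken from OntoNotes or WordNet.
links = [
    ("bank.n.onto1", "bank.n.wn1"),
    ("bank.n.onto1", "bank.n.wn2"),
    ("bank.n.onto2", "bank.n.wn2"),
]

# Index the links in both directions so neither side is assumed to be unique.
onto2wn = defaultdict(set)
wn2onto = defaultdict(set)
for onto_sense, wordnet_sense in links:
    onto2wn[onto_sense].add(wordnet_sense)
    wn2onto[wordnet_sense].add(onto_sense)

# WordNet senses linked to more than one OntoNotes sense need manual care
# when the mappings are extended.
ambiguous_wn = {wn for wn, ontos in wn2onto.items() if len(ontos) > 1}
```

A check like this could be run before merging, so that ambiguous senses are reviewed instead of being silently assigned to one id.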

Solutions:

Extend the sense2id and sense2related mappings by appending the new wordnet_sense keys to the appropriate ids and related words. Note that WordNet sense tags may be the same as our converted on_sense tags, so existing keys must not be overwritten.
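
One possible shape for this extension, as a sketch only: it assumes sense2id is a dict from sense tag to integer id and sense2related is a dict from sense tag to a set of related words, which may not match the actual data structures in this repo. The function name `add_wordnet_senses` and the example tags are hypothetical. Keys that already exist (for example, a WordNet tag that equals a converted on_sense tag) keep their id and only gain related words.

```python
def add_wordnet_senses(sense2id, sense2related, new_senses):
    """Merge new WordNet senses into the existing mappings in place.

    new_senses: dict mapping a wordnet_sense tag to an iterable of related words.
    Assumes sense2id maps tag -> int and sense2related maps tag -> set of str.
    """
    # New ids continue after the largest id already in use.
    next_id = max(sense2id.values(), default=-1) + 1
    for sense, related in new_senses.items():
        if sense not in sense2id:
            sense2id[sense] = next_id
            next_id += 1
        # Existing keys keep their id; related words are merged, not replaced.
        sense2related.setdefault(sense, set()).update(related)
    return sense2id, sense2related


# Hypothetical example data.
sense2id = {"bank%1": 0, "run%2": 1}
sense2related = {"bank%1": {"shore"}, "run%2": {"sprint"}}
new_senses = {"bank%1": {"riverbank"}, "dog%1": {"hound"}}

add_wordnet_senses(sense2id, sense2related, new_senses)
```

In this example "bank%1" collides with an existing key, so it keeps id 0 and its related words become the union of both sets, while "dog%1" receives the next free id.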