Living-with-machines / TargetedSenseDisambiguation

Repository for the work on Targeted Sense Disambiguation

prepare Huang et al. (2019) #74

Open BarbaraMcG opened 3 years ago

BarbaraMcG commented 3 years ago

GlossBERT: BERT for Word Sense Disambiguation with Gloss Knowledge. Luyao Huang, Chi Sun, Xipeng Qiu, Xuanjing Huang.

https://www.aclweb.org/anthology/D19-1355.pdf

  1. What is this paper about?
  2. Is it relevant to our project? If so, why and how?
  3. What could we use from this work in our project?
  4. Add some text about it to Overleaf
  5. Plan experiments (if appropriate)
mcollardanuy commented 3 years ago

What is the paper about?

The authors propose a new neural approach to Word Sense Disambiguation (WSD) that leverages gloss information (i.e. the definition of a sense) from WordNet. They treat WSD as a sentence-pair classification problem using BERT.

They build context-gloss sentence pairs in the input format required by BERT: the first sentence is the context (i.e. the sentence where the target word occurs) and the second sentence is the gloss of a specific WordNet sense of the target lemma (in the following example, the lemma is "long" and the gloss is "desire strongly or persistently"):

[CLS] How long has it been since you reviewed the objectives of your benefit and service program ? [SEP] desire strongly or persistently [SEP]
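
As an illustration (this is a minimal sketch, not the authors' released code), such context-gloss pairs could be generated from WordNet with NLTK:

```python
# Minimal sketch: one (context, gloss) pair per candidate WordNet sense of the
# target lemma. Requires nltk and a downloaded WordNet (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def context_gloss_pairs(context, target_lemma, pos=None):
    """Return one (context, gloss, synset) triple per candidate sense."""
    return [(context, syn.definition(), syn) for syn in wn.synsets(target_lemma, pos=pos)]

context = ("How long has it been since you reviewed the objectives "
           "of your benefit and service program ?")
for ctx, gloss, syn in context_gloss_pairs(context, "long", pos=wn.VERB):
    print(syn.name(), "->", gloss)  # e.g. long.v.01 -> desire strongly or persistently
```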

They train three different BERT models (see below). For each target word in a context, there are as many context-gloss training instances as there are candidate glosses for that word, each labelled as a positive or negative match (in the example above, the label would be negative). At test time, they output the probability of the positive label for each candidate gloss and choose the sense with the highest probability.
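
In practice, testing amounts to scoring every candidate gloss against the context and taking the argmax. A hedged sketch with the Hugging Face transformers API, assuming a BERT sentence-pair classifier has already been fine-tuned on context-gloss pairs (the checkpoint path is hypothetical, and label index 1 is assumed to be the positive "match" class):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("path/to/finetuned-glossbert")  # hypothetical path
model.eval()

def disambiguate(context, glosses):
    """Score each (context, gloss) pair and return the index of the best gloss."""
    probs = []
    for gloss in glosses:
        inputs = tokenizer(context, gloss, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        # probability of the positive ("gloss matches context") label
        probs.append(torch.softmax(logits, dim=-1)[0, 1].item())
    return max(range(len(glosses)), key=probs.__getitem__)
```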

They experiment with three BERT models:

  1. GlossBERT (Token-CLS): classification uses the final hidden states of the target word's tokens in the context-gloss pair.
  2. GlossBERT (Sent-CLS): classification uses the final hidden state of the [CLS] token of the context-gloss pair.
  3. GlossBERT (Sent-CLS-WS): like Sent-CLS, but with weak supervision: the target word is highlighted with quotation marks in the context and prepended to the gloss.

The authors note that adding this weak supervision yields the best performance, probably because it combines the advantages of the other two methods.
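
Based on the paper's description (again a sketch, not the released code), the weakly supervised input pair looks roughly like this:

```python
# Approximate sketch of the Sent-CLS-WS input format: the target word is
# highlighted with quotation marks in the context and prepended to the gloss.
def weak_supervision_pair(tokens, target_index, gloss):
    target = tokens[target_index]
    context = tokens[:target_index] + ['"', target, '"'] + tokens[target_index + 1:]
    return " ".join(context), f"{target} : {gloss}"

ctx, gl = weak_supervision_pair(
    "How long has it been since you reviewed the objectives".split(),
    1,
    "desire strongly or persistently",
)
# ctx -> 'How " long " has it been since you reviewed the objectives'
# gl  -> 'long : desire strongly or persistently'
```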

Is it relevant to our project? If so, why and how? What could we use from this work in our project?

It could serve as a baseline, or we could even build our final method on top of it if we managed to add time as a feature and make it diachronic. The code seems clean and well documented (https://github.com/HSLCY/GlossBERT), and the approach itself is quite clean, simple and intuitive. They say that "it is quite expensive to train the model", but don't quantify the claim.
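
Purely as a sketch of what "adding time as a feature" could look like (our idea, not something from the paper), one simple option would be to prepend a coarse period marker to the context before building the pairs:

```python
# Illustrative only: prepend a period marker to the context sentence. If the
# marker is treated as a special token (e.g. [1850-1899]), it would also need
# to be added to the tokenizer vocabulary.
def add_period_marker(context, year, bin_size=50):
    start = (year // bin_size) * bin_size
    return f"[{start}-{start + bin_size - 1}] {context}"

print(add_period_marker("How long has it been since you reviewed ...", 1887))
# -> '[1850-1899] How long has it been since you reviewed ...'
```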

Add some text about it to Overleaf

Plan experiments (if appropriate)