Living-with-machines / TargetedSenseDisambiguation

Repository for the work on Targeted Sense Disambiguation

Implement sense embeddings supervised baseline from Hu et al #41

Open BarbaraMcG opened 4 years ago

BarbaraMcG commented 4 years ago

Hu et al. (2019)'s paper is summarised in #35. Their code is here.

@kasparvonbeelen started implementing their method in #18

Currently needs to wait for #46

fedenanni commented 4 years ago

@BarbaraMcG @kasra-hosseini @kasparvonbeelen if you would like, we could speak about this early next week (right after we have done a small PR on the training and test splits)

kasra-hosseini commented 4 years ago

Sounds good. We can discuss this in our co-working session on Monday?

fedenanni commented 4 years ago

Perfect!

BarbaraMcG commented 4 years ago

I have time to go over this on Tuesday between the reading group and our catch-up, so between 10.45 and 11.30. If you're free, let me know!

BarbaraMcG commented 4 years ago

A few thoughts about how we could implement this method in our case. Some ideas for implementation steps:

  1. Data collection: collect quotations, their dates, and the senses from the OED. Possibly cluster the senses into groups. Start with the target word machine.
  2. Retrieve the pre-trained BERT model by Devlin et al. (2018), specifically the uncased BERT-Base model with 12 layers, 768 hidden units, 12 heads and 110M parameters. It is trained on BookCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words) with the Masked LM and Next Sentence Prediction tasks.
  3. Build sense embeddings: from Hu et al.: "After feeding the sentences containing a target word with a specific sense, its token representations can be generated from the hidden layers of the pre-trained model. We only keep the token representations of the final hidden layer of the Transformer. After obtaining the token embeddings of the target word for the specific sense, we can represent the sense as a 768-dimensional embedding by averaging the token embeddings." (See the first sketch after this list for steps 2 and 3.)
  4. Try with different time intervals (5 years, 10 years, 20 years, ...) and build diachronic sense embeddings for each period (if we have enough data). This can be done to compare Hu et al.'s sense embeddings with a diachronic version of them.
  5. Tag a new sentence containing machine from the corpus with the "correct" sense in context: get the token's contextual embedding and calculate the cosine similarity between it and each of the word sense embeddings. The sense with the highest similarity score is the "correct" sense (see the tagging sketch after this list).
  6. Track the distributions of senses over time. Take a preprocessed and POS-tagged diachronic corpus, for example the BL Books, but also the COHA corpus for comparison with Hu et al. Retrieve the token embeddings of machine (and other words). Using the sense embeddings built in step 4, tag the sense for each token. Hu et al. follow Tang et al. (2016) to decompose the time series of the sense probabilities (see page 3902 of their article) into a trend component and random noise (see the decomposition sketch after this list).
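
For steps 2 and 3, a minimal sketch using the Hugging Face transformers library. The model name matches the paper; the helper names are illustrative, not project code:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def target_token_embedding(sentence, target):
    """Average the final-hidden-layer vectors of the target's word-pieces."""
    enc = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    # Assumes the (lower-cased) target is a single word-piece, which holds
    # for "machine" in the uncased vocabulary; split words need extra care.
    idx = [i for i, tok in enumerate(tokens) if tok == target]
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
    return hidden[idx].mean(dim=0)  # 768-dimensional token embedding

def sense_embedding(quotations, target):
    """Step 3: one 768-d vector per sense, averaged over its quotations."""
    vecs = [target_token_embedding(q, target) for q in quotations]
    return torch.stack(vecs).mean(dim=0)
```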
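
Step 5 then reduces to a nearest-sense lookup by cosine similarity; a hypothetical sketch building on the helpers above:

```python
import torch.nn.functional as F

def tag_sense(sentence, target, sense_embeddings):
    """Return the sense id whose embedding is closest to the token in context."""
    token_vec = target_token_embedding(sentence, target)
    sims = {sense: F.cosine_similarity(token_vec, vec, dim=0).item()
            for sense, vec in sense_embeddings.items()}
    return max(sims, key=sims.get)  # highest cosine similarity wins
```

For example, `tag_sense("the loom is a machine for weaving", "machine", senses)` would return the key of the closest sense embedding in `senses`.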
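
For the decomposition in step 6, a simple moving-average trend extraction could serve as a stand-in; note this is not the actual Tang et al. (2016) method, only an illustration of separating trend from noise:

```python
import numpy as np

def decompose(series, window=5):
    """Split a per-period sense-probability series into trend and noise."""
    series = np.asarray(series, dtype=float)
    kernel = np.ones(window) / window
    trend = np.convolve(series, kernel, mode="same")  # centred moving average
    return trend, series - trend  # noise = observed minus trend
```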

@kasparvonbeelen, @kasra-hosseini and @fedenanni, I hope this helps! Let me know if you'd like to discuss this further.

kasparvonbeelen commented 3 years ago

@BarbaraMcG, thanks for this! Having had a closer look at the OED, I see the following issues:

  • Data sparsity is a real issue: lemmas have many senses with only a few quotations. The number of quotations is, in general, too small to build diachronic sense embeddings (on average 2-3 quotations per sense).
  • Also, we don't know whether a selected quotation is representative of a specific time period or an outlier. Do the quotations follow the distribution of senses in the "real world" (i.e. if we have a quotation for a specific year, does that mean the sense is more prevalent in that period)?

To tackle the first point, we could propose diachronically aware concept embeddings, creating a vector representation for a set of senses (e.g. the machine senses and their synonyms). This would probably give us enough quotations to create concept embeddings that can change with time. The question then becomes: how to create/train these embeddings? (A hedged sketch follows below.)

However, the second point is more fundamental (are quotations representative of their time period, or merely an attestation that the sense "existed"?), and I wouldn't have an answer to that.

I propose that, for now, we could at least do the following experiment:

  • Firstly: implement the method of Hu et al. and assess whether our historical BERT improves the sense disambiguation task.
  • Secondly: does historical BERT yield better representations? I.e. is the cosine similarity between a query word vector and the correct sense embedding higher compared to contemporary BERT?

This would answer the question of whether a time-sensitive language model works better for historical data.
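
One possible reading of the concept-embedding idea, as a hedged sketch: pool quotations across the whole sense set, bucketed by period, so each concept gets one vector per time interval (reusing the illustrative target_token_embedding helper sketched earlier in this thread):

```python
from collections import defaultdict
import torch

def concept_embeddings(quotations, period=20):
    """quotations: (sentence, target, year) tuples for every sense in the set."""
    buckets = defaultdict(list)
    for sentence, target, year in quotations:
        start = (year // period) * period  # e.g. 1840, 1860, 1880, ...
        buckets[start].append(target_token_embedding(sentence, target))
    # one concept vector per time interval, averaged over all pooled senses
    return {start: torch.stack(vecs).mean(dim=0)
            for start, vecs in buckets.items()}
```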

BarbaraMcG commented 3 years ago


Nice! Regarding the second point (about the representativeness of the OED quotations), I think it would be good to get an empirical answer by addressing the tasks themselves and seeing whether the historical embeddings help them or not.

BarbaraMcG commented 3 years ago

A synchronic version of this is implemented in #44

kasparvonbeelen commented 3 years ago

@BarbaraMcG To clarify my idea: I don't think we can build diachronic sense embeddings with OED data, but we can make the disambiguation task sensitive to time.

E.g. given a query in the form of a quotation (with a target word we want to disambiguate) and a year

Additionally, we could automatically extend the number of labelled quotations with label propagation: e.g. given a corpus with time stamps, take the occurrences of the word "machine" that "look like" those observed in the quotations and add them to the quotations used for disambiguation. These labels will be imperfect, but could nonetheless help. (An illustrative sketch follows.)
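
A hedged illustration of the propagation idea; the threshold and helper names are assumptions, not project code:

```python
import torch.nn.functional as F

def propagate_labels(corpus_sentences, target, sense_embeddings, threshold=0.8):
    """Keep only corpus occurrences that closely match an existing sense."""
    extra = []
    for sentence in corpus_sentences:
        vec = target_token_embedding(sentence, target)
        sims = {s: F.cosine_similarity(vec, v, dim=0).item()
                for s, v in sense_embeddings.items()}
        best = max(sims, key=sims.get)
        if sims[best] >= threshold:  # "looks like" an attested quotation
            extra.append((sentence, best))  # imperfect but extra signal
    return extra
```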