Living-with-machines / TargetedSenseDisambiguation

Repository for the work on Targeted Sense Disambiguation

Implement sense embeddings supervised baseline from Hu et al #41

Open BarbaraMcG opened 4 years ago

BarbaraMcG commented 4 years ago

Hu et al. (2019)'s paper is summarised in #35. Their code is here.

@kasparvonbeelen started implementing their method in #18

Currently needs to wait for #46

fedenanni commented 4 years ago

@BarbaraMcG @kasra-hosseini @kasparvonbeelen if you would like, we could speak about this early next week (right after we have done a small PR on the training and test splits)

kasra-hosseini commented 4 years ago

Sounds good. We can discuss this in our co-working session on Monday?

fedenanni commented 4 years ago

Perfect!

BarbaraMcG commented 4 years ago

I have time to go over this on Tuesday between the reading group and our catch-up, so between 10.45 and 11.30. If you're free, let me know!

BarbaraMcG commented 4 years ago

A few thoughts about how we could implement this method in our case. Some ideas for implementation steps:

  1. Data collection: collect quotations, their dates, and the senses from the OED. Possibly cluster the senses into groups. Start with the target word machine.
  2. Retrieve the pre-trained BERT model by Devlin et al. (2018), specifically the uncased BERT-Base model with 12 layers, 768 hidden units, 12 heads and 110M parameters. It is trained on BookCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words) with the Masked LM and Next Sentence Prediction tasks.
  3. Build sense embeddings: from Hu et al.: "After feeding the sentences containing a target word with a specific sense, its token representations can be generated from the hidden layers of the pre-trained model. We only keep the token representations of the final hidden layer of the Transformer. After obtaining the token embeddings of the target word for the specific sense, we can represent the sense as a 768-dimensional embedding by averaging the token embeddings." (See the first sketch after this list for steps 2 and 3.)
  4. Try with different time intervals (5 years, 10 years, 20 years, ...) and build diachronic sense embeddings for each period (if we have enough data). This can be done to compare Hu et al.'s sense embeddings with a diachronic version of them.
  5. Tag a new sentence containing machine from the corpus with the "correct" sense in context: get the token's contextual embedding and calculate the cosine similarity between it and each of the word sense embeddings. The sense with the highest similarity score is the "correct" sense (see the tagging sketch after this list).
  6. Track the distributions of senses over time. Take a preprocessed and POS-tagged diachronic corpus, for example the BL Books, but also the COHA corpus for comparison with Hu et al. Retrieve the token embeddings of machine (and other words). Using the sense embeddings built in step 4, tag the sense for each token. Hu et al. follow Tang et al. (2016) to decompose the time series of the sense probabilities (see page 3902 of their article) into a trend component and random noise (see the decomposition sketch after this list).
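
For steps 2 and 3, a minimal sketch using the Hugging Face transformers library. The model name matches the paper; the helper names are illustrative, not project code:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def target_token_embedding(sentence, target):
    """Average the final-hidden-layer vectors of the target's word-pieces."""
    enc = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    # Assumes the (lower-cased) target is a single word-piece, which holds
    # for "machine" in the uncased vocabulary; split words need extra care.
    idx = [i for i, tok in enumerate(tokens) if tok == target]
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
    return hidden[idx].mean(dim=0)  # 768-dimensional token embedding

def sense_embedding(quotations, target):
    """Step 3: one 768-d vector per sense, averaged over its quotations."""
    vecs = [target_token_embedding(q, target) for q in quotations]
    return torch.stack(vecs).mean(dim=0)
```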
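
Step 5 then reduces to a nearest-sense lookup by cosine similarity; a hypothetical sketch building on the helpers above:

```python
import torch.nn.functional as F

def tag_sense(sentence, target, sense_embeddings):
    """Return the sense id whose embedding is closest to the token in context."""
    token_vec = target_token_embedding(sentence, target)
    sims = {sense: F.cosine_similarity(token_vec, vec, dim=0).item()
            for sense, vec in sense_embeddings.items()}
    return max(sims, key=sims.get)  # highest cosine similarity wins
```

For example, `tag_sense("the loom is a machine for weaving", "machine", senses)` would return the key of the closest sense embedding in `senses`.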
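
For the decomposition in step 6, a simple moving-average trend extraction could serve as a stand-in; note this is not the actual Tang et al. (2016) method, only an illustration of separating trend from noise:

```python
import numpy as np

def decompose(series, window=5):
    """Split a per-period sense-probability series into trend and noise."""
    series = np.asarray(series, dtype=float)
    kernel = np.ones(window) / window
    trend = np.convolve(series, kernel, mode="same")  # centred moving average
    return trend, series - trend  # noise = observed minus trend
```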

@kasparvonbeelen, @kasra-hosseini and @fedenanni, I hope this helps! Let me know if you'd like to discuss this further.

kasparvonbeelen commented 3 years ago

@BarbaraMcG, thanks for this! Having had a closer look at the OED, I see the following issues:

  • Data sparsity is a real issue: lemmas have many senses with only a few quotations. The number of quotations is, in general, too small to build diachronic sense embeddings (on average 2-3 quotations per sense).
  • Also, we don't know whether a selected quotation is representative of a specific time period or an outlier. Do the quotations follow the distribution of senses in the "real world" (i.e. if we have a quotation for a specific year, does that mean the sense is more prevalent in that period)?

To tackle the first point, we could propose diachronically aware concept embeddings, creating a vector representation for a set of senses (e.g. the machine senses and their synonyms). This would probably give us enough quotations to create concept embeddings that can change with time. The question then becomes: how to create/train these embeddings? (A hedged sketch follows below.)

However, the second point is more fundamental (are quotations representative of their time period, or merely an attestation that the sense "existed"?), and I wouldn't have an answer to that.

I propose that, for now, we could at least do the following experiment:

  • Firstly: implement the method of Hu et al. and assess whether our historical BERT improves the sense disambiguation task.
  • Secondly: does historical BERT yield better representations? I.e. is the cosine similarity between a query word vector and the correct sense embedding higher compared to contemporary BERT?

This would answer the question of whether a time-sensitive language model works better for historical data.
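
One possible reading of the concept-embedding idea, as a hedged sketch: pool quotations across the whole sense set, bucketed by period, so each concept gets one vector per time interval (reusing the illustrative target_token_embedding helper sketched earlier in this thread):

```python
from collections import defaultdict
import torch

def concept_embeddings(quotations, period=20):
    """quotations: (sentence, target, year) tuples for every sense in the set."""
    buckets = defaultdict(list)
    for sentence, target, year in quotations:
        start = (year // period) * period  # e.g. 1840, 1860, 1880, ...
        buckets[start].append(target_token_embedding(sentence, target))
    # one concept vector per time interval, averaged over all pooled senses
    return {start: torch.stack(vecs).mean(dim=0)
            for start, vecs in buckets.items()}
```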

BarbaraMcG commented 3 years ago


Nice! Regarding the second point (about the representativeness of the OED quotations), I think it would be good to get an empirical answer by addressing the tasks themselves and seeing whether the historical embeddings help them or not.

BarbaraMcG commented 3 years ago

A synchronic version of this is implemented in #44

kasparvonbeelen commented 3 years ago

@BarbaraMcG To clarify my idea: I don't think we can build diachronic sense embeddings with OED data, but we can make the disambiguation task sensitive to time.

E.g. given a query in the form of a quotation (with a target word we want to disambiguate) and a year

Additionally, we could automatically extend the number of labelled quotations with label propagation: e.g. given a corpus with time stamps, take the occurrences of the word "machine" that "look like" those observed in the quotations and add them to the quotations used for disambiguation. These labels will be imperfect, but could nonetheless help. (An illustrative sketch follows.)
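
A hedged illustration of the propagation idea; the threshold and helper names are assumptions, not project code:

```python
import torch.nn.functional as F

def propagate_labels(corpus_sentences, target, sense_embeddings, threshold=0.8):
    """Keep only corpus occurrences that closely match an existing sense."""
    extra = []
    for sentence in corpus_sentences:
        vec = target_token_embedding(sentence, target)
        sims = {s: F.cosine_similarity(vec, v, dim=0).item()
                for s, v in sense_embeddings.items()}
        best = max(sims, key=sims.get)
        if sims[best] >= threshold:  # "looks like" an attested quotation
            extra.append((sentence, best))  # imperfect but extra signal
    return extra
```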