BarbaraMcG opened this issue 4 years ago
@BarbaraMcG @kasra-hosseini @kasparvonbeelen if you would like, we could speak about this early next week (right after we have done a small PR on training and test splits)
Sounds good. We can discuss this in our co-working session on Monday?
Perfect!
I have time to go over this on Tuesday between the reading group and our catch-up, so between 10.45 and 11.30. If you're free, let me know!
A few thoughts about how we could go about implementing this method in our case:
Some ideas for implementation steps:
@kasparvonbeelen, @kasra-hosseini and @fedenanni, I hope this helps! Let me know if you'd like to discuss this further.
@BarbaraMcG, thanks for this! Having had a closer look at the OED, I see the following issues:
- Data sparsity is a real issue: lemmas have many senses, each with only a few quotations. The number of quotations is, in general, too small to build diachronic sense-embeddings (on average only 2-3 quotations per sense).
- Also, we don't know whether a selected quotation is representative of a specific time period or an outlier. Do the quotations follow the distribution of senses in the "real world" (i.e. if we have a quotation for a specific year, does that mean the sense is more prevalent in that period)?
To tackle the first point, we could propose diachronically aware concept-embeddings: creating vector representations for a set of senses (e.g. the machine-senses and their synonyms). This would probably give us enough quotations to create concept embeddings that can change with time. The question then becomes: how do we create/train these embeddings?
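A minimal sketch of how such period-specific concept embeddings could be built, assuming an off-the-shelf BERT from HuggingFace and a list of `(year, quotation)` pairs covering all the senses in the concept; the mean-pooling and the 50-year bins are illustrative assumptions, not a settled design:

```python
import torch
from collections import defaultdict
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(quotation):
    """Mean-pooled last-hidden-state vector for one quotation."""
    inputs = tokenizer(quotation, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

def concept_embeddings_by_period(quotations, bin_size=50):
    """One concept vector per time bin, pooled over every quotation of
    every sense in the concept (e.g. machine-senses plus synonyms)."""
    bins = defaultdict(list)
    for year, quotation in quotations:
        bins[(year // bin_size) * bin_size].append(embed(quotation))
    return {start: torch.stack(vecs).mean(dim=0)
            for start, vecs in bins.items()}
```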
However, the second point is more fundamental (are quotations representative of their time period, or merely an attestation that the sense "existed"?), and I don't have an answer to that.
I propose that for now, we could at least do the following experiment:
- Firstly: implement the method of Hu et al. and assess whether our historical BERT improves the sense disambiguation task.
- Secondly: does historical BERT yield better representations? I.e. is the cosine similarity between a query word vector and the correct sense-embedding higher with historical BERT than with contemporary BERT?
This would answer the question of whether a time-sensitive language model works better for historical data.
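As a sketch of what the second check could look like: score one (quotation, target word, sense vector) triple under a given model, then run the same scoring with the historical and the contemporary checkpoints and compare. The `embed_target` helper and its subword matching are assumptions about how the query vector would be extracted:

```python
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

def embed_target(model, tokenizer, quotation, target):
    """Contextual vector for `target`, averaged over its word-piece tokens."""
    inputs = tokenizer(quotation, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state.squeeze(0)
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]
    ids = inputs["input_ids"].squeeze(0).tolist()
    for i in range(len(ids) - len(target_ids) + 1):
        if ids[i:i + len(target_ids)] == target_ids:
            return hidden[i:i + len(target_ids)].mean(dim=0)
    raise ValueError(f"{target!r} not found in quotation")

def sense_similarity(model, tokenizer, quotation, target, sense_vec):
    """Cosine between the query word vector and one sense embedding."""
    query = embed_target(model, tokenizer, quotation, target)
    return F.cosine_similarity(query, sense_vec, dim=0).item()

# Hypothetical comparison: a higher score for the *correct* sense under
# the historical model would support the hypothesis.
# sense_similarity(historical_bert, tok, q, "machine", correct_sense_vec)
# sense_similarity(contemporary_bert, tok, q, "machine", correct_sense_vec)
```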
Nice! Regarding the second point (about the representativeness of the OED quotations), I think it would be good to get an empirical answer by addressing the tasks themselves and seeing whether the historical embeddings help or not
A synchronic version of this is implemented in #44
@BarbaraMcG To clarify my idea: I don't think we can make diachronic sense embeddings with OED data, but we can make the disambiguation task sensitive to time.
E.g. given a query in the form of a quotation (with a target word we want to disambiguate) and a year, we could take that year into account when ranking the candidate senses.
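One hedged way the year could enter the ranking, assuming each candidate sense carries its first and last OED attestation years; the `Sense` container, the `tolerance` window and the fallback are all illustrative choices:

```python
from dataclasses import dataclass
import torch
import torch.nn.functional as F

@dataclass
class Sense:
    label: str
    vector: torch.Tensor  # embedding built from the sense's quotations
    start: int            # first attestation year
    end: int              # last attestation year

def disambiguate(query_vec, year, senses, tolerance=25):
    """Rank senses by cosine similarity, keeping only senses attested
    within `tolerance` years of the query's year."""
    candidates = [s for s in senses
                  if s.start - tolerance <= year <= s.end + tolerance]
    if not candidates:
        candidates = senses  # fall back to the full sense inventory
    return max(candidates,
               key=lambda s: F.cosine_similarity(query_vec, s.vector, dim=0).item())
```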
Additionally, we could automatically extend the number of labelled quotations with label propagation: e.g. given a corpus with time-stamps, take the occurrences of the word "machine" that "look like" those observed in the quotations and add them to the quotations used for disambiguation. These labels will be imperfect, but could nonetheless help.
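A rough sketch of that label-propagation step: embed the unlabelled corpus occurrences of "machine", and copy the label of the nearest labelled quotation whenever the similarity clears a threshold. The 0.8 threshold is an arbitrary assumption and would need tuning:

```python
import torch
import torch.nn.functional as F

def propagate_labels(corpus_vecs, quotation_vecs, quotation_labels, threshold=0.8):
    """Return (vector, sense label) pairs for corpus occurrences that
    closely match an already-labelled OED quotation."""
    quotation_matrix = torch.stack(quotation_vecs)  # (n_quotations, dim)
    propagated = []
    for vec in corpus_vecs:
        sims = F.cosine_similarity(vec.unsqueeze(0), quotation_matrix, dim=1)
        best = int(sims.argmax())
        if sims[best] >= threshold:  # imperfect labels, but may help as extra data
            propagated.append((vec, quotation_labels[best]))
    return propagated
```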
Hu et al. (2019)'s paper is summarised in #35. Their code is here.
@kasparvonbeelen started implementing their method in #18
Currently needs to wait for #46