Living-with-machines / TargetedSenseDisambiguation

Repository for the work on Targeted Sense Disambiguation

Design evaluation setting and metrics #7

Open fedenanni opened 3 years ago

fedenanni commented 3 years ago

Our paper focuses on two word sense disambiguation (WSD) tasks:

Goal: to define a clear evaluation setting (train/test splitting) and evaluation metrics

TLDR (main points from the discussion below - to be updated):

Currently blocked by #2

BarbaraMcG commented 3 years ago

We need to decide on the proportions for the training/test/validation splits, on the sampling strategy, and on how to avoid bias

BarbaraMcG commented 3 years ago

@fedenanni , @mcollardanuy and @GiorgiatolfoBL will take this forward

fedenanni commented 3 years ago

@kasparvonbeelen @kasra-hosseini Tagging you just to be sure we are all in the loop. I think the main question is if we want to have two evaluation settings:

  1. only for "machine" mentions (I am using here machine just as an example)
  2. including both "machine" and "descendants" (for instance "locomotive" etc)

Or a single one.

If we have a single one, we need to consider:

  a. a balanced distribution of senses in train and test;
  b. if we have a descendant with only one label, it should appear either in train or in test, not both.
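
A minimal sketch of these two constraints, assuming a pandas DataFrame `df` with hypothetical columns `lemma`, `sense_id` and `sentence` (none of these names come from our actual data yet):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_quotations(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    # (b) descendants attested with a single sense label cannot be stratified,
    # so each of them goes wholly to one side of the split (here: train).
    n_senses = df.groupby("lemma")["sense_id"].transform("nunique")
    single, rest = df[n_senses == 1], df[n_senses > 1]

    # (a) stratify the remaining quotations on the sense label, so the
    # distribution of senses is (roughly) balanced across train and test.
    train, test = train_test_split(
        rest, test_size=test_size, stratify=rest["sense_id"], random_state=seed
    )
    return pd.concat([train, single]), test
```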

Feel free to add other things - just sketching it

fedenanni commented 3 years ago

Starting points for literature overview: *Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison* from the Navigli crew - I'll check it out later today

fedenanni commented 3 years ago

I just spoke with Stefano Faralli, who was in Navigli's group (and then a postdoc at DWS in Mannheim) - he suggests looking into the literature around:

kasparvonbeelen commented 3 years ago

@fedenanni , thanks for this! Some thoughts

fedenanni commented 3 years ago

@kasparvonbeelen super good point - I think if we want to go with ACL, for instance, having a broad historical WSD evaluation (for instance on the top 10k nouns, considering lexicon expansion for them and diachronic information) + a specific case study on machine (with an associated crowdsourcing task on newspapers) would make a really well-rounded story

The broad evaluation would show how generalizable our conclusions are (across lemmas, senses, periods)

The case study on machine would allow us to go deeper into semantic change around the concept and the perception of different senses/meanings by readers (both experts (so us) and the crowd)

What do others think? (I am updating the TLDR with this passage from Kaspar)

fedenanni commented 3 years ago

I am adding some notes while reading about evaluation frameworks for WSD. Regarding hyponyms and synonyms of a given sense (say machine_sense1), we should check whether each of these lemmas (say hyp_01) has only one sense (i.e. is monosemous). In that case the lemma hyp_01 and machine_sense1 are direct lexical substitutes and there is no sense ambiguity between them. We would need to either exclude such lemmas or evaluate them separately from more complex synonyms / hyponyms (lemmas that have more than one sense, not all of which are related to machine_sense1). To know more, see Table 1, page 844 here, and the relation between "coke" and "Pepsi": basically, if you have Pepsi in a sentence, it is way easier to predict the correct sense of coke, because Pepsi is monosemous.
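
A quick monosemy check, sketched with NLTK's WordNet as a stand-in for the OED sense inventory (the real check would query our dictionary data instead; the hyponym list below is made up):

```python
from nltk.corpus import wordnet as wn

def is_monosemous(lemma: str) -> bool:
    """True if the lemma has exactly one noun sense, like "Pepsi"."""
    return len(wn.synsets(lemma, pos=wn.NOUN)) == 1

# Hypothetical usage: separate trivial substitutes from genuinely ambiguous ones.
hyponyms = ["locomotive", "engine"]  # made-up descendants of machine_sense1
easy = [h for h in hyponyms if is_monosemous(h)]
hard = [h for h in hyponyms if not is_monosemous(h)]
```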

BarbaraMcG commented 3 years ago

It may be useful to remind ourselves about what we said our tasks are (from #28 ):

Given a set of senses of a target lemma (e.g. machine001) from a historical dictionary+thesaurus (e.g. OED+HTOED) and a time period (e.g. 1800-1914):

  1. can we find synonyms for this set of senses?
  2. Can we find sentences where these senses are realised? These sentences may contain:
     2.1. the target lemma (sense labelling task), or
     2.2. other lemmas that are "(sense-)synonyms" of the target lemma, i.e. synonyms of that given sense in the given time period (sense-synonym task).

2.1 and 2.2 can, in fact, be seen as separate tasks, as 2.2 builds on 2.1: first, we start from a polysemous lemma (e.g. machine) and a new sentence, and try to assign the right sense of this lemma in this sentence; this way, we build our knowledge of the sense profiles of this lemma; then, we can generalise to other lemmas by using these sense profiles.

We could start from 2.1, which is effectively a WSD task and design the evaluation based on this. BUT, we should avoid the temptation to just do yet another WSD paper because our unique selling point is the historical dimension, both in terms of corpus and dictionary. So, we could try and test the hypothesis that adding temporal information and historical-lexicographic information helps the WSD task.

fedenanni commented 3 years ago

@BarbaraMcG thanks! I completely agree on 2.1 with the selling point of historical context and temporal information

mcollardanuy commented 3 years ago

So, we could try and test the hypothesis that adding temporal information and historical-lexicographic information helps the WSD task.

I think this makes sense. In this case, the date of the quotation is one of the criteria we should also take into account when creating the training and test sets. I like the idea of having a general and generalizable WSD framework (with a historical focus both in the method and in the evaluation), but using machine as a case study.

BarbaraMcG commented 3 years ago

Task (WSD with time dimension): given a lemma, a historical dictionary sense inventory, and a sentence containing that lemma, match the sentence with the correct sense of the lemma.

Variation (case study on machine): given a chosen sense, match a sentence with it or not (1 vs All).

BarbaraMcG commented 3 years ago

Ways to integrate time:

  1. weigh quotations more if they are from the same time period and less if they are more distant (see the sketch after this list)
  2. ...
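
A minimal sketch of idea 1, assuming an exponential decay over the year gap; the decay rate `tau` is a made-up hyperparameter that would need tuning:

```python
import math

def quotation_weight(quotation_year: int, sentence_year: int, tau: float = 50.0) -> float:
    """Weight a dictionary quotation by its temporal distance to the sentence."""
    return math.exp(-abs(quotation_year - sentence_year) / tau)

# e.g. quotation_weight(1850, 1860) ~ 0.82, quotation_weight(1700, 1860) ~ 0.04
```
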
BarbaraMcG commented 3 years ago

TO DOs:

kasparvonbeelen commented 3 years ago

Some notes from our after-discussion. @kasra-hosseini @mcollardanuy please chip in and edit this comment directly.

Upon reflection, we thought it better to prioritize the binary (or targeted) sense-disambiguation task, instead of starting with general WSD.

Targeted sense disambiguation looks as follows: classify tokens in a text as "belonging to" a sense in the dictionary (or not). What we mean by "belonging to" is something we could discuss, but let's assume it means "token Y is equal to sense X" or "token Y is synonymous with sense X".
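
One way to lay out an example for this binary task, with hypothetical field names (each instance pairs a token in context with one candidate dictionary sense, and the label records whether the token "belongs to" that sense):

```python
from dataclasses import dataclass

@dataclass
class TargetedExample:
    sentence: str   # the full context, e.g. a newspaper sentence
    token: str      # the token to disambiguate, e.g. "locomotive"
    sense_id: str   # the candidate sense, e.g. "machine_sense1"
    gloss: str      # the dictionary definition text for that sense
    label: bool     # True = token belongs to the sense, False = it does not
```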

The procedure could look as follows (thinking of a simple baseline):

There are other ways, of course, this is just a baseline (that probably won't work very well).

Some comments:

Why targeted (binary) instead of general (multiclass) WSD:

To do:

fedenanni commented 3 years ago

Hi all and thanks for this! I agree that this is more in line with the final goal, and the aggregation of senses would also work well with the crowdsourcing data (as they are annotating groups of senses already).

If we all agree on this as the final task, @mcollardanuy and I can quickly re-adapt the eval framework while you finish the PR

kasparvonbeelen commented 3 years ago

Also noting down @kasra-hosseini's idea to apply the intuition behind the SemAxis paper to find a contrastive machine vs not-machine dimension in the contextual embeddings: https://arxiv.org/abs/1806.05521

The algorithm could look like:
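
A sketch of the SemAxis intuition applied to contextual embeddings: build an axis from the mean "machine" vector minus the mean "not-machine" vector, then score new token embeddings by cosine similarity to that axis. How the two pools are collected and where to put the decision threshold are assumptions to be worked out:

```python
import numpy as np

def semaxis_score(token_vec, machine_vecs, not_machine_vecs):
    """Positive scores lean towards the machine sense, negative away from it."""
    axis = np.mean(machine_vecs, axis=0) - np.mean(not_machine_vecs, axis=0)
    return float(np.dot(token_vec, axis)
                 / (np.linalg.norm(token_vec) * np.linalg.norm(axis)))
```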

BarbaraMcG commented 3 years ago

Great! I'm totally convinced by the binary task. I have one question: how would the time dimension be incorporated?

kasparvonbeelen commented 3 years ago

@BarbaraMcG . Good point. I think we can still make it diachronic, similar to the way we discussed it earlier: we want to disambiguate a quotation for a given year, and therefore make the method time-sensitive. Whether this improves the accuracy remains to be seen, but let's hope it does!

BarbaraMcG commented 3 years ago

Yes, I hope so too, but in any case it would be interesting to see.