Living-with-machines / TargetedSenseDisambiguation

Repository for the work on Targeted Sense Disambiguation

Design evaluation setting and metrics #7

Open fedenanni opened 3 years ago

fedenanni commented 3 years ago

Our paper focuses on two word sense disambiguation (WSD) tasks:

Goal: to define a clear evaluation setting (train/test splitting) and evaluation metrics

TLDR (main points from the discussion below - to be updated):

Currently blocked by #2

BarbaraMcG commented 3 years ago

We need to decide on the proportions for the training/test/validation splits, on the sampling strategy, and on how to avoid bias

BarbaraMcG commented 3 years ago

@fedenanni , @mcollardanuy and @GiorgiatolfoBL will take this forward

fedenanni commented 3 years ago

@kasparvonbeelen @kasra-hosseini Tagging you just to be sure we are all in the loop. I think the main question is if we want to have two evaluation settings:

  1. only for "machine" mentions (I am using here machine just as an example)
  2. including both "machine" and "descendants" (for instance "locomotive" etc)

Or a single one.

If we have a single one, we need to consider:

  a. a balanced distribution of senses in train and test;
  b. if we have a descendant with only one label, it should appear either in train or in test, not both.
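
A minimal sketch of these two constraints, assuming a pandas DataFrame `df` with hypothetical columns `lemma`, `sense_id` and `sentence` (none of these names come from our actual data yet):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_quotations(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    # (b) descendants attested with a single sense label cannot be stratified,
    # so each of them goes wholly to one side of the split (here: train).
    n_senses = df.groupby("lemma")["sense_id"].transform("nunique")
    single, rest = df[n_senses == 1], df[n_senses > 1]

    # (a) stratify the remaining quotations on the sense label, so the
    # distribution of senses is (roughly) balanced across train and test.
    train, test = train_test_split(
        rest, test_size=test_size, stratify=rest["sense_id"], random_state=seed
    )
    return pd.concat([train, single]), test
```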

Feel free to add other things - just sketching it

fedenanni commented 3 years ago

Starting points for literature overview: *Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison* from the Navigli crew - I'll check it out later today

fedenanni commented 3 years ago

I just spoke with Stefano Faralli, who was in Navigli's group (and then a postdoc at DWS in Mannheim) - he suggests looking into the literature around:

kasparvonbeelen commented 3 years ago

@fedenanni , thanks for this! Some thoughts

fedenanni commented 3 years ago

@kasparvonbeelen super good point - I think if we want to go with ACL, for instance, having a broad historical WSD evaluation (for instance on the top 10k nouns, considering lexicon expansion for them and diachronic information) + a specific case study on machine (with an associated crowdsourcing task on newspapers) would make a really well-rounded story

The broad evaluation would show how generalizable our conclusions are (across lemmas, senses, periods)

The case study on machine would allow us to go deeper into semantic change around the concept and the perception of different senses/meanings by readers (both experts (so us) and the crowd)

What do others think? (I am updating the TLDR with this passage from Kaspar)

fedenanni commented 3 years ago

I am adding some notes while reading about evaluation frameworks for WSD. Regarding hyponyms and synonyms of a given sense (say machine_sense1), we should check whether each of these lemmas (say hyp_01) has only one sense (i.e. is monosemous). In that case the lemma hyp_01 and machine_sense1 are direct lexical substitutes and there is no sense ambiguity between them. We would need to either exclude such lemmas or evaluate them separately from more complex synonyms / hyponyms (lemmas that have more than one sense, not all of which are related to machine_sense1). To know more, see Table 1, page 844 here, and the relation between "coke" and "Pepsi": basically, if you have Pepsi in a sentence, it is way easier to predict the correct sense of coke, because Pepsi is monosemous.
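
A quick monosemy check, sketched with NLTK's WordNet as a stand-in for the OED sense inventory (the real check would query our dictionary data instead; the hyponym list below is made up):

```python
from nltk.corpus import wordnet as wn

def is_monosemous(lemma: str) -> bool:
    """True if the lemma has exactly one noun sense, like "Pepsi"."""
    return len(wn.synsets(lemma, pos=wn.NOUN)) == 1

# Hypothetical usage: separate trivial substitutes from genuinely ambiguous ones.
hyponyms = ["locomotive", "engine"]  # made-up descendants of machine_sense1
easy = [h for h in hyponyms if is_monosemous(h)]
hard = [h for h in hyponyms if not is_monosemous(h)]
```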

BarbaraMcG commented 3 years ago

It may be useful to remind ourselves about what we said our tasks are (from #28 ):

Given a set of senses of a target lemma (e.g. machine001) from a historical dictionary+thesaurus (e.g. OED+HTOED) and a time period (e.g. 1800-1914):

  1. can we find synonyms for this set of senses?
  2. Can we find sentences where these senses are realised? These sentences may contain:
     2.1. the target lemma (sense labelling task), or
     2.2. other lemmas that are "(sense-)synonyms" of the target lemma, i.e. synonyms of that given sense in the given time period (sense-synonym task).

2.1 and 2.2 can, in fact, be seen as separate tasks, as 2.2 builds on 2.1: first, we start from a polysemous lemma (e.g. machine) and a new sentence, and try to assign the right sense of this lemma in this sentence; this way, we build our knowledge of the sense profiles of this lemma; then, we can generalise to other lemmas by using these sense profiles.

We could start from 2.1, which is effectively a WSD task and design the evaluation based on this. BUT, we should avoid the temptation to just do yet another WSD paper because our unique selling point is the historical dimension, both in terms of corpus and dictionary. So, we could try and test the hypothesis that adding temporal information and historical-lexicographic information helps the WSD task.

fedenanni commented 3 years ago

@BarbaraMcG thanks! I completely agree on 2.1 with the selling point of historical context and temporal information

mcollardanuy commented 3 years ago

So, we could try and test the hypothesis that adding temporal information and historical-lexicographic information helps the WSD task.

I think this makes sense. In this case, the date of the quotation is one of the criteria we should also take into account when creating the training and test sets. I like the idea of having a general and generalizable WSD framework (with a historical focus both in the method and in the evaluation), but using machine as a case study.

BarbaraMcG commented 3 years ago

Task (WSD with time dimension): given a lemma, a historical dictionary sense inventory, and a sentence containing that lemma, match the sentence with the correct sense of the lemma.

Variation (case study on machine): given a chosen sense, match a sentence with it or not (1 vs All).

BarbaraMcG commented 3 years ago

Ways to integrate time:

  1. weigh quotations more if they are from the same time period and less if they are more distant (see the sketch after this list)
  2. ...
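
A minimal sketch of idea 1, assuming an exponential decay over the year gap; the decay rate `tau` is a made-up hyperparameter that would need tuning:

```python
import math

def quotation_weight(quotation_year: int, sentence_year: int, tau: float = 50.0) -> float:
    """Weight a dictionary quotation by its temporal distance to the sentence."""
    return math.exp(-abs(quotation_year - sentence_year) / tau)

# e.g. quotation_weight(1850, 1860) ~ 0.82, quotation_weight(1700, 1860) ~ 0.04
```
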
BarbaraMcG commented 3 years ago

TO DOs:

kasparvonbeelen commented 3 years ago

Some notes from our after-discussion. @kasra-hosseini @mcollardanuy please chip in and edit this comment directly.

Upon reflection, we thought it better to prioritize the binary (or targeted) sense-disambiguation task, instead of starting with general WSD.

Targeted sense disambiguation looks as follows: classify tokens in a text as "belonging to" a sense in the dictionary (or not). What we mean by "belonging to" is something we could discuss, but let's assume it means "token Y is equal to sense X" or "token Y is synonymous with sense X".
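
One way to lay out an example for this binary task, with hypothetical field names (each instance pairs a token in context with one candidate dictionary sense, and the label records whether the token "belongs to" that sense):

```python
from dataclasses import dataclass

@dataclass
class TargetedExample:
    sentence: str   # the full context, e.g. a newspaper sentence
    token: str      # the token to disambiguate, e.g. "locomotive"
    sense_id: str   # the candidate sense, e.g. "machine_sense1"
    gloss: str      # the dictionary definition text for that sense
    label: bool     # True = token belongs to the sense, False = it does not
```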

The procedure could look as follows (thinking of a simple baseline):

There are other ways, of course, this is just a baseline (that probably won't work very well).

Some comments:

Why targeted (binary) instead of general (multiclass) WSD:

To do:

fedenanni commented 3 years ago

Hi all and thanks for this! I agree that this is more in line with the final goal, and the aggregation of senses would also work well with the crowdsourcing data (as they are annotating groups of senses already).

If we all agree on this as the final task, @mcollardanuy and I can quickly re-adapt the eval framework while you finish the PR

kasparvonbeelen commented 3 years ago

Also noting down @kasra-hosseini's idea to apply the intuition behind the SemAxis paper to find a contrastive machine vs not-machine dimension in the contextual embeddings: https://arxiv.org/abs/1806.05521

The algorithm could look like:
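
A sketch of the SemAxis intuition applied to contextual embeddings: build an axis from the mean "machine" vector minus the mean "not-machine" vector, then score new token embeddings by cosine similarity to that axis. How the two pools are collected and where to put the decision threshold are assumptions to be worked out:

```python
import numpy as np

def semaxis_score(token_vec, machine_vecs, not_machine_vecs):
    """Positive scores lean towards the machine sense, negative away from it."""
    axis = np.mean(machine_vecs, axis=0) - np.mean(not_machine_vecs, axis=0)
    return float(np.dot(token_vec, axis)
                 / (np.linalg.norm(token_vec) * np.linalg.norm(axis)))
```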

BarbaraMcG commented 3 years ago

Great! I'm totally convinced by the binary task. I have one question: how would the time dimension be incorporated?

kasparvonbeelen commented 3 years ago

@BarbaraMcG . Good point. I think we can still make it diachronic, similar to the way we discussed it earlier: we want to disambiguate a quotation for a given year, and therefore make the method time-sensitive. Whether this improves the accuracy remains to be seen, but let's hope it does!

BarbaraMcG commented 3 years ago

Yes, I hope so too, but in any case it would be interesting to see.