fedenanni opened 3 years ago
We need to decide the proportions of the training/validation/test splits, the sampling strategy, and how to avoid bias
@fedenanni , @mcollardanuy and @GiorgiatolfoBL will take this forward
@kasparvonbeelen @kasra-hosseini Tagging you just to be sure we are all in the loop. I think the main question is whether we want to have two evaluation settings or a single one.
If we have a single one we need to consider:
a. a balanced distribution of senses in train and test
b. if we have a descendant with only one label, this should appear either in train or in test
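To make point (a) and (b) concrete, here is a minimal sketch of a stratified split that keeps the sense distribution balanced and sends senses with a single labelled quotation entirely to one split (train, in this sketch). The `(quotation, sense)` data format is an assumption, not our actual data model.

```python
# Sketch: stratified train/test split over (quotation, sense) pairs.
# Senses with only one labelled quotation go entirely to train (point b);
# all other senses are split proportionally (point a).
import random
from collections import defaultdict

def stratified_split(examples, test_frac=0.2, seed=42):
    rng = random.Random(seed)
    by_sense = defaultdict(list)
    for quotation, sense in examples:
        by_sense[sense].append((quotation, sense))
    train, test = [], []
    for sense, items in by_sense.items():
        if len(items) < 2:           # singleton sense: keep in train only
            train.extend(items)
            continue
        rng.shuffle(items)
        n_test = max(1, int(len(items) * test_frac))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

examples = ([("q%d" % i, "sense_a") for i in range(8)]
            + [("q%d" % i, "sense_b") for i in range(8, 12)]
            + [("q12", "sense_c")])  # sense_c has only one quotation
train, test = stratified_split(examples)
```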
Feel free to add other things - just sketching it
Starting points for literature overview: *Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison* from the Navigli crew - I'll check it out later today
I just spoke with Stefano Faralli, who was in Navigli's group (and then a postdoc at DWS in Mannheim) - he suggests looking into the literature on:
@fedenanni , thanks for this! Some thoughts
@kasparvonbeelen super good point - I think if we want to go with ACL for instance, having a broad historical WSD evaluation (for instance on the top 10k nouns, considering lexicon expansion for them and diachronic information) + a specific case study on *machine* (with an associated crowdsourcing task on newspapers) would make a really well-rounded story
The broad evaluation would show how generalizable our conclusions are (across lemmas, senses, periods)
The case study on machine would allow us to go deeper into semantic change around the concept and the perception of different senses/meanings by readers (both experts (so us) and the crowd)
What do others think? (I am updating the TLDR with this passage from Kaspar)
I am adding some notes, while reading about evaluation frameworks on WSD. Regarding hyponyms and synonyms of a given sense (say `machine_sense1`), we should check whether each of these lemmas (say `hyp_01`) has only one sense (monosemous). In that case the lemma `hyp_01` and `machine_sense1` are direct lexical substitutes and there is no sense ambiguity between the lemma of `hyp_01` and `machine_sense1`. We would need to either exclude them or evaluate them separately from more complex synonyms / hyponyms (so lemmas that have more than one sense and not all of them will be related to `machine_sense1`). To know more, see Table 1, page 844 here and the relation between "coke" and "Pepsi"; basically, if you have `Pepsi` in a sentence, it will be way easier to predict the correct sense of `coke`, because Pepsi is monosemous.
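A toy sketch of the monosemy filter described above. The sense inventory here is a hypothetical stand-in for OED/HTOED lookups (lemma -> list of sense ids), not our actual data:

```python
# Hypothetical sense inventory: lemma -> list of sense ids.
SENSE_INVENTORY = {
    "pepsi": ["pepsi_nn01-s1"],                    # monosemous
    "coke": ["coke_nn01-s1", "coke_nn01-s2"],      # polysemous
    "engine": ["engine_nn01-s1", "engine_nn01-s2"],
}

def split_related_lemmas(related_lemmas, inventory):
    """Separate monosemous relatives (direct lexical substitutes, trivial
    to disambiguate) from polysemous ones with genuine sense ambiguity."""
    monosemous = [l for l in related_lemmas if len(inventory.get(l, [])) == 1]
    polysemous = [l for l in related_lemmas if len(inventory.get(l, [])) > 1]
    return monosemous, polysemous

mono, poly = split_related_lemmas(["pepsi", "coke", "engine"], SENSE_INVENTORY)
# mono == ["pepsi"]; poly == ["coke", "engine"]
```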
It may be useful to remind ourselves about what we said our tasks are (from #28 ):
Given a set of senses of a target lemma (e.g. machine001) from a historical dictionary+thesaurus (e.g. OED+HTOED), a time period (e.g. 1800-1914):
2.1 and 2.2 can, in fact, be seen as separate tasks, as 2.2 builds on 2.1: first, we start from a polysemous lemma (e.g. machine) and a new sentence, and try to assign the right sense of this lemma in this sentence; this way, we build our knowledge of the lemma's sense profiles; then, we can generalise to other lemmas by using these sense profiles.
We could start from 2.1, which is effectively a WSD task and design the evaluation based on this. BUT, we should avoid the temptation to just do yet another WSD paper because our unique selling point is the historical dimension, both in terms of corpus and dictionary. So, we could try and test the hypothesis that adding temporal information and historical-lexicographic information helps the WSD task.
@BarbaraMcG thanks! I completely agree on 2.1 with the selling point of historical context and temporal information
> So, we could try and test the hypothesis that adding temporal information and historical-lexicographic information helps the WSD task.
I think this makes sense. In this case, the date of the quotation is one of the criteria we should also take into account to create the training and test sets. I like the idea of having a general and generalizable WSD framework (with a historical focus both on the method and the evaluation), but using *machine* as a case study.
Task (WSD with a time dimension): given a lemma, a historical dictionary sense inventory, and a sentence containing that lemma, match the sentence with the correct sense of the lemma.
Variation (case study on *machine*): given a chosen sense, match a sentence with it or not (1 vs. All).
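The task interface could be sketched as below, under the assumption that each dictionary sense carries the date range in which it is attested; the names (`Sense`, `match_sense`, `score_fn`) are hypothetical, not from our codebase:

```python
# Sketch of a time-aware WSD interface: restrict the candidate senses to
# those attested at the quotation's year, then pick the best-scoring one.
from dataclasses import dataclass

@dataclass
class Sense:
    sense_id: str
    definition: str
    start_year: int  # first attested use in the dictionary
    end_year: int    # last attested use (or a large number if still current)

def match_sense(sentence, senses, year, score_fn):
    """Return the best-scoring sense among those attested at `year`."""
    candidates = [s for s in senses if s.start_year <= year <= s.end_year]
    if not candidates:  # fall back to the full inventory
        candidates = senses
    return max(candidates, key=lambda s: score_fn(sentence, s))
```

Here `score_fn` is a placeholder for whatever sentence-sense similarity we end up using (e.g. contextual-embedding similarity); the time filter is one simple way to make the method time-sensitive.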
Ways to integrate time:
TO DOs:
Some notes from our after-discussion. @kasra-hosseini @mcollardanuy please chip in and edit this comment directly.
Upon reflection, we thought it better to prioritize the binary (or targeted) sense-disambiguation task, instead of starting with general WSD.
Targeted sense disambiguation looks as follows: classify tokens in a text as "belonging to" a sense in the dictionary (or not). What we mean by "belonging to" is something we could discuss, but let's assume it means "token Y is *equal to* sense X" or "token Y is *synonymous* to sense X".
The procedure could look as follows (thinking of a simple baseline)
- Given a target sense (e.g. `machine as structure` is a sense of the lemma `machine_nn01`), we label the sense itself and all sense-synonyms and their quotations as `1`, the rest as `0`.
- We combine the quotations labelled `1` into one vector (the vector representing `machine as structure`); the rest we combine into the `other` vector.
- A token is then classified according to whether its embedding is closer to `machine as structure` or `other`.
There are other ways, of course, this is just a baseline (that probably won't work very well).
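The baseline steps above can be sketched as follows, with random vectors standing in for the contextual embeddings of quotations (which we would compute separately):

```python
# Sketch of the centroid baseline: average the embeddings of quotations
# labelled 1 (target sense + synonyms) and of those labelled 0 ("other"),
# then classify a token by cosine similarity to the two centroids.
import numpy as np

def centroid(vectors):
    return np.mean(vectors, axis=0)

def classify(token_vec, target_vec, other_vec):
    """1 if the token is closer (cosine) to the target sense than to other."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1 if cos(token_vec, target_vec) >= cos(token_vec, other_vec) else 0

rng = np.random.default_rng(0)
target_quotes = rng.normal(1.0, 0.1, size=(10, 16))   # label 1 quotations
other_quotes = rng.normal(-1.0, 0.1, size=(10, 16))   # label 0 quotations
target_vec, other_vec = centroid(target_quotes), centroid(other_quotes)
```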
Some comments:
Why targeted (binary) instead of general (multiclass) WSD:
To do:
Hi all and thanks for this! I agree that this is more in line with the final goal, and the aggregation of senses would work well also with the crowdsourcing data (as they are already annotating groups of senses).
If we all agree on this as the final task, @mcollardanuy and I can quickly re-adapt the eval framework while you finish the PR
Also noting down @kasra-hosseini idea to apply the intuition behind the SemAxis paper to find a contrastive machine vs not-machine dimension in the contextual embedding. https://arxiv.org/abs/1806.05521
The algorithm could look like:
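A hedged sketch of the SemAxis intuition from the linked paper (the seed vectors and function names here are illustrative assumptions): build an axis as the difference between the mean "machine" and mean "not-machine" seed embeddings, then score contextual embeddings by cosine similarity with that axis.

```python
# SemAxis-style contrastive axis: mean(positive seeds) - mean(negative seeds);
# a token's score is its cosine similarity with the axis.
import numpy as np

def semaxis(pos_seeds, neg_seeds):
    """pos_seeds / neg_seeds: arrays of shape (n, dim) of seed embeddings."""
    return np.mean(pos_seeds, axis=0) - np.mean(neg_seeds, axis=0)

def axis_score(vec, axis):
    return float(vec @ axis / (np.linalg.norm(vec) * np.linalg.norm(axis)))
```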
Great! I'm totally convinced on the binary task. I have one question: how would the time dimension be incorporated?
@BarbaraMcG good point. I think we can still make it diachronic, similar to the way we discussed earlier: we want to disambiguate a quotation for a given year, and therefore make the method time-sensitive. Whether this improves accuracy remains to be seen, but let's hope it does!
Yes, I hope so too, but in any case it would be interesting to see.
Our paper focuses around two word sense disambiguation (WSD) tasks:

1. We observe a word (e.g. `machine`) in a sentence and we have to predict which sense is the most appropriate, given a predefined list of `machine` senses.
2. We observe a word (e.g. `locomotive`) which is related to (at least one) sense of another word (say `machine`), and the task is to predict which sense of `machine` is more appropriate, given the word `locomotive` in a sentence. The relation could be a `synonym`, a `hyponym`, etc.

Goal: to define a clear evaluation setting (train/test splitting) and evaluation metrics
TLDR (main points from the discussion below - to be updated):
Currently blocked by #2