asoroa / ukb

Ukb: graph-based WSD and similarity

partial disambiguated corpus #14

Closed arademaker closed 2 years ago

arademaker commented 2 years ago

Can I say to UKB that some words from a context were already disambiguated? So that UKB can use the senses already defined and just decide the senses for the remaining ambiguous words?

asoroa commented 2 years ago

Yes, the input context can contain "concept" nodes that activate specific senses. See "3. Input context" in

https://github.com/asoroa/ukb/blob/master/src/README

Admittedly, the explanation in the README could be improved. If you have an example in mind, I can help you create the proper context.

arademaker commented 2 years ago

Thank you @asoroa for calling my attention to https://github.com/asoroa/ukb/blob/master/src/README#L278-L288. I hadn't paid attention to the values above 1 for the fourth field before.

arademaker commented 2 years ago

If I got it right, values 3 and 4 may be used when a given word already has more than one possible sense defined, am I right?

asoroa commented 2 years ago

yes, that is correct.

arademaker commented 2 years ago

I have been completing the Glosstag corpus (see http://arademaker.github.io/bibliography/gwc-2019-glosstag.html). I want to use UKB to: 1) complete the glosstag corpus; 2) evaluate how the completion of glosstag improves UKB results.

In http://wn.mybluemix.net/synset?id=02431834-v, the example is "China broke with Russia". Both China and Russia are ambiguous in such a small context (actually, most examples and definitions are too short for WSD, even by humans). So I would like to give UKB the input with values 3 or 4 for China and Russia, but I really didn't get the difference between 3 and 4 or how to use them... any idea?

arademaker commented 2 years ago

In particular, what does it mean to have a term that is not used by PR but is disambiguated?

3: the term is not used in PageRank calculation but is disambiguated...meaning that we want to disambiguate 'man' but we don't want to activate in PageRank, because all its concepts are already in the context with the desired weights.

I really didn't understand "not activate in PR because all its concepts are already in the context with weights... "

arademaker commented 2 years ago

For

4: the term is not used in PageRank calculation and is not disambiguated. This is like '3' above, and is used for grouping 'concept' elements (also, we can attach a weight to this '4' element).

So this is similar to ignoring the word? BTW, I just realized that ZERO is not for ignoring, since the word will still be used in the PR calculation, and from the examples I now get that you normally remove the function words, right? So articles, prepositions, conjunctions, punctuation, etc. are all removed, right?

asoroa commented 2 years ago

short answer: just use '3'

Long answer:

When nodes are activated in PR calculation, each one of them gets an initial weight. The weight is calculated as a multiplication of word weights and concept weights. So, an input context such as:

China##id1#3#2 n1##id2#2#8 n2##id3#2#2 Russia##id4#3#7 n4##id5#2#5 n5##id6#2#5

would activate nodes n1, n2, n4 and n5, each with a weight computed as a product of relative weights.

For example, the weight of n1 comes from multiplying the relative weight of 'China' (2/9) by the relative weight of n1 (8/10). After PR, and because both 'China' and 'Russia' have control code '3', ukb will output the best node (sense) for each of the words.
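The arithmetic for this context can be sketched in a few lines of Python. This is only an illustration of the rule described above, not ukb's actual code: the function names are hypothetical, and it assumes that each concept (control code '2') belongs to the nearest preceding word with control code '3' or '4', and that "relative weight" means a weight divided by the sum of the weights at the same level.

```python
from fractions import Fraction

CONTEXT = ("China##id1#3#2 n1##id2#2#8 n2##id3#2#2 "
           "Russia##id4#3#7 n4##id5#2#5 n5##id6#2#5")

def parse(context):
    """Split each token into (name, id, control code, weight)."""
    items = []
    for token in context.split():
        # Tokens look like name#pos#id#code#weight; pos is empty here.
        name, _pos, ident, code, weight = token.split("#")
        items.append((name, ident, int(code), int(weight)))
    return items

def initial_weights(context):
    """Hypothetical helper: word relative weight times concept relative weight."""
    groups = []  # one (word, word_weight, [(node, node_weight), ...]) per word
    for name, _ident, code, weight in parse(context):
        if code in (3, 4):
            # Assumption: a '3' or '4' word opens a group of concepts.
            groups.append((name, weight, []))
        elif code == 2 and groups:
            # Assumption: a concept joins the most recent group.
            groups[-1][2].append((name, weight))
    word_total = sum(w for _, w, _ in groups)
    result = {}
    for _word, w, nodes in groups:
        node_total = sum(nw for _, nw in nodes)
        for node, nw in nodes:
            result[node] = Fraction(w, word_total) * Fraction(nw, node_total)
    return result

for node, weight in initial_weights(CONTEXT).items():
    print(node, weight)
```

Under these assumptions n1 gets 2/9 * 8/10, n2 gets 2/9 * 2/10, and n4 and n5 each get 7/9 * 5/10, so the four initial weights sum to 1.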

This context:

China##id1#3#2 n1##id2#2#8 n2##id3#2#2 Russia##id4#4#7 n4##id5#2#5 n5##id6#2#5

would activate the same nodes with the exact same weights, but ukb will output nothing for 'Russia' (because its control code is '4').
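As a rough illustration of that difference, the following Python sketch lists which grouping words would get a sense in the output for each of the two contexts. It only models control codes '3' and '4' as described in this thread; the function name is hypothetical and this is not how ukb itself is implemented.

```python
CONTEXT_3 = ("China##id1#3#2 n1##id2#2#8 n2##id3#2#2 "
             "Russia##id4#3#7 n4##id5#2#5 n5##id6#2#5")
CONTEXT_4 = ("China##id1#3#2 n1##id2#2#8 n2##id3#2#2 "
             "Russia##id4#4#7 n4##id5#2#5 n5##id6#2#5")

def grouping_words_with_output(context):
    """Hypothetical helper: names of grouping words that ukb would disambiguate."""
    words = []
    for token in context.split():
        # Tokens look like name#pos#id#code#weight; pos is empty here.
        name, _pos, _ident, code, _weight = token.split("#")
        if code == "3":
            # '3': not activated in PageRank itself, but a sense is output.
            words.append(name)
        # '4': same grouping role, but nothing is output for the word.
        # '2': a concept node, not a word to disambiguate.
    return words

print(grouping_words_with_output(CONTEXT_3))
print(grouping_words_with_output(CONTEXT_4))
```

With the first context both 'China' and 'Russia' are listed; with the second only 'China' is.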

Hope this helps.

arademaker commented 2 years ago

nice! Thank you so much for your help!

So, just to confirm, we don't have a code for non-content words (articles, prepositions, etc), right? They can be just removed from the context. Of course, we all know that prepositions are relevant to the semantics of verbs and the selection of their complements (their valence if you will) ... But I guess UKB is agnostic to that discussion. If a KB includes prepositions, nodes for them can be used by UKB... So it is all about what KB you decide to use, right?

asoroa commented 2 years ago

yup, it all depends on the KB you are using.