ProjetPP / PPP-QuestionParsing-Grammatical

Question Parsing module for the PPP using a grammatical approach
GNU Affero General Public License v3.0

ConceptNet5 : nounification #77

Closed yhamoudi closed 9 years ago

yhamoudi commented 9 years ago

How to perform nounification using ConceptNet5?

yhamoudi commented 9 years ago

First remarks:

yhamoudi commented 9 years ago

(by @Ezibenroc)

  • actually, if i'm not mistaken, the algorithm (...)

-> Good point. So we would need to split the search in two parts: one where the searched word is on the left-hand side, and one where it is on the right-hand side, depending on the relation. But our code does not perform lemmatization: bagel returns 317 results whereas bagels returns 0 results. I don't know how they do this lemmatization.

When installing ConceptNet on my computer, I saw a dependency on NLTK. Maybe they use its lemmatizer (which returns very good results).
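
For illustration, a naive suffix-stripping lemmatizer might look like the following (a toy sketch only; a real lemmatizer such as NLTK's WordNetLemmatizer also handles irregular forms like "mice" -> "mouse"):

```python
def naive_lemmatize(word):
    # Extremely naive plural stripping, just to illustrate the idea;
    # a proper lemmatizer handles irregular forms and other inflections.
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

print(naive_lemmatize("bagels"))   # bagel
print(naive_lemmatize("queries"))  # query
```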

  • how does the parameter 'limit' work (...)

limit is just the maximum number of returned results. It is very strange that we do not obtain this relation...

yhamoudi commented 9 years ago

The normalization has been added to the file (using ConceptNet's normalization tool).

yhamoudi commented 9 years ago

Some progress:

So we have to find a way to extract all the edges that have an endpoint that is exactly the input concept (and not all the edges for which the input concept is a prefix of one of the two endpoints). It would be good to perform queries using the optional arguments provided by Search (especially being able to select all the edges that carry a specified relation such as RelatedTo), and it is not certain we can do this with Lookup.
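
A sketch of that filtering step, assuming edges are dicts with 'start', 'end' and 'rel' fields as in the ConceptNet JSON output (the function name and sample data are hypothetical):

```python
def edges_with_exact_endpoint(edges, concept_uri, relations):
    # Keep only edges where one endpoint is exactly the input concept
    # (not merely a URI that has the concept as a prefix) and whose
    # relation belongs to the given set, e.g. {'/r/RelatedTo'}.
    return [e for e in edges
            if e['rel'] in relations
            and concept_uri in (e['start'], e['end'])]

edges = [
    {'start': '/c/en/bagel', 'rel': '/r/RelatedTo', 'end': '/c/en/bread'},
    {'start': '/c/en/bagels_and_lox', 'rel': '/r/RelatedTo', 'end': '/c/en/food'},
]
print(edges_with_exact_endpoint(edges, '/c/en/bagel', {'/r/RelatedTo'}))
```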

Ezibenroc commented 9 years ago

If we want to use normalized_concept_name('en', 'elected') to normalize "elected", maybe we should not use the API for the nounification, but make direct calls.

yhamoudi commented 9 years ago

Why do you distinguish left and right relations (in https://github.com/ProjetPP/PPP-QuestionParsing-Grammatical/commit/64ec2cc6df2e40082f6407638f6b808bbb65e613)? Most of the relations seem to be symmetric. Instead, we should test where the input URI is and then take the other side of the edge.
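
A minimal sketch of that idea (hypothetical helper, assuming edges are dicts with 'start' and 'end' fields):

```python
def other_endpoint(edge, uri):
    # Return the endpoint of the edge that is not the input concept,
    # or None if the input concept is not an endpoint at all.
    if edge['start'] == uri:
        return edge['end']
    if edge['end'] == uri:
        return edge['start']
    return None

edge = {'start': '/c/en/elect', 'end': '/c/en/vote'}
print(other_endpoint(edge, '/c/en/elect'))  # /c/en/vote
print(other_endpoint(edge, '/c/en/vote'))   # /c/en/elect
```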

(I'll push this in a second)

yhamoudi commented 9 years ago

It seems better to use /c/bla/bli rather than /c/bla/bli/ for a URI (I got an empty set with the second notation).

yhamoudi commented 9 years ago

some of the available fields: https://github.com/commonsense/conceptnet5/wiki/Edges

yhamoudi commented 9 years ago

The weight of an edge does not seem relevant for finding the best noun.

yhamoudi commented 9 years ago

We can start thinking about how to choose the most relevant nodes. I propose the following algorithm:

The better our algorithm is, the smaller the parameter x needs to be.

yhamoudi commented 9 years ago

Now we keep only the words w that are nouns. To do this, I send the word w to the Stanford Parser and look at the POS tag (noun = NN). We could try another algorithm/library that performs only this task (and not the full parsing done by the Stanford Parser); it could be quicker (but I didn't find another tool, except NLTK).
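
The filtering step could be sketched like this (hypothetical names; pos_tag stands in for whatever tagger is plugged in, e.g. a wrapper around the Stanford Parser):

```python
def keep_nouns(candidates, pos_tag):
    # pos_tag maps a word to its POS tag; nouns are tagged 'NN',
    # 'NNS', 'NNP', ... so we keep tags starting with 'NN'.
    return [w for w in candidates if pos_tag(w).startswith('NN')]

# Stand-in tagger for demonstration:
fake_tags = {'election': 'NN', 'elect': 'VB', 'voter': 'NN'}
print(keep_nouns(['election', 'elect', 'voter'], fake_tags.get))
# ['election', 'voter']
```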

A trick to parse faster (?): concatenate all the candidates w into a string s, perform only one parse on s, and then look at each POS tag in the result.

yhamoudi commented 9 years ago

Hmm, there is still a prefix problem with the Lookup method. For instance, I perform a request on elect and I don't manage to obtain elect -DerivedFrom-> vote (which appears here: http://conceptnet5.media.mit.edu/web/c/en/elect). Instead I get relations such as: /c/en/elect/v/select_by_a_vote_for_an_office_or_membership -DerivedFrom-> /c/en/election/n/a_vote_to_select_the_winner_of_a_position_or_political_office

Even if I set the limit parameter to 1000, none of the DerivedFrom relations are good (they do not have /c/en/elect but /c/en/elect/... instead).

Ezibenroc commented 9 years ago

Instead i've relations such as : (...)

This is why I put the slash at the end.

Ezibenroc commented 9 years ago

concatenate all the candidates w into a string s, perform only one parsing on s, and then look at each POS tag in the result

The POS tag depends on the context (e.g. the word «fix» can be a noun or a verb); I think you could mess everything up if you do it like this...

yhamoudi commented 9 years ago

This is why I put the slash at the end.

? It doesn't change anything. However, according to this: https://github.com/commonsense/conceptnet5/wiki/URI-hierarchy#concept-uris it's not really a question of prefix. The question is: what method is used here http://conceptnet5.media.mit.edu/web/c/en/elect and how do we get the same result?

yhamoudi commented 9 years ago

The POS tag depends of the context (e.g. the word «fix» can be a noun or a verb), I think you can mess everything if you do like this...

Yes, but we don't have any context. Moreover, I think the POS tagger (the Stanford Parser at least) gives the tag NN in case of ambiguity. So it's the best we can do.

Ezibenroc commented 9 years ago

similarity between w and w0

I saw something in ConceptNet to get a similarity score between two words (certainly semantic similarity, not spelling). I think it was in association.

yhamoudi commented 9 years ago

I saw something in conceptnet to have a score of similarity between two words (certainly semantic similarity, not spelling). I think it was in association.

Yes, probably the same thing as the weight. But we can give it a small part in the score if necessary.

Concerning the POS tagger: it seems normal to tag a single word (i.e. a word that appears in a sentence of only one word) as NN (a one-word sentence consisting of a verb is stranger). However, if we concatenate all the words w to obtain their POS tags in a single parse, there is a risk that some of them are interpreted as verbs (e.g. elector fix vote).

I think I've understood this prefix problem. With our algorithm we obtain: /c/en/elect/v/select_by_a_vote_for_an_office_or_membership -DerivedFrom-> /c/en/voter/n/a_citizen_who_has_a_legal_right_to_vote. According to https://github.com/commonsense/conceptnet5/wiki/URI-hierarchy#concept-uris, everything after the fourth / (here: /v/select_by_a_vote_for_an_office_or_membership and /n/a_citizen_who_has_a_legal_right_to_vote) is additional optional info. If we remove it, we have elect -DerivedFrom-> voter, which appears here: http://conceptnet5.media.mit.edu/web/c/en/elect
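
Under that reading of the URI hierarchy, truncating every URI to its first three components (/c/&lt;lang&gt;/&lt;word&gt;) before comparing should fix the problem; a minimal sketch (hypothetical helper name):

```python
def truncate_concept_uri(uri):
    # Keep only /c/<language>/<word>; everything after the fourth '/'
    # (part of speech, sense disambiguation) is optional extra info.
    return '/'.join(uri.split('/')[:4])

print(truncate_concept_uri('/c/en/elect/v/select_by_a_vote_for_an_office_or_membership'))
# /c/en/elect
```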

yhamoudi commented 9 years ago

Previous thing done. Now we should really get the same result as on the demo website.

Ezibenroc commented 9 years ago

Quick test on this list of 20 verbs: ['die','born','wrote','directed','play','ran','jump','walk','hide','dive','drive','fall','climb','ride','dance','wash','cook','repair','build','fly']

Replace the end of the code by:

if __name__ == "__main__":
    for foo in ['die', 'born', 'wrote', 'directed', 'play', 'ran', 'jump', 'walk', 'hide', 'dive', 'drive', 'fall', 'climb', 'ride', 'dance', 'wash', 'cook', 'repair', 'build', 'fly']:
        word = normalize(default_language, foo)
        uri = "/c/{0}/{1}".format(default_language, word)
        print(associatedWords(uri, word, {'/r/RelatedTo', '/r/DerivedFrom'}))

Then, here is the time needed to run the script:

real  1m25.904s
user  0m3.772s
sys   0m1.612s

The huge difference between real and user+sys means that the CPU is often idle (certainly waiting for I/O in the database). 85.904s of real time means 4.3s per word, which is way too slow...
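
To spell out the arithmetic as a sanity check on the time(1) figures quoted above:

```python
# Figures from the time(1) output: real 85.904s, user 3.772s, sys 1.612s
real, user, sys_ = 85.904, 3.772, 1.612
cpu_busy_fraction = (user + sys_) / real  # fraction of wall time the CPU was working
per_word = real / 20                      # 20 words in the test list

print(round(cpu_busy_fraction, 2))  # 0.06 -> ~94% of the time is spent waiting
print(round(per_word, 2))           # 4.3 seconds per word
```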

Ezibenroc commented 9 years ago

Unless we manage to improve this time (by at least a factor of 10), this solution is not feasible, and we should keep NLTK.

yhamoudi commented 9 years ago

It's probably due to the use of the Stanford Parser; naturally we cannot keep the current part of the algorithm that calls the Stanford Parser on every single candidate. Perhaps we can also speed up ConceptNet by running our own server.

(I prefer a correct but slow algorithm to a fast algorithm with really poor results.)

yhamoudi commented 9 years ago

I'm rewriting the structure of the file conceptnet_local, please do not push big changes.

progval commented 9 years ago

I think you face the same issue as the Wikidata and HAL modules: you have twenty requests and you make them one at a time. There are two solutions:

However, both of them require you to schedule requests in advance, which is sometimes tricky to implement.
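
One way to run such I/O-bound requests concurrently (a hypothetical sketch, not the module's actual code) is a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def lookup_all(lookup, uris, workers=10):
    # Issue the (I/O-bound) lookups concurrently instead of one at a
    # time; threads are fine here since the CPU is mostly idle waiting
    # on the network or database.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lookup, uris))

# Usage with a stand-in lookup function (here just the string length):
print(lookup_all(len, ['/c/en/die', '/c/en/born']))  # [9, 10]
```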

Ezibenroc commented 9 years ago

I did more precise measures with https://github.com/ProjetPP/PPP-QuestionParsing-Grammatical/commit/f7138b856b1273455b9dd2d40da5d09f6dda37fe

Now it takes 10s to 20s to handle a single word. Unsurprisingly, all the time is spent eliminating the candidates, and in particular in the queries to the Stanford library.

yhamoudi commented 9 years ago

I increased the limit to 350 candidates, so that's expected (but it's not definitive, it was just for testing).

The longest operations are:

1. querying ConceptNet
2. performing the elimination (excluding the Stanford Parser)
3. performing POS tagging with the Stanford Parser

For step 1, we have to use a locally running server before concluding anything. It's to be expected that querying ConceptNet as we do now is slower than using a server.

For step 3, it's temporary. If it's efficient, we can re-implement the part of the Stanford Parser that performs POS tagging.

It would be strange if step 2 took such a long time.

Ezibenroc commented 9 years ago

Maybe an interesting link about POS tagging (I did not read it entirely): https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/

progval commented 9 years ago

FYI, I tried to speed it up by running the requests in parallel:

@@ -132,18 +133,27 @@ def buildCandidate(pattern,edge):
     else:
         return None
+import multiprocessing
+import uuid
+g_pattern = {}
+def f(x):
+    (foo, e) = x
+    pattern = g_pattern[foo]
+    cand = buildCandidate(pattern,e)
+    if cand != None and cand.tag != -1:
+        return cand
 def associatedWords(pattern,relations):
     uri = "/c/{0}/{1}".format(default_language,pattern)
     r = list(lookup(uri,limit=350))
     CLOCK.time_step("lookup")
     #for e in r:
     #    print(e['start'] + ' ' + e['rel'] + ' ' + e['end'])
-    res = []
-    for e in r:
-        if e['rel'] in relations:
-            cand = buildCandidate(pattern,e)
-            if cand != None and cand.tag != -1:
-                res.append(cand)
+    foo = uuid.uuid4()
+    g_pattern[foo] = pattern
+    l = [(foo, e) for e in r if e['rel'] in relations]
+    with multiprocessing.Pool(2) as p:
+        res = p.map(f, l)
+    del g_pattern[foo]
+    res = list(filter(bool, res))
     #for cand in res:
     #    print(cand.word + ' ' + str(cand.weight))
     CLOCK.time_step("buildCandidate")

But the execution time is exactly the same.

yhamoudi commented 9 years ago

I've just added a new demo file conceptnet_server.py that makes queries to conceptnet using a server.

How to use it:

I think it's quicker now (and a lot of improvements are still possible).

@ProgVal, when they say to set up a WSGI server here: https://github.com/commonsense/conceptnet5/wiki/Running-your-own-copy what do they mean by "more robust"?

progval commented 9 years ago

It means that HTTP servers written in a naive way are very inefficient and can easily be DoSed. Implementing a WSGI interface lets you provide the service through a well-written HTTP server (nginx, lighttpd, Apache (more arguably), …).
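
Concretely, the service only has to expose a WSGI callable (minimal sketch below); any production-grade server can then host it:

```python
def application(environ, start_response):
    # Minimal WSGI callable: a robust HTTP server (gunicorn, uWSGI,
    # often behind nginx) calls this once per request, instead of the
    # service rolling its own HTTP handling.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello from conceptnet\n']
```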

Ezibenroc commented 9 years ago

I think a POS tagger is not the best tool for what we want, since it is designed for context-dependent tagging.

A simple dictionary that maps each word to the set of its possible parts of speech would be quicker. Then, we would keep a word only if it can possibly be a noun.

Unfortunately, I cannot find a good multilingual dictionary providing the part of speech (it seems Aspell does not do that).
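
The lookup itself would be trivial; a sketch with a hypothetical precomputed dictionary (the entries here are only illustrative):

```python
# Hypothetical precomputed dictionary mapping words to their possible
# parts of speech (it would have to be built once per language).
pos_dict = {
    'vote':  {'noun', 'verb'},
    'elect': {'verb'},
    'voter': {'noun'},
}

def can_be_noun(word):
    # Keep the word only if 'noun' is among its possible parts of speech.
    return 'noun' in pos_dict.get(word, set())

print([w for w in ['vote', 'elect', 'voter'] if can_be_noun(w)])
# ['vote', 'voter']
```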

yhamoudi commented 9 years ago

I've just thrown a "hook" on Stack Overflow about this question. Let other people think about it...

Ezibenroc commented 9 years ago

Can you give the link?

yhamoudi commented 9 years ago

http://stackoverflow.com/questions/28033882/determining-wheter-a-word-is-a-noun-or-not

yhamoudi commented 9 years ago

Good idea: I post questions on Stack Overflow and you upvote them :)

Ezibenroc commented 9 years ago

Phase 1: all PPP members must have a high reputation on Stack Overflow. Phase 2: platypus proselytism can begin.

yhamoudi commented 9 years ago

I DO NOT WANT NLTK http://stackoverflow.com/a/28034218/3476917

yhamoudi commented 9 years ago

How many nouns can we expect in English? (And how fast is it to search the set of all nouns?)

Ezibenroc commented 9 years ago

This proposition uses NLTK only for precomputation. I find it quite good.

how many nouns can we expect in english? (and how fast it is to perform search into the set of all the nouns?)

Not only in English: we would need such a set for each supported language.

yhamoudi commented 9 years ago

Yes, it's not really about NLTK. But is it realistic to perform searches in such a big set?

yhamoudi commented 9 years ago

Did you manage to run their algorithm? I get AttributeError: 'function' object has no attribute 'split' after running the second line.

Ezibenroc commented 9 years ago

They forgot the parentheses after name: nouns = {x.name().split('.', 1)[0] for x in wn.all_synsets('n')}

yhamoudi commented 9 years ago

67176 elements

yhamoudi commented 9 years ago

It's fast, but don't we lose a lot of nouns? (Only 67176 nouns in English?)

We could do the opposite thing:

Ezibenroc commented 9 years ago

That's small: log(67k) < 17. And a lot of words seem to be in this list: of the 20 verbs I gave, only 4 are not in it. It seems OK to me.
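
In fact, with a Python set the lookup is even cheaper than a binary search: membership is an average-case O(1) hash lookup, regardless of the set's size (a toy sketch with a stand-in set):

```python
# Stand-in set; the real one would hold the 67176 nouns from WordNet.
nouns = {'bagel', 'vote', 'election'}

# Set membership is a hash lookup, O(1) on average, so even a
# 67k-entry set is cheap to query.
print('vote' in nouns)   # True
print('elect' in nouns)  # False
```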

yhamoudi commented 9 years ago

What are the words that don't occur in the set?

Ezibenroc commented 9 years ago

['wrote', 'directed', 'ran', 'build']

yhamoudi commented 9 years ago

But they're not nouns?

Ezibenroc commented 9 years ago

I don't think so. Check in a dictionary to be sure.

Ezibenroc commented 9 years ago

https://github.com/ProjetPP/PPP-QuestionParsing-Grammatical/commit/24ebdb9cf04c7b93509ee15d0dd29f1e11d2abce

Tested on 207 words. Negligible time to handle all of them. 91 of them were said to be nouns.