ProjetPP / PPP-QuestionParsing-Grammatical

Question Parsing module for the PPP using a grammatical approach
GNU Affero General Public License v3.0

ConceptNet5 : nounification #77

Closed yhamoudi closed 9 years ago

yhamoudi commented 9 years ago

How to perform nounification using ConceptNet5?

yhamoudi commented 9 years ago

First remarks:

yhamoudi commented 9 years ago

(by @Ezibenroc)

  • actually, if i'm not mistaken, the algorithm (...)

-> Good point. So we would need to split the search in two parts: one where the searched word is on the left-hand side, and one where it is on the right-hand side, depending on the relation. But our code does not perform lemmatization: bagel returns 317 results whereas bagels returns 0 results. I don't know how they do this lemmatization.

When installing ConceptNet on my computer, I saw a dependency on NLTK. Maybe they use its lemmatizer (which returns very good results).
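
For illustration, a naive suffix-stripping lemmatizer might look like the following (a toy sketch only; a real lemmatizer such as NLTK's WordNetLemmatizer also handles irregular forms like "mice" -> "mouse"):

```python
def naive_lemmatize(word):
    # Extremely naive plural stripping, just to illustrate the idea;
    # a proper lemmatizer handles irregular forms and other inflections.
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

print(naive_lemmatize("bagels"))   # bagel
print(naive_lemmatize("queries"))  # query
```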

  • how does the parameter 'limit' work (...)

limit is just the maximum number of returned results. It is very strange that we do not obtain this relation...

yhamoudi commented 9 years ago

The normalization has been added to the file (using ConceptNet's normalization tool).

yhamoudi commented 9 years ago

Some progress:

So we have to find a way to extract all the edges that have an endpoint that is exactly the input concept (and not all the edges for which the input concept is a prefix of one of the two endpoints). It would be good to perform queries using the optional arguments provided by Search (especially being able to select all the edges that carry a specified relation such as RelatedTo), and it is not certain we can do this with Lookup.
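
A sketch of that filtering step, assuming edges are dicts with 'start', 'end' and 'rel' fields as in the ConceptNet JSON output (the function name and sample data are hypothetical):

```python
def edges_with_exact_endpoint(edges, concept_uri, relations):
    # Keep only edges where one endpoint is exactly the input concept
    # (not merely a URI that has the concept as a prefix) and whose
    # relation belongs to the given set, e.g. {'/r/RelatedTo'}.
    return [e for e in edges
            if e['rel'] in relations
            and concept_uri in (e['start'], e['end'])]

edges = [
    {'start': '/c/en/bagel', 'rel': '/r/RelatedTo', 'end': '/c/en/bread'},
    {'start': '/c/en/bagels_and_lox', 'rel': '/r/RelatedTo', 'end': '/c/en/food'},
]
print(edges_with_exact_endpoint(edges, '/c/en/bagel', {'/r/RelatedTo'}))
```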

Ezibenroc commented 9 years ago

If we want to use normalized_concept_name('en', 'elected') to normalize "elected", maybe we should not use the API for the nounification, but make direct calls.

yhamoudi commented 9 years ago

Why do you distinguish left and right relations (in https://github.com/ProjetPP/PPP-QuestionParsing-Grammatical/commit/64ec2cc6df2e40082f6407638f6b808bbb65e613)? Most of the relations seem to be symmetric. Instead, we should test where the input URI is and then take the other side of the edge.
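
A minimal sketch of that idea (hypothetical helper, assuming edges are dicts with 'start' and 'end' fields):

```python
def other_endpoint(edge, uri):
    # Return the endpoint of the edge that is not the input concept,
    # or None if the input concept is not an endpoint at all.
    if edge['start'] == uri:
        return edge['end']
    if edge['end'] == uri:
        return edge['start']
    return None

edge = {'start': '/c/en/elect', 'end': '/c/en/vote'}
print(other_endpoint(edge, '/c/en/elect'))  # /c/en/vote
print(other_endpoint(edge, '/c/en/vote'))   # /c/en/elect
```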

(I'll push this in a second)

yhamoudi commented 9 years ago

It seems better to use /c/bla/bli rather than /c/bla/bli/ for a URI (I got an empty set with the second notation).

yhamoudi commented 9 years ago

some of the available fields: https://github.com/commonsense/conceptnet5/wiki/Edges

yhamoudi commented 9 years ago

The weight of an edge does not seem relevant for finding the best noun.

yhamoudi commented 9 years ago

We can start thinking about how to choose the most relevant nodes. I propose the following algorithm:

The better our algorithm is, the smaller the parameter x needs to be.

yhamoudi commented 9 years ago

Now we keep only the words w that are nouns. To do this, I send the word w to the Stanford Parser and look at the POS tag (noun = NN). We could try another algorithm/library that performs only this task (and not the full parsing done by the Stanford Parser); it could be quicker (but I didn't find another tool, except NLTK).
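
The filtering step could be sketched like this (hypothetical names; pos_tag stands in for whatever tagger is plugged in, e.g. a wrapper around the Stanford Parser):

```python
def keep_nouns(candidates, pos_tag):
    # pos_tag maps a word to its POS tag; nouns are tagged 'NN',
    # 'NNS', 'NNP', ... so we keep tags starting with 'NN'.
    return [w for w in candidates if pos_tag(w).startswith('NN')]

# Stand-in tagger for demonstration:
fake_tags = {'election': 'NN', 'elect': 'VB', 'voter': 'NN'}
print(keep_nouns(['election', 'elect', 'voter'], fake_tags.get))
# ['election', 'voter']
```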

A trick to parse faster (?): concatenate all the candidates w into a string s, perform only one parse on s, and then look at each POS tag in the result.

yhamoudi commented 9 years ago

Hmm, there is still a prefix problem with the Lookup method. For instance, I perform a request on elect and I don't manage to obtain elect -DerivedFrom-> vote (which appears here: http://conceptnet5.media.mit.edu/web/c/en/elect). Instead I get relations such as: /c/en/elect/v/select_by_a_vote_for_an_office_or_membership -DerivedFrom-> /c/en/election/n/a_vote_to_select_the_winner_of_a_position_or_political_office

Even if I set the limit parameter to 1000, none of the DerivedFrom relations are good (they do not have /c/en/elect but /c/en/elect/... instead).

Ezibenroc commented 9 years ago

Instead i've relations such as : (...)

This is why I put the slash at the end.

Ezibenroc commented 9 years ago

concatenate all the candidates w into a string s, perform only one parsing on s, and then look at each POS tag in the result

The POS tag depends on the context (e.g. the word «fix» can be a noun or a verb); I think you could mess everything up if you do it like this...

yhamoudi commented 9 years ago

This is why I put the slash at the end.

? It doesn't change anything. However, according to this: https://github.com/commonsense/conceptnet5/wiki/URI-hierarchy#concept-uris it's not really a question of prefix. The question is: what method is used here http://conceptnet5.media.mit.edu/web/c/en/elect and how do we get the same result?

yhamoudi commented 9 years ago

The POS tag depends of the context (e.g. the word «fix» can be a noun or a verb), I think you can mess everything if you do like this...

Yes, but we don't have any context. Moreover, I think the POS tagger (the Stanford Parser at least) gives the tag NN in case of ambiguity. So it's the best we can do.

Ezibenroc commented 9 years ago

similarity between w and w0

I saw something in ConceptNet to get a similarity score between two words (certainly semantic similarity, not spelling). I think it was in association.

yhamoudi commented 9 years ago

I saw something in conceptnet to have a score of similarity between two words (certainly semantic similarity, not spelling). I think it was in association.

Yes, probably the same thing as the weight. But we can give it a small part in the score if necessary.

Concerning the POS tagger: it seems normal to tag a single word (i.e. a word that appears in a sentence of only one word) as NN (a one-word sentence consisting of a verb is stranger). However, if we concatenate all the words w to obtain their POS tags in a single parse, there is a risk that some of them are interpreted as verbs (e.g. elector fix vote).

I think I've understood this prefix problem. With our algorithm we obtain: /c/en/elect/v/select_by_a_vote_for_an_office_or_membership -DerivedFrom-> /c/en/voter/n/a_citizen_who_has_a_legal_right_to_vote. According to https://github.com/commonsense/conceptnet5/wiki/URI-hierarchy#concept-uris, everything after the fourth / (here: /v/select_by_a_vote_for_an_office_or_membership and /n/a_citizen_who_has_a_legal_right_to_vote) is additional optional info. If we remove it, we have elect -DerivedFrom-> voter, which appears here: http://conceptnet5.media.mit.edu/web/c/en/elect
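
Under that reading of the URI hierarchy, truncating every URI to its first three components (/c/&lt;lang&gt;/&lt;word&gt;) before comparing should fix the problem; a minimal sketch (hypothetical helper name):

```python
def truncate_concept_uri(uri):
    # Keep only /c/<language>/<word>; everything after the fourth '/'
    # (part of speech, sense disambiguation) is optional extra info.
    return '/'.join(uri.split('/')[:4])

print(truncate_concept_uri('/c/en/elect/v/select_by_a_vote_for_an_office_or_membership'))
# /c/en/elect
```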

yhamoudi commented 9 years ago

Previous thing done. Now we should really get the same result as on the demo website.

Ezibenroc commented 9 years ago

Quick test on this list of 20 verbs: ['die','born','wrote','directed','play','ran','jump','walk','hide','dive','drive','fall','climb','ride','dance','wash','cook','repair','build','fly']

Replace the end of the code by:

if __name__ == "__main__":
    for foo in ['die', 'born', 'wrote', 'directed', 'play', 'ran', 'jump', 'walk', 'hide', 'dive', 'drive', 'fall', 'climb', 'ride', 'dance', 'wash', 'cook', 'repair', 'build', 'fly']:
        word = normalize(default_language, foo)
        uri = "/c/{0}/{1}".format(default_language, word)
        print(associatedWords(uri, word, {'/r/RelatedTo', '/r/DerivedFrom'}))

Then, here is the time needed to run the script:

real  1m25.904s
user  0m3.772s
sys   0m1.612s

The huge difference between real and user+sys means that the CPU is often idle (certainly waiting for I/O in the database). 85.904s of real time means 4.3s per word, which is way too slow...
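
To spell out the arithmetic as a sanity check on the time(1) figures quoted above:

```python
# Figures from the time(1) output: real 85.904s, user 3.772s, sys 1.612s
real, user, sys_ = 85.904, 3.772, 1.612
cpu_busy_fraction = (user + sys_) / real  # fraction of wall time the CPU was working
per_word = real / 20                      # 20 words in the test list

print(round(cpu_busy_fraction, 2))  # 0.06 -> ~94% of the time is spent waiting
print(round(per_word, 2))           # 4.3 seconds per word
```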

Ezibenroc commented 9 years ago

Unless we manage to improve this time (by at least a factor of 10), this solution is not feasible, and we should keep NLTK.

yhamoudi commented 9 years ago

It's probably due to the use of the Stanford Parser; naturally we cannot keep the current part of the algorithm that calls the Stanford Parser on every single candidate. Perhaps we can also speed up ConceptNet by running our own server.

(I prefer a correct but slow algorithm to a fast algorithm with really poor results.)

yhamoudi commented 9 years ago

I'm rewriting the structure of the file conceptnet_local, please do not push big changes.

progval commented 9 years ago

I think you face the same issue as the Wikidata and HAL modules: you have twenty requests and you make them one at a time. There are two solutions:

However, both of them require you to schedule requests in advance, which is sometimes tricky to implement.
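
One way to run such I/O-bound requests concurrently (a hypothetical sketch, not the module's actual code) is a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def lookup_all(lookup, uris, workers=10):
    # Issue the (I/O-bound) lookups concurrently instead of one at a
    # time; threads are fine here since the CPU is mostly idle waiting
    # on the network or database.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lookup, uris))

# Usage with a stand-in lookup function (here just the string length):
print(lookup_all(len, ['/c/en/die', '/c/en/born']))  # [9, 10]
```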

Ezibenroc commented 9 years ago

I did more precise measures with https://github.com/ProjetPP/PPP-QuestionParsing-Grammatical/commit/f7138b856b1273455b9dd2d40da5d09f6dda37fe

Now it takes 10s to 20s to handle a single word. Unsurprisingly, all the time is spent eliminating the candidates, and in particular in the queries to the Stanford library.

yhamoudi commented 9 years ago

I increased the limit to 350 candidates, so that's expected (but it's not definitive, it was just for testing).

The longest operations are:

1. querying ConceptNet
2. performing the elimination (excluding the Stanford Parser)
3. performing POS tagging with the Stanford Parser

For step 1, we have to use a locally running server before concluding anything. It's to be expected that querying ConceptNet as we do now is slower than using a server.

For step 3, it's temporary. If it's efficient, we can re-implement the part of the Stanford Parser that performs POS tagging.

It would be strange if step 2 took such a long time.

Ezibenroc commented 9 years ago

Maybe an interesting link about POS tagging (I did not read it entirely): https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/

progval commented 9 years ago

FYI, I tried to speed it up by running the requests in parallel:

@@ -132,18 +133,27 @@ def buildCandidate(pattern,edge):
     else:
         return None
+import multiprocessing
+import uuid
+g_pattern = {}
+def f(x):
+    (foo, e) = x
+    pattern = g_pattern[foo]
+    cand = buildCandidate(pattern,e)
+    if cand != None and cand.tag != -1:
+        return cand
 def associatedWords(pattern,relations):
     uri = "/c/{0}/{1}".format(default_language,pattern)
     r = list(lookup(uri,limit=350))
     CLOCK.time_step("lookup")
     #for e in r:
     #    print(e['start'] + ' ' + e['rel'] + ' ' + e['end'])
-    res = []
-    for e in r:
-        if e['rel'] in relations:
-            cand = buildCandidate(pattern,e)
-            if cand != None and cand.tag != -1:
-                res.append(cand)
+    foo = uuid.uuid4()
+    g_pattern[foo] = pattern
+    l = [(foo, e) for e in r if e['rel'] in relations]
+    with multiprocessing.Pool(2) as p:
+        res = p.map(f, l)
+    del g_pattern[foo]
+    res = list(filter(bool, res))
     #for cand in res:
     #    print(cand.word + ' ' + str(cand.weight))
     CLOCK.time_step("buildCandidate")

But the execution time is exactly the same.

yhamoudi commented 9 years ago

I've just added a new demo file conceptnet_server.py that makes queries to conceptnet using a server.

How to use it:

I think it's quicker now (and a lot of improvements are still possible).

@ProgVal, when they say to set up a WSGI server here: https://github.com/commonsense/conceptnet5/wiki/Running-your-own-copy what do they mean by "more robust"?

progval commented 9 years ago

It means that HTTP servers written in a naive way are very inefficient and can easily be DoSed. Implementing a WSGI interface lets you provide the service through a well-written HTTP server (nginx, lighttpd, Apache (more arguably), …).
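
Concretely, the service only has to expose a WSGI callable (minimal sketch below); any production-grade server can then host it:

```python
def application(environ, start_response):
    # Minimal WSGI callable: a robust HTTP server (gunicorn, uWSGI,
    # often behind nginx) calls this once per request, instead of the
    # service rolling its own HTTP handling.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello from conceptnet\n']
```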

Ezibenroc commented 9 years ago

I think a POS tagger is not the best tool for what we want, since it is designed for context-dependent tagging.

A simple dictionary that maps each word to the set of its possible parts of speech would be quicker. Then, we would keep a word only if it can possibly be a noun.

Unfortunately, I cannot find a good multilingual dictionary providing the part of speech (it seems Aspell does not do that).
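
The lookup itself would be trivial; a sketch with a hypothetical precomputed dictionary (the entries here are only illustrative):

```python
# Hypothetical precomputed dictionary mapping words to their possible
# parts of speech (it would have to be built once per language).
pos_dict = {
    'vote':  {'noun', 'verb'},
    'elect': {'verb'},
    'voter': {'noun'},
}

def can_be_noun(word):
    # Keep the word only if 'noun' is among its possible parts of speech.
    return 'noun' in pos_dict.get(word, set())

print([w for w in ['vote', 'elect', 'voter'] if can_be_noun(w)])
# ['vote', 'voter']
```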

yhamoudi commented 9 years ago

I've just thrown a "hook" on Stack Overflow about this question. Let other people think about it...

Ezibenroc commented 9 years ago

Can you give the link?

yhamoudi commented 9 years ago

http://stackoverflow.com/questions/28033882/determining-wheter-a-word-is-a-noun-or-not

yhamoudi commented 9 years ago

Good idea: I post questions on Stack Overflow and you upvote them :)

Ezibenroc commented 9 years ago

Phase 1: all PPP members must have a high reputation on Stack Overflow. Phase 2: platypus proselytism can begin.

yhamoudi commented 9 years ago

I DO NOT WANT NLTK http://stackoverflow.com/a/28034218/3476917

yhamoudi commented 9 years ago

How many nouns can we expect in English? (And how fast is it to search the set of all nouns?)

Ezibenroc commented 9 years ago

This proposition uses NLTK only for precomputation. I find it quite good.

how many nouns can we expect in english? (and how fast it is to perform search into the set of all the nouns?)

Not only in English: we would need such a set for each supported language.

yhamoudi commented 9 years ago

Yes, it's not really about NLTK. But is it realistic to perform searches in such a big set?

yhamoudi commented 9 years ago

Did you manage to run their algorithm? I get AttributeError: 'function' object has no attribute 'split' after running the second line.

Ezibenroc commented 9 years ago

They forgot the parentheses after name: nouns = {x.name().split('.', 1)[0] for x in wn.all_synsets('n')}

yhamoudi commented 9 years ago

67176 elements

yhamoudi commented 9 years ago

It's fast, but don't we lose a lot of nouns? (Only 67176 nouns in English?)

We could do the opposite thing:

Ezibenroc commented 9 years ago

That's small: log(67k) < 17. And a lot of words seem to be in this list: of the 20 verbs I gave, only 4 are not in it. It seems OK to me.
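
In fact, with a Python set the lookup is even cheaper than a binary search: membership is an average-case O(1) hash lookup, regardless of the set's size (a toy sketch with a stand-in set):

```python
# Stand-in set; the real one would hold the 67176 nouns from WordNet.
nouns = {'bagel', 'vote', 'election'}

# Set membership is a hash lookup, O(1) on average, so even a
# 67k-entry set is cheap to query.
print('vote' in nouns)   # True
print('elect' in nouns)  # False
```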

yhamoudi commented 9 years ago

What are the words that don't occur in the set?

Ezibenroc commented 9 years ago

['wrote', 'directed', 'ran', 'build']

yhamoudi commented 9 years ago

But they're not nouns?

Ezibenroc commented 9 years ago

I don't think so. Check in a dictionary to be sure.

Ezibenroc commented 9 years ago

https://github.com/ProjetPP/PPP-QuestionParsing-Grammatical/commit/24ebdb9cf04c7b93509ee15d0dd29f1e11d2abce

Tested on 207 words. Negligible time to handle all of them. 91 of them were said to be nouns.