Closed yhamoudi closed 9 years ago
First remarks:

- how does the parameter `limit` work (and how can we get the same result as in the online demo)? For instance, it seems that `elector` is RelatedTo `elect` (http://conceptnet5.media.mit.edu/web/c/en/elect). I tried to obtain it: I removed `if w['score']/r['maxScore']>=0.5` and increased `limit`, but `elector` is never output. (by @Ezibenroc)
> actually, if i'm not mistaken, the algorithm (...)

Good point. So we would need to cut the search in two parts: one where the searched word is on the left-hand side, and one where it is on the right-hand side, depending on the relation. But our code does not do lemmatization: `bagel` returns 317 results whereas `bagels` returns 0 results. I don't know how they do this lemmatization.
When installing conceptnet on my computer, I saw a dependency on NLTK. Maybe they use its lemmatizer (which returns very good results).
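To illustrate why lemmatization matters here (this is not ConceptNet's actual algorithm, just a toy stand-in), even a crude suffix-stripping normalizer would map `bagels` back to the concept `bagel`:

```python
def naive_lemmatize(word):
    # Very crude plural stripping, for illustration only; a real
    # normalizer (NLTK's WordNetLemmatizer, or conceptnet's own
    # normalizing tool) handles far more cases correctly.
    if word.endswith('ies') and len(word) > 4:
        return word[:-3] + 'y'
    if word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word

print(naive_lemmatize('bagels'))   # -> 'bagel'
print(naive_lemmatize('parties'))  # -> 'party'
```

Querying with the lemmatized form would then hit the same concept that `bagel` does, instead of returning 0 results.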
> how does the parameter `limit` works (...)

`limit` is just the maximum number of returned results. It is very strange that we do not obtain this relation...
The normalization has been added to the file (using the normalizing tool of conceptnet).
Some progress:

`elect` is represented by the URI `/c/en/elect`, and the relation "related to" by the URI `/r/RelatedTo`. When you enter a word into their website (`elected` for instance) it is mapped to an existing concept (`elect`) by normalization (mostly lemmatization + stemming). The distinction between words (strings) and URIs is important because some tools provided by conceptnet take a word as input, and others take a URI (replacing `api='http://conceptnet5.media.mit.edu/data/5.2/search'` by `api='http://conceptnet5.media.mit.edu/data/5.2'` in our code doesn't work). Moreover, according to the documentation there are only 3 additional arguments available to get info with this method (`limit`, `offset`, `filter`).

`text`: I give a word as input (`elect` for instance) and then conceptnet returns all the edges with an endpoint (or a relation) starting with `elect`. This is why the edge `elect -RelatedTo-> elector` is difficult to obtain: conceptnet considers that `power -RelatedTo-> electricity` is more relevant, for instance. So we have to find a way to extract all the edges that contain an endpoint that is exactly the input concept (and not all the edges for which the input concept is a prefix of one of the 2 endpoints). It would be good to perform queries using the optional arguments provided by Search (especially being able to select all the edges that contain a specified relation such as `RelatedTo`), and I'm not sure we can do this with Lookup.
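For reference, a sketch of how such a filtered Search query could be built. The parameter names `start`, `rel` and `limit` are taken from the ConceptNet 5 wiki; treat them as assumptions to double-check against the version of the API we actually run:

```python
from urllib.parse import urlencode

# Assumed Search endpoint of the public 5.2 API (see discussion above).
API = 'http://conceptnet5.media.mit.edu/data/5.2/search'

def search_url(start_uri, rel_uri, limit=50):
    # Restrict the query to edges whose start concept and relation match
    # exactly, instead of the prefix matching we get from Lookup.
    params = urlencode({'start': start_uri, 'rel': rel_uri, 'limit': limit})
    return '{0}?{1}'.format(API, params)

print(search_url('/c/en/elect', '/r/RelatedTo', 10))
```

If Search honors these filters, this would directly give edges like `elect -RelatedTo-> elector` without wading through more "relevant" edges first.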
If we want to use `normalized_concept_name('en', 'elected')` to normalize "elected", maybe we should not use the API for the nounification, but do direct calls.
Why do you distinguish left and right relations (in https://github.com/ProjetPP/PPP-QuestionParsing-Grammatical/commit/64ec2cc6df2e40082f6407638f6b808bbb65e613)? Most of the relations seem to be reflexive. Instead, we should test where the input URI is and then take the other side of the edge.
(I'll push this in a second.)
It seems to be better to use `/c/bla/bli` than `/c/bla/bli/` for a URI (I got an empty set with the second notation).
some of the available fields: https://github.com/commonsense/conceptnet5/wiki/Edges
The weight of an edge does not seem relevant for finding the best noun.
We can start thinking about how to choose the most relevant nodes. I propose the following algorithm:

- take the `x` first edges with a relation in {RelatedTo, DerivedFrom, ...} (we have to determine the most relevant relations). For each edge, we denote by `w` the interesting word (i.e. the node that is different from the input word `w0`)
- remove the edges for which `w` is not in English (for the moment)
- remove the edges for which `w` is not a noun
- compute a score for each remaining `w`. This score has to be a mix of several sub-scores:
  - the similarity between `w` and `w0` (for instance `similarity(elect, elector) > similarity(elect, vote)`). We can use the Levenshtein distance to do this. For the moment, I use a function from the difflib library (I don't know what algorithm it uses)
  - whether `w` is used in "the language of everyday life"

The better our algorithm is, the smaller the parameter `x` has to be.
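For the similarity sub-score, the difflib-based version could look like the sketch below. Note that difflib's `SequenceMatcher.ratio()` implements Ratcliff/Obershelp matching, not Levenshtein distance, but both give the ordering we want here:

```python
import difflib

def similarity(w, w0):
    # Ratio in [0, 1]: 2*M/T, where M is the number of matched characters
    # and T the total length of both strings (Ratcliff/Obershelp).
    return difflib.SequenceMatcher(None, w, w0).ratio()

# The ordering we want for the nounification of 'elect':
print(similarity('elect', 'elector'))  # high: shares the whole stem
print(similarity('elect', 'vote'))     # low: few characters in common
```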
Now we keep only the words `w` that are nouns. To do this, I send the word `w` to the Stanford Parser and look at the POS tag (noun = NN). We can try to use another algorithm/library that performs only this task (and not all the parsing done by the Stanford parser); it could be quicker (but I didn't find another tool, except nltk).
Trick to parse faster (?): concatenate all the candidates `w` into a string `s`, perform only one parsing on `s`, and then look at each POS tag in the result.
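The batching trick might look like this sketch. Here `pos_tag` is a stand-in for the real tagger (Stanford parser or nltk), stubbed with a tiny hypothetical lexicon so the example is self-contained:

```python
def pos_tag(tokens):
    # Stand-in for the real tagger (Stanford parser, nltk.pos_tag, ...);
    # this hard-coded lexicon exists only to make the sketch runnable.
    lexicon = {'elector': 'NN', 'election': 'NN', 'vote': 'NN', 'fix': 'VB'}
    return [(t, lexicon.get(t, 'NN')) for t in tokens]

def nouns_among(candidates):
    # The trick: one tagger call on all candidates at once, instead of
    # one call per word. Caveat: real taggers are context-sensitive, so
    # concatenating unrelated words can change their tags.
    tagged = pos_tag(candidates)
    return [word for (word, tag) in tagged if tag.startswith('NN')]

print(nouns_among(['elector', 'fix', 'vote']))  # -> ['elector', 'vote']
```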
Hum, there is still a problem of prefix with the Lookup method. For instance, I perform a request on `elect` and I don't manage to obtain `elect -DerivedFrom-> vote` (which appears here: http://conceptnet5.media.mit.edu/web/c/en/elect). Instead I get relations such as: `/c/en/elect/v/select_by_a_vote_for_an_office_or_membership -DerivedFrom-> /c/en/election/n/a_vote_to_select_the_winner_of_a_position_or_political_office`

Even if I set the `limit` parameter to 1000, none of the `DerivedFrom` relations are good (they do not have `/c/en/elect` but `/c/en/elec/...` instead).
> Instead i've relations such as : (...)

This is why I put the slash at the end.
> concatenate all the candidates w into a string s, perform only one parsing on s, and then look at each POS tag in the result

The POS tag depends on the context (e.g. the word «fix» can be a noun or a verb); I think you can mess everything up if you do it like this...
> This is why I put the slash at the end.

? It doesn't change anything. However, according to this: https://github.com/commonsense/conceptnet5/wiki/URI-hierarchy#concept-uris it's not really a question of prefix. The question is: what is the method used here http://conceptnet5.media.mit.edu/web/c/en/elect and how to get the same result?
> The POS tag depends of the context (e.g. the word «fix» can be a noun or a verb), I think you can mess everything if you do like this...

Yes, but we don't have any context. Moreover, I think that the POS tagger (the Stanford parser at least) gives the tag `NN` in case of ambiguity. So it's the best we can do.
> similarity between w and w0

I saw something in conceptnet to compute a similarity score between two words (certainly semantic similarity, not spelling). I think it was in `association`.
> I saw something in conceptnet to have a score of similarity between two words (certainly semantic similarity, not spelling). I think it was in `association`.

Yes, probably the same thing as the weight. But we can give it a small part in the score if necessary.
Concerning the POS tagger: it seems normal to tag a single word (i.e. a word that appears in a sentence of only 1 word) as NN (a one-word sentence consisting of a verb is stranger). However, if we concatenate all the words `w` to obtain their POS tags in only one parsing, there is a risk that some of them are interpreted as verbs (e.g. `elector fix vote`).
I think I've understood this problem of prefix. With our algo we obtain: `/c/en/elect/v/select_by_a_vote_for_an_office_or_membership -DerivedFrom-> /c/en/voter/n/a_citizen_who_has_a_legal_right_to_vote`

According to https://github.com/commonsense/conceptnet5/wiki/URI-hierarchy#concept-uris, everything after the fourth `/` (here: `/v/select_by_a_vote_for_an_office_or_membership` and `/n/a_citizen_who_has_a_legal_right_to_vote`) is additional optional info. If we remove it we have `elect -DerivedFrom-> voter`, which appears here: http://conceptnet5.media.mit.edu/web/c/en/elect
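Stripping the optional part is a one-liner; a minimal sketch (the function name `concept_root` is ours, not ConceptNet's):

```python
def concept_root(uri):
    # Keep only '/c/<lang>/<word>', dropping the optional part-of-speech
    # and word-sense components described in the URI-hierarchy wiki page.
    return '/'.join(uri.split('/')[:4])

print(concept_root('/c/en/voter/n/a_citizen_who_has_a_legal_right_to_vote'))
# -> /c/en/voter
```

Comparing truncated URIs lets us check that an edge's endpoint is exactly the input concept, not merely prefixed by it.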
Previous thing done. Now we should really have the same result as on the demo website.
Quick test on this list of 20 verbs: ['die','born','wrote','directed','play','ran','jump','walk','hide','dive','drive','fall','climb','ride','dance','wash','cook','repair','build','fly']
Replace the end of the code by:

```python
if __name__ == "__main__":
    for foo in ['die','born','wrote','directed','play','ran','jump','walk','hide','dive','drive','fall','climb','ride','dance','wash','cook','repair','build','fly']:
        word = normalize(default_language, foo)
        uri = "/c/{0}/{1}".format(default_language, word)
        print(associatedWords(uri, word, {'/r/RelatedTo', '/r/DerivedFrom'}))
```

Then, running the script takes:

```
real	1m25.904s
user	0m3.772s
sys	0m1.612s
```
The huge difference between real and user+sys means that the CPU is often idle (certainly waiting for I/O in the database). 85.904s of real time means 4.3s per word, which is way too slow...
Unless we manage to improve this time (by at least a factor of 10), this solution is not feasible, and we should keep NLTK.
It's probably due to the use of the Stanford parser; naturally, we cannot keep the current part of the algo that calls the Stanford parser on every single candidate. Perhaps we can also speed up conceptnet by running our own server.
(I prefer a correct but slow algorithm to a fast one with really poor results.)
I'm rewriting the structure of the file `conceptnet_local`, please do not push big changes.
I think you face the same issue as the Wikidata and HAL modules: you have twenty requests and you make them one at a time. There are two solutions:
However, both of them require you to schedule requests in advance, which is sometimes tricky to implement.
I did more precise measures with https://github.com/ProjetPP/PPP-QuestionParsing-Grammatical/commit/f7138b856b1273455b9dd2d40da5d09f6dda37fe
Now it takes 10s to 20s to handle a single word. Unsurprisingly, all the time is spent eliminating the candidates, in particular in the queries to the Stanford library.
I increased the limit to 350 candidates, so it's expected (but it's not definitive; it was just for testing).
The longest operations are:
1. querying conceptnet
2. performing the elimination (except the Stanford parser part)
3. performing POS tagging with the Stanford parser

For step 1, we have to use a locally running server before saying anything. It is expected that querying conceptnet as we do now is slower than using a server. Step 3 is temporary: if it's efficient, we can re-implement the part of the Stanford parser that performs POS tagging. It would be strange for step 2 to take so long.
Maybe an interesting link about POS tagging (I did not read it entirely): https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/
FYI, I tried speeding it up by running requests in parallel:
```diff
@@ -132,18 +133,27 @@ def buildCandidate(pattern,edge):
     else:
         return None
 
+import multiprocessing
+import uuid
+g_pattern = {}
+def f(x):
+    (foo, e) = x
+    pattern = g_pattern[foo]
+    cand = buildCandidate(pattern,e)
+    if cand != None and cand.tag != -1:
+        return cand
 def associatedWords(pattern,relations):
     uri = "/c/{0}/{1}".format(default_language,pattern)
     r = list(lookup(uri,limit=350))
     CLOCK.time_step("lookup")
     #for e in r:
     #    print(e['start'] + ' ' + e['rel'] + ' ' + e['end'])
-    res = []
-    for e in r:
-        if e['rel'] in relations:
-            cand = buildCandidate(pattern,e)
-            if cand != None and cand.tag != -1:
-                res.append(cand)
+    foo = uuid.uuid4()
+    g_pattern[foo] = pattern
+    l = [(foo, e) for e in r if e['rel'] in relations]
+    with multiprocessing.Pool(2) as p:
+        res = p.map(f, l)
+    del g_pattern[foo]
+    res = list(filter(bool, res))
     #for cand in res:
     #    print(cand.word + ' ' + str(cand.weight))
     CLOCK.time_step("buildCandidate")
```
But the execution time is exactly the same.
I've just added a new demo file `conceptnet_server.py` that makes queries to conceptnet using a server. How to use it:

```
python3 -m conceptnet5.api
./conceptnet_server.py banana
```

I think it's quicker now (and still a lot of improvements are possible).
@ProgVal when they say to set up a WSGI server here: https://github.com/commonsense/conceptnet5/wiki/Running-your-own-copy what does it mean to be "more robust"?

It means that HTTP servers written in a naive way are very inefficient and can easily be DoSed. Implementing a WSGI interface allows you to provide the service through a well-written HTTP server (nginx, lighttpd, Apache (more arguably), ...).
I think a POS tagger is not the best tool for what we want, since it is meant for context-dependent tagging.
A simple dictionary mapping each word to the set of its possible parts of speech would be quicker. Then, we would keep a word only if it can possibly be a noun.
Unfortunately, I cannot find a good multilingual dictionary providing parts of speech (it seems Aspell does not do that).
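A sketch of that dictionary approach, with a tiny hard-coded lexicon standing in for the real multilingual resource we would need:

```python
# Hypothetical lexicon: word -> set of possible parts of speech.
# In practice this would be precomputed from a real resource
# (e.g. WordNet for English), one such table per supported language.
LEXICON = {
    'fix':     {'noun', 'verb'},
    'elector': {'noun'},
    'quickly': {'adverb'},
}

def may_be_noun(word):
    # Context-free test: keep the word if 'noun' is among its possible POS.
    return 'noun' in LEXICON.get(word, set())

print([w for w in ['fix', 'elector', 'quickly'] if may_be_noun(w)])
# -> ['fix', 'elector']
```

Unlike a tagger, this never mislabels an ambiguous word like «fix»: it keeps every word that *can* be a noun, which is exactly the filter we want.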
I've just thrown a "hook" on stackoverflow about this question. Let's let other people think about it...
Can you give the link?
Good idea: I post questions on stackoverflow and you upvote them :)
Phase1: all PPP members must have a high reputation on Stackoverflow. Phase2: platypus proselytism can begin.
I DO NOT WANT NLTK http://stackoverflow.com/a/28034218/3476917
How many nouns can we expect in English? (And how fast is it to search in the set of all the nouns?)
This proposition uses NLTK only for precomputing purposes. I find it quite good.

> how many nouns can we expect in english? (and how fast it is to perform search into the set of all the nouns?)

Not only in English. We would need such a set for each supported language.
Yes, it's not really about NLTK. But is it realistic to perform searches in such a big set?
Did you manage to run their algorithm? I obtain `AttributeError: 'function' object has no attribute 'split'` after running the second line.
He/she forgot parentheses after `name`:

```python
from nltk.corpus import wordnet as wn  # needed import
nouns = {x.name().split('.', 1)[0] for x in wn.all_synsets('n')}
```
67176 elements
It's fast, but don't we lose a lot of nouns? (Only 67176 nouns in English?)
We could do the opposite thing:
That's small: log2(67k) < 17. And a lot of words seem to be in this list: of the 20 verbs I gave, only 4 are not in the list. It seems OK to me.
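A quick sanity check of those numbers: even the binary-search bound is tiny, and a Python `set` gives average O(1) membership anyway, so the 67k-word set is not a performance concern.

```python
import math

N = 67176  # number of noun lemmas extracted from WordNet above

# Worst-case comparisons for a binary search over a sorted list of N words:
print(math.ceil(math.log2(N)))  # -> 17

# A hash set makes the lookup O(1) on average, regardless of N
# (stand-in set here; the real one is built from wn.all_synsets('n')):
nouns = {'elector', 'vote', 'election'}
print('elector' in nouns)  # -> True
```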
what are the words that don't occur in the set?
['wrote', 'directed', 'ran', 'build']
But those aren't nouns, are they?

I don't think so. Check in a dictionary to be sure.
Tested on 207 words. Negligible time to handle all of them. 91 of them were said to be nouns.
How to perform nounification using ConceptNet5