Closed mandanafasounaki closed 1 year ago
Hey @zhenya-pl, would you be able to help with this issue?
Most likely it's a problem with part-of-speech tagging, where in this particular instance the word "have" is not labeled as a verb. If you give the actual example, I can take a look in detail.
Hey @mandanafasounaki, please share the example details here and tag zhenya-pl, so that Zhenya can help out with this issue.
I am having problems with this example too. I also had to change textacy.extract.matches to textacy.extract.matches.token_matches. But when I run the code, I get an error:
File "/home/xxx/PythonForNLP/Python-Natural-Language-Processing-Cookbook/Chapter02/entities_and_relations.py", line 77, in
verb_phrases = textacy.extract.matches.token_matches(doc, verb_patterns) returns an empty list for the sentence "Cells have organelles."
The patterns matched by the code are:
verb_patterns = [[{"POS":"AUX"}, {"POS":"VERB"}, {"POS":"ADP"}], [{"POS":"AUX"}]]
Make sure "have" fits one of these patterns (print out which part of speech is assigned to the word). If it doesn't, the code won't find any verb phrases.
If it doesn't find any verb phrases, the list is empty, and so the index 0 is out of range.
The code was written 2 years ago; the models and the POS tagging have changed, hence the difference in the way the sentence is processed.
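To make the failure mode above less abrupt, the script could check for an empty match list before indexing into it. Here is a minimal sketch, assuming a hypothetical helper `pick_verb_phrase` that is not part of the book's code. It only illustrates the guard and uses plain lists in place of spaCy spans:

```python
def pick_verb_phrase(verb_phrases):
    """Return the longest candidate, or None when nothing matched.

    Hypothetical helper for illustration only. `verb_phrases` can be any
    list of sized objects (spaCy spans in the script, plain lists here).
    """
    if not verb_phrases:
        # No pattern matched, so there is no element 0 to index.
        return None
    return max(verb_phrases, key=len)


print(pick_verb_phrase([]))
print(pick_verb_phrase([["are"], ["are", "made", "of"]]))
```

A caller would then need to handle the `None` case explicitly, for example by skipping the sentence instead of crashing with an IndexError.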
Cells have organelles.
Text       | Index | POS   | Dep   | Dep Detail      | Ancestors | Children
-----------+-------+-------+-------+-----------------+-----------+-------------------
Cells      | 0     | NOUN  | nsubj | nominal subject | have      |
have       | 1     | VERB  | ROOT  | root            |           | Cells organelles .
organelles | 2     | NOUN  | dobj  | direct object   | have      |
.          | 3     | PUNCT | punct | punctuation     | have      |
So you need to add the pattern [{"POS":"VERB"}] to verb_patterns.
It is already there? verb_patterns = [[{"POS": "AUX"}, {"POS": "VERB"}, {"POS": "ADP"}], [{"POS": "AUX"}]]
No, the patterns are
[ [{"POS":"AUX"}, {"POS":"VERB"}, {"POS":"ADP"}], [{"POS":"AUX"}] ]
Is that not the same as what I posted?
This is the complete code:
import spacy
import textacy as tx
# from Chapter02.split_into_clauses import find_root_of_sentence

nlp = spacy.load('en_core_web_sm')

# sentences = ["All living things are made of cells.", "Cells have organelles."]
# sentences = ["All living things are made of cells."]
sentences = ["Cells have organelles."]

# verb_patterns = [[{"POS": "AUX"}, {"POS": "VERB"},
#                   {"POS": "ADP"}], [{"POS": "AUX"}]]
verb_patterns = [
    [{"POS": "AUX"}, {"POS": "VERB"}, {"POS": "ADP"}],
    [{"POS": "AUX"}]
]

def contains_root(verb_phrase, root):
    vp_start = verb_phrase.start
    vp_end = verb_phrase.end
    if (root.i >= vp_start and root.i <= vp_end):
        return True
    else:
        return False

def find_root_of_sentence(doc):
    root_token = None
    for token in doc:
        if (token.dep_ == "ROOT"):
            root_token = token
    return root_token

def get_verb_phrases(doc):
    root = find_root_of_sentence(doc)
    verb_phrases = tx.extract.matches.token_matches(doc, verb_patterns)
    new_vps = []
    for verb_phrase in verb_phrases:
        if (contains_root(verb_phrase, root)):
            new_vps.append(verb_phrase)
    return new_vps

def longer_verb_phrase(verb_phrases):
    longest_length = 0
    longest_verb_phrase = None
    for verb_phrase in verb_phrases:
        if len(verb_phrase) > longest_length:
            longest_verb_phrase = verb_phrase
    return longest_verb_phrase

def find_noun_phrase(verb_phrase, noun_phrases, side):
    for noun_phrase in noun_phrases:
        if (side == "left" and noun_phrase.start < verb_phrase.start):
            return noun_phrase
        elif (side == "right" and noun_phrase.start > verb_phrase.start):
            return noun_phrase

def find_triplet(sentence):
    doc = nlp(sentence)
    verb_phrases = list(get_verb_phrases(doc))
    noun_phrases = doc.noun_chunks
    verb_phrase = None
    if (len(verb_phrases) > 1):
        verb_phrase = longer_verb_phrase(list(verb_phrases))
    else:
        verb_phrase = verb_phrases[0]
    left_noun_phrase = find_noun_phrase(verb_phrase, noun_phrases, "left")
    right_noun_phrase = find_noun_phrase(verb_phrase, noun_phrases, "right")
    return (left_noun_phrase, verb_phrase, right_noun_phrase)

def main():
    for sentence in sentences:
        (left_np, vp, right_np) = find_triplet(sentence)
        print(left_np, "\t", vp, "\t", right_np)

if (__name__ == "__main__"):
    main()
verb_phrase = verb_phrases[0]
IndexError: list index out of range
The pattern of just VERB is not in the list verb_patterns.
verb_patterns is a list of lists. There are two lists that will match verb phrases:
[{"POS":"AUX"}, {"POS":"VERB"}, {"POS":"ADP"}]
and
[{"POS":"AUX"}]
The first pattern will match three words in a row: AUX followed by VERB followed by ADP (an adposition, such as a preposition).
The second pattern will match one single AUX.
You need to add the pattern [{"POS":"VERB"}] to the list.
So:
verb_patterns = [[{"POS":"AUX"}, {"POS":"VERB"}, {"POS":"ADP"}], [{"POS":"AUX"}], [{"POS":"VERB"}]]
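For anyone who wants to check this without downloading a model, here is a small sketch that uses spaCy's own `Matcher` directly (as far as I know, textacy's `token_matches` builds on it). It assumes spaCy v3 and hand-assigns the POS tags from the token table posted earlier in this thread, so it does not run the real tagger. The `match_texts` helper is hypothetical and exists only for this illustration:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Blank pipeline: provides a vocab only, so no model download is needed.
nlp = spacy.blank("en")

# POS tags are assigned by hand, copied from the token table in this thread.
doc = Doc(
    nlp.vocab,
    words=["Cells", "have", "organelles", "."],
    pos=["NOUN", "VERB", "NOUN", "PUNCT"],
)

old_patterns = [
    [{"POS": "AUX"}, {"POS": "VERB"}, {"POS": "ADP"}],
    [{"POS": "AUX"}],
]
# Same patterns plus the single-VERB pattern suggested above.
new_patterns = old_patterns + [[{"POS": "VERB"}]]


def match_texts(patterns):
    """Return the text of every span in `doc` matched by `patterns`."""
    matcher = Matcher(nlp.vocab)
    matcher.add("VERB_PHRASE", patterns)
    return [span.text for span in matcher(doc, as_spans=True)]


print(match_texts(old_patterns))
print(match_texts(new_patterns))
```

With these hand-set tags, the original two patterns should match nothing, while the extended list should match "have". With a real model the tags may differ, so printing `token.pos_` for each token is still the first thing to check.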
verb_patterns = [
    [{"POS": "AUX"}, {"POS": "VERB"}, {"POS": "ADP"}],
    [{"POS": "VERB"}]
]
This works
Thank you for your help. I am enjoying your book!
No problem, glad you're enjoying it!
Thank you for your help.
get_verb_phrases doesn't extract "have" as the verb. I'm using the same code. PS: I had to change textacy.extract.matches to textacy.extract.matches.token_matches.