PacktPublishing / Python-Natural-Language-Processing-Cookbook

Python Natural Language Processing Cookbook, published by Packt
MIT License

Chapter02 Triplet Extraction Problem #4

Closed by mandanafasounaki 1 year ago

mandanafasounaki commented 1 year ago

get_verb_phrases doesn't extract "have" as the verb. I'm using the same code. PS: I had to change textacy.extract.matches to textacy.extract.matches.token_matches
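For code that has to run against either layout, a minimal sketch of a compatibility helper could look like the following. It assumes, based only on the change described above, that `extract.matches` is a plain function in older textacy releases and a module exposing `token_matches` in newer ones. `resolve_token_matcher` is a hypothetical name, and the demonstration uses stand-in namespaces rather than a real textacy install.

```python
from types import SimpleNamespace


def resolve_token_matcher(extract):
    """Return a callable(doc, patterns) from a textacy-style `extract` namespace.

    Hypothetical helper: the two layouts handled here are assumptions
    taken from the change described in this thread.
    """
    matches = getattr(extract, "matches")
    if callable(matches):
        # Older layout (assumed): `extract.matches` is itself the function.
        return matches
    # Newer layout (assumed): `extract.matches` is a module with `token_matches`.
    return matches.token_matches


# Stand-ins for the two layouts, so the sketch runs without textacy installed.
old_style = SimpleNamespace(matches=lambda doc, patterns: "old call")
new_style = SimpleNamespace(
    matches=SimpleNamespace(token_matches=lambda doc, patterns: "new call")
)

print(resolve_token_matcher(old_style)(None, []))
print(resolve_token_matcher(new_style)(None, []))
```

With textacy installed, the intended usage would be `token_matches = resolve_token_matcher(textacy.extract)`, followed by `token_matches(doc, verb_patterns)`.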

rajat-packt commented 1 year ago

Hey @zhenya-pl would you be able to help with this issue?

zhenya-pl commented 1 year ago

Most likely it's a problem with part-of-speech tagging, where in this particular instance the word "have" is not labeled as a verb. If you give the actual example, I can take a look in detail.

rajat-packt commented 1 year ago

Hey @mandanafasounaki, please share the example details here and tag zhenya-pl, so that Zhenya can help out with this issue.

johnosbb commented 1 year ago

I am also having problems with this example. I likewise had to change textacy.extract.matches to textacy.extract.matches.token_matches. But when I run the code, I get an error:

  File "/home/xxx/PythonForNLP/Python-Natural-Language-Processing-Cookbook/Chapter02/entities_and_relations.py", line 77, in <module>
    main()
  File "/home/xxx/PythonForNLP/Python-Natural-Language-Processing-Cookbook/Chapter02/entities_and_relations.py", line 72, in main
    (left_np, vp, right_np) = find_triplet(sentence)
  File "/home/xxx/PythonForNLP/Python-Natural-Language-Processing-Cookbook/Chapter02/entities_and_relations.py", line 64, in find_triplet
    verb_phrase = verb_phrases[0]
IndexError: list index out of range

verb_phrases = textacy.extract.matches.token_matches(doc, verb_patterns) returns an empty list for the sentence "Cells have organelles."

zhenya-pl commented 1 year ago

The patterns matched by the code are:

verb_patterns = [[{"POS":"AUX"}, {"POS":"VERB"}, {"POS":"ADP"}], [{"POS":"AUX"}]]

Make sure "have" fits one of these patterns (print out which part of speech is assigned to the word). If it doesn't, the code won't find any verb phrases.

If it doesn't find any verb phrases, the list is empty, and so the index 0 is out of range.
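One way to avoid the crash is to treat "no match" as a valid outcome instead of indexing blindly. The sketch below is only an illustration: `pick_verb_phrase` is a hypothetical helper, not part of the book's code, and it works on any sequence of sized items, so plain lists stand in for spaCy spans here.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def pick_verb_phrase(verb_phrases: Sequence[T]) -> Optional[T]:
    """Return the longest candidate, or None when nothing matched.

    Hypothetical helper: an empty input yields None instead of raising
    IndexError the way `verb_phrases[0]` does.
    """
    if not verb_phrases:
        return None
    # max() keeps the first item among candidates of equal length.
    return max(verb_phrases, key=len)


print(pick_verb_phrase([]))
print(pick_verb_phrase([["have"], ["are", "made", "of"]]))
```

A caller would then check for `None` and skip the sentence, or report that no verb phrase was found, before looking for the noun phrases on either side.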

zhenya-pl commented 1 year ago

The code was written 2 years ago; the models and the POS tagging have changed, hence the difference in the way the sentence is processed.
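When comparing results with the book, it can help to record which versions are actually installed. This is a small sketch using only the standard library (Python 3.8+). The distribution names passed in are assumptions about how the packages were installed, and a missing package is reported as `None`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Assumed distribution names; adjust them to match your environment.
print(installed_versions(["spacy", "textacy", "en-core-web-sm"]))
```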

johnosbb commented 1 year ago

Cells have organelles.

Text       | Index | POS   | Dep   | Dep Detail      | Ancestors | Children
-----------|-------|-------|-------|-----------------|-----------|-------------------
Cells      | 0     | NOUN  | nsubj | nominal subject | have      |
have       | 1     | VERB  | ROOT  | root            |           | Cells organelles .
organelles | 2     | NOUN  | dobj  | direct object   | have      |
.          | 3     | PUNCT | punct | punctuation     | have      |

zhenya-pl commented 1 year ago

So you need to add the pattern [{"POS":"VERB"}] to verb_patterns.

johnosbb commented 1 year ago

It is already there? verb_patterns = [[{"POS": "AUX"}, {"POS": "VERB"}, {"POS": "ADP"}], [{"POS": "AUX"}]]

zhenya-pl commented 1 year ago

No, the patterns are

[ [{"POS":"AUX"}, {"POS":"VERB"}, {"POS":"ADP"}], [{"POS":"AUX"}] ]

johnosbb commented 1 year ago

Is that not the same as what I posted?

johnosbb commented 1 year ago

This is the complete code:

import spacy
import textacy as tx
# from Chapter02.split_into_clauses import find_root_of_sentence

nlp = spacy.load('en_core_web_sm')
# sentences = ["All living things are made of cells.", "Cells have organelles."]
# sentences = ["All living things are made of cells."]
sentences = ["Cells have organelles."]

# verb_patterns = [[{"POS": "AUX"}, {"POS": "VERB"},
#                   {"POS": "ADP"}], [{"POS": "AUX"}]]

verb_patterns = [
    [{"POS": "AUX"}, {"POS": "VERB"}, {"POS": "ADP"}],
    [{"POS": "AUX"}]
]

def contains_root(verb_phrase, root):
    # Span.end is exclusive, so the root index must be strictly below it.
    return verb_phrase.start <= root.i < verb_phrase.end

def find_root_of_sentence(doc):
    root_token = None
    for token in doc:
        if (token.dep_ == "ROOT"):
            root_token = token
    return root_token

def get_verb_phrases(doc):
    root = find_root_of_sentence(doc)
    verb_phrases = tx.extract.matches.token_matches(doc, verb_patterns)
    new_vps = []
    for verb_phrase in verb_phrases:
        if (contains_root(verb_phrase, root)):
            new_vps.append(verb_phrase)
    return new_vps

def longer_verb_phrase(verb_phrases):
    longest_length = 0
    longest_verb_phrase = None
    for verb_phrase in verb_phrases:
        if len(verb_phrase) > longest_length:
            # Track the best length so far; otherwise the last phrase always wins.
            longest_length = len(verb_phrase)
            longest_verb_phrase = verb_phrase
    return longest_verb_phrase

def find_noun_phrase(verb_phrase, noun_phrases, side):
    for noun_phrase in noun_phrases:
        if (side == "left" and noun_phrase.start < verb_phrase.start):
            return noun_phrase
        elif (side == "right" and noun_phrase.start > verb_phrase.start):
            return noun_phrase

def find_triplet(sentence):
    doc = nlp(sentence)
    verb_phrases = list(get_verb_phrases(doc))
    noun_phrases = doc.noun_chunks
    verb_phrase = None
    if (len(verb_phrases) > 1):
        verb_phrase = longer_verb_phrase(list(verb_phrases))
    else:
        verb_phrase = verb_phrases[0]
    left_noun_phrase = find_noun_phrase(verb_phrase, noun_phrases, "left")
    right_noun_phrase = find_noun_phrase(verb_phrase, noun_phrases, "right")
    return (left_noun_phrase, verb_phrase, right_noun_phrase)

def main():
    for sentence in sentences:
        (left_np, vp, right_np) = find_triplet(sentence)
        print(left_np, "\t", vp, "\t", right_np)

if (__name__ == "__main__"):
    main()

    verb_phrase = verb_phrases[0]
IndexError: list index out of range

zhenya-pl commented 1 year ago

The pattern of just VERB is not in the list verb_patterns.

verb_patterns is a list of lists. There are two lists that will match verb phrases:

[{"POS":"AUX"}, {"POS":"VERB"}, {"POS":"ADP"}]

and

[{"POS":"AUX"}]

The first pattern will match three consecutive tokens: an AUX, followed by a VERB, followed by an ADP (an adposition, such as a preposition).

The second pattern will match one single AUX.

You need to add the pattern [{"POS":"VERB"}] to the list.

So:

verb_patterns = [[{"POS":"AUX"}, {"POS":"VERB"}, {"POS":"ADP"}], [{"POS":"AUX"}], [{"POS":"VERB"}]]
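To see why the extra pattern matters, here is a deliberately simplified stand-in for the matcher, written in plain Python. It is not spaCy's or textacy's implementation: `match_pos_patterns` is a hypothetical function that only understands the "POS" key and contiguous tokens. The tags are the ones from the table posted earlier in this thread.

```python
def match_pos_patterns(tagged_tokens, patterns):
    """Return (start, end) index pairs where a whole pattern matches.

    Simplified illustration only: each pattern step must contain a "POS"
    key, and `end` is exclusive, mirroring how spans are indexed.
    """
    spans = []
    for pattern in patterns:
        width = len(pattern)
        for start in range(len(tagged_tokens) - width + 1):
            window = tagged_tokens[start:start + width]
            if all(pos == step["POS"] for (_, pos), step in zip(window, pattern)):
                spans.append((start, start + width))
    return spans


# POS tags as reported above for "Cells have organelles."
tagged = [("Cells", "NOUN"), ("have", "VERB"), ("organelles", "NOUN"), (".", "PUNCT")]

original = [[{"POS": "AUX"}, {"POS": "VERB"}, {"POS": "ADP"}], [{"POS": "AUX"}]]
extended = original + [[{"POS": "VERB"}]]

print(match_pos_patterns(tagged, original))  # []
print(match_pos_patterns(tagged, extended))  # [(1, 2)]
```

Neither original pattern can match because the sentence contains no AUX token, while the single-VERB pattern matches "have" at index 1.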

johnosbb commented 1 year ago

verb_patterns = [
    [{"POS": "AUX"}, {"POS": "VERB"}, {"POS": "ADP"}],
    [{"POS": "VERB"}]
]

This works.

johnosbb commented 1 year ago

Thank you for your help. I am enjoying your book!

zhenya-pl commented 1 year ago

No problem, glad you're enjoying it!

mandanafasounaki commented 1 year ago

Thank you for your help.