chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

subject_verb_object_triples Enhancement #198

Closed The-Gupta closed 3 years ago

The-Gupta commented 6 years ago

Could textacy resolve references (maybe coreferences too) and include the adjectives of subjects and objects, and the adverbs of verbs, in the extracted triples?

from __future__ import absolute_import, unicode_literals
from textacy import cache, extract

spacy_lang = cache.load_spacy('en')

text = """
"Sam and his fat friend are nice men that didn't hurt my child and sister"
"""

spacy_doc = spacy_lang(text.strip())

[', '.join(item.text for item in triple) for triple in \
 extract.subject_verb_object_triples(spacy_doc)]
Out[33]: 
['Sam, are, men',
 'friend, are, men',
 "that, didn't hurt, child",
 "that, didn't hurt, sister"]

EXPECTED OUTPUT: ['Sam, are, men', "Sam's fat friend, are, men", "Sam and his fat friend, didn't hurt, my child", "Sam and his fat friend, didn't hurt, my sister"]

EVEN BETTER, THOUGH I DON'T KNOW IF IT'S EASILY POSSIBLE: ['Sam, is, man', "Sam's fat friend, is, man", "Sam and his fat friend, didn't hurt, my child", "Sam and his fat friend, didn't hurt, my sister"]

bdewilde commented 3 years ago

Hi @The-Gupta, shame on me for never responding to this... 🤦‍♂️ You're probably long past this issue, but for what it's worth, SVO triple extraction has been reworked and improved in the latest v0.11 release: https://github.com/chartbeat-labs/textacy/releases/tag/0.11.0

Here are the cases (as tests) that the new method is expected to work for: https://github.com/chartbeat-labs/textacy/blob/master/tests/extract/test_triples.py#L30-L118. It doesn't include adjectives or adverbs, but it does more consistently include companion tokens such as verb negations and auxiliaries, plus noun compounds and conjuncts.
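For anyone who still wants the modifiers, one possible workaround is to post-process each token in a triple and pull in its direct dependency children. The following is only a rough sketch, not part of textacy: `expand_with_modifiers` is a hypothetical helper, and `ToyToken` is a hand-built stand-in that exposes just the attributes the helper reads (`text`, `i`, `dep_`, `children`), which are also attributes of spaCy's `Token`. The dependency labels in the demo are written by hand for illustration; a real parse may label things differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyToken:
    """Hypothetical stand-in exposing only the spaCy Token attributes used below."""
    text: str
    i: int      # position of the token in the sentence
    dep_: str   # dependency label relative to the token's head
    children: List["ToyToken"] = field(default_factory=list)


def expand_with_modifiers(token, deps=("amod", "poss", "compound", "advmod", "neg")):
    """Join a token with its direct children whose label is in `deps`, in sentence order."""
    keep = [token] + [child for child in token.children if child.dep_ in deps]
    return " ".join(tok.text for tok in sorted(keep, key=lambda tok: tok.i))


# Hand-labelled fragments of the sentence from the issue (assumed labels, not parser output).
his = ToyToken("his", 2, "poss")
fat = ToyToken("fat", 3, "amod")
friend = ToyToken("friend", 4, "conj", children=[his, fat])
my = ToyToken("my", 12, "poss")
child = ToyToken("child", 13, "dobj", children=[my])

print(expand_with_modifiers(friend))  # his fat friend
print(expand_with_modifiers(child))   # my child
```

Because the helper only looks at direct children, it gives "his fat friend" rather than "Sam's fat friend", and it does nothing about "that" standing in for "Sam and his fat friend". Both of those need coreference resolution, which is a separate problem from expanding modifiers.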