chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

subject_verb_object_triples works on sample text, but not a more complicated example #317

Closed: cj2001 closed this issue 3 years ago

cj2001 commented 3 years ago

I have tried a sample text that I have seen in this repo to extract SVO triples:

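import spacy
import textacy.extract
# The model is not named in the original post; en_core_web_sm is an assumption.
nlp = spacy.load('en_core_web_sm')
text_str = u'Startup companies create jobs and support innovation. Hilary supports entrepreneurship.'
text = nlp(text_str)
text_ext = textacy.extract.subject_verb_object_triples(text)
list(text_ext)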

This runs just fine for me and produces the expected result:

[(companies, create, jobs), (Hilary, supports, entrepreneurship)]

However, I then try to do something a bit more complicated:

text_ada = (u'Ada Lovelace was an English mathematician and' 
            ' writer, chiefly known for her work on'
            ' mechanical general-purpose computer, the'
            ' Analytical Engine. She was the first to'
            ' recognise that the machine had applications'
            ' beyond pure calculation, and published the'
            ' first algorithm intended to be carried out' 
            ' by such a machine. As a result, she is'
            ' sometimes regarded as the first to recognise'
            ' the full potential of a computing machine and'
            ' one of the first computer programmers.')

So I run through the same procedure:

ada = nlp(text_ada)
text_ext = textacy.extract.subject_verb_object_triples(ada)
list(text_ext)

but I get an empty list. I also tried just the first sentence of this longer string and got the same result.

Any thoughts?

bdewilde commented 3 years ago

Hi @cj2001, the extraction of SVO triples uses a fairly simple, rule-based approach and depends on the dependency annotations provided by spaCy's language models, so accuracy on any individual example is impossible to guarantee. That said, I recently re-implemented this in a way that catches more cases and, subjectively, does a better job of extracting identifiable triples. Here's the new function's test suite showing the various cases it handles: https://github.com/chartbeat-labs/textacy/blob/develop/tests/extract/test_triples.py#L30-L118

I'll be releasing a new version of textacy soon-ish, and the updated SVO triple functionality will be included in that. In the meantime, please feel free to check out the improved function from the develop branch!
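
To make the point about rule-based extraction concrete, here is a minimal sketch of how a naive subject/verb/object rule can succeed on a simple transitive sentence and return nothing for a copular one. This is not textacy's implementation. The `Tok` type and `naive_svo` function are hypothetical, and the dependency labels are hand-written assumptions that approximate what a spaCy English model might produce; a real parse may differ.

```python
from typing import List, NamedTuple, Tuple


class Tok(NamedTuple):
    """Hypothetical stand-in for a parsed token (not a spaCy object)."""
    text: str
    dep: str   # spaCy-style dependency label, assigned by hand here
    head: int  # index of the head token; the root points at itself


def naive_svo(tokens: List[Tok]) -> List[Tuple[str, str, str]]:
    """Pair every nsubj child with every dobj child of the same head."""
    triples = []
    for i, head_tok in enumerate(tokens):
        children = [t for j, t in enumerate(tokens) if t.head == i and j != i]
        subjects = [t.text for t in children if t.dep == "nsubj"]
        objects = [t.text for t in children if t.dep == "dobj"]
        triples.extend((s, head_tok.text, o) for s in subjects for o in objects)
    return triples


# "Hilary supports entrepreneurship." (assumed parse: plain transitive verb)
simple = [
    Tok("Hilary", "nsubj", 1),
    Tok("supports", "ROOT", 1),
    Tok("entrepreneurship", "dobj", 1),
]

# "Ada Lovelace was an English mathematician" (assumed parse: the complement
# of "was" is labeled attr, not dobj, so the rule has no object to pair)
copular = [
    Tok("Ada", "compound", 1),
    Tok("Lovelace", "nsubj", 2),
    Tok("was", "ROOT", 2),
    Tok("an", "det", 5),
    Tok("English", "amod", 5),
    Tok("mathematician", "attr", 2),
]

print(naive_svo(simple))   # [('Hilary', 'supports', 'entrepreneurship')]
print(naive_svo(copular))  # []
```

If the real parse of the Ada Lovelace text looks like the second case, printing each token's `dep_` and `head` from the spaCy `Doc` would show whether any verb has both a subject and a direct object for the rules to match.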

bdewilde commented 3 years ago

Hi again! As mentioned, SVO triple extraction has been totally reworked, and is included in today's v0.11 release: https://github.com/chartbeat-labs/textacy/releases/tag/0.11.0

cj2001 commented 3 years ago

Thank you, @bdewilde!