chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Support prepositional objects in SVO IE #312

Closed. 8W9aG closed this pull request 3 years ago.

8W9aG commented 3 years ago

Support the handling of prepositions when performing Subject-Verb-Object Information Extraction.

Description

The current Subject-Verb-Object extraction cannot look past a preposition that precedes a noun, so it fails to extract the object when the object is introduced by a preposition.

Motivation and Context

I ran the extraction on the test string "Barack Obama was born in Hawaii." and it did not return a Subject-Verb-Object triple.

How Has This Been Tested?

I run a modified version of textacy 0.10.1 that isolates the "subject_verb_object" function and supports spaCy 2.1.0. I tested the change in this modified version, and the input "Barack Obama was born in Hawaii." produces the triple ("Barack Obama", "born in", "Hawaii").
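
To make the intended behaviour concrete, here is a minimal, self-contained sketch of "looking past" a preposition. It is not the code from this PR and does not call textacy or spaCy: the `Tok` class, the helper names, and the hand-written dependency parse (with spaCy-style labels) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Tok:
    text: str
    dep: str   # spaCy-style dependency label (assumed, not parser output)
    head: int  # index of the head token; the root points to itself


# Hand-written parse of "Barack Obama was born in Hawaii." (an assumption).
PARSE = [
    Tok("Barack", "compound", 1),
    Tok("Obama", "nsubjpass", 3),
    Tok("was", "auxpass", 3),
    Tok("born", "ROOT", 3),
    Tok("in", "prep", 3),
    Tok("Hawaii", "pobj", 4),
    Tok(".", "punct", 3),
]


def children(toks: List[Tok], i: int, deps: Set[str]) -> List[int]:
    """Indices of tokens headed by token i whose label is in deps."""
    return [j for j, t in enumerate(toks) if t.head == i and j != i and t.dep in deps]


def noun_phrase(toks: List[Tok], i: int) -> str:
    """Token i plus its compound modifiers, in sentence order."""
    idxs = sorted(children(toks, i, {"compound"}) + [i])
    return " ".join(toks[j].text for j in idxs)


def svo_with_prep(toks: List[Tok]) -> List[Tuple[str, str, str]]:
    """Build (subject, 'verb prep', object) triples via verb -> prep -> pobj."""
    triples = []
    for v, tok in enumerate(toks):
        if tok.dep != "ROOT":
            continue
        subjects = children(toks, v, {"nsubj", "nsubjpass"})
        for p in children(toks, v, {"prep"}):
            for o in children(toks, p, {"pobj"}):
                for s in subjects:
                    verb = f"{tok.text} {toks[p].text}"
                    triples.append((noun_phrase(toks, s), verb, noun_phrase(toks, o)))
    return triples


print(svo_with_prep(PARSE))
```

With the parse above, the sketch yields the same triple reported in this section: ("Barack Obama", "born in", "Hawaii").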

Screenshots (if appropriate):

N/A

bdewilde commented 3 years ago

Hi @8W9aG, thanks for the PR and your patience. What you're doing in the code makes sense, but I think it would be helpful to see tangible, expected outputs by way of a couple of unit tests. Could you add a few representative examples to a test, for reference?

bdewilde commented 3 years ago

Hi @8W9aG, I just merged a fairly significant update to the subject_verb_object_triples() function into the develop branch (see PR #325), which improves the quality and generality of extracted SVO triples. It doesn't allow for adverbial clauses to count as objects as you have here, but it does handle the case where we have a passive verb and an agent object marked by a preposition: for example, "Code was written by Burton." => (["Code"], ["was", "written"], ["Burton"]).

After doing a fairly deep dive into grammar and dependency parsing for that PR, I think your case isn't technically an SVO and thus shouldn't be supported directly. That said, if you take a look under the function's hood, it should be relatively straightforward to adapt the code that extracts S+V then replace the code for O with something more inclusive. Does that seem reasonable?
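
As a rough illustration of that suggestion, the sketch below keeps the subject and verb extraction fixed and makes the object finder swappable. It is not textacy's implementation: `Tok`, `kids`, `strict_objects`, `inclusive_objects`, and `triples` are hypothetical names, and both parses are hand-written with spaCy-style labels that are assumed rather than produced by a parser.

```python
from typing import Callable, List, NamedTuple, Set, Tuple


class Tok(NamedTuple):
    text: str
    dep: str   # spaCy-style dependency label (assumed)
    head: int  # index of the head token; the root points to itself


def kids(toks: List[Tok], i: int, deps: Set[str]) -> List[int]:
    """Indices of tokens headed by token i whose label is in deps."""
    return [j for j, t in enumerate(toks) if t.head == i and j != i and t.dep in deps]


def strict_objects(toks: List[Tok], v: int) -> List[int]:
    """Direct objects, plus the agent of a passive verb ("by Burton")."""
    objs = kids(toks, v, {"dobj"})
    for a in kids(toks, v, {"agent"}):
        objs += kids(toks, a, {"pobj"})
    return objs


def inclusive_objects(toks: List[Tok], v: int) -> List[int]:
    """Everything strict_objects finds, plus objects of any preposition."""
    objs = strict_objects(toks, v)
    for p in kids(toks, v, {"prep"}):
        objs += kids(toks, p, {"pobj"})
    return objs


Triple = Tuple[List[str], List[str], List[str]]


def triples(toks: List[Tok], find_objects: Callable[[List[Tok], int], List[int]]) -> List[Triple]:
    """Shared S+V extraction; the O step is delegated to find_objects."""
    out = []
    for v, tok in enumerate(toks):
        if tok.dep != "ROOT":
            continue
        subj = [toks[j].text for j in kids(toks, v, {"nsubj", "nsubjpass"})]
        verb = [toks[j].text for j in sorted(kids(toks, v, {"aux", "auxpass"}) + [v])]
        obj = [toks[j].text for j in find_objects(toks, v)]
        if subj and obj:
            out.append((subj, verb, obj))
    return out


# Hand-written parses (assumptions): "Code was written by Burton." and
# "Obama was born in Hawaii."
CODE = [Tok("Code", "nsubjpass", 2), Tok("was", "auxpass", 2), Tok("written", "ROOT", 2),
        Tok("by", "agent", 2), Tok("Burton", "pobj", 3), Tok(".", "punct", 2)]
BORN = [Tok("Obama", "nsubjpass", 2), Tok("was", "auxpass", 2), Tok("born", "ROOT", 2),
        Tok("in", "prep", 2), Tok("Hawaii", "pobj", 3), Tok(".", "punct", 2)]

print(triples(CODE, strict_objects))
print(triples(BORN, strict_objects))
print(triples(BORN, inclusive_objects))
```

In this sketch the strict finder returns a triple for the passive-agent sentence and nothing for the "born in Hawaii" sentence, while the inclusive finder also picks up "Hawaii". Unlike the PR, the preposition is not appended to the verb here; that would be a separate design choice.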

bdewilde commented 3 years ago

Btw, you can see the broad range of cases handled by the new SVO function in its tests: https://github.com/chartbeat-labs/textacy/blob/develop/tests/extract/test_triples.py#L30-L118

8W9aG commented 3 years ago

👍 Seems reasonable, thanks @bdewilde. I will close this PR.