chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Extracting SVO triplets -> get token indexes? #299

Closed sunosnoc closed 4 years ago

sunosnoc commented 4 years ago

Dear all, thanks for the wonderful textacy library! The spaCy universe is the reason I switched from R to Python for my current project.

One of the processing steps in my project is extracting SVO triplets from a large text corpus, and I have been using the textacy.extract.subject_verb_object_triples() function for this. I am still relatively new to Python and maybe I just haven't figured it out yet, but does the function only return the triplets as tuples? Or is there also a reference to the original indexes of the tokens in the parsed doc?

In my case this would be very useful: I want the lemmatized version of the SVO triplets, so with the indexes I could just look the tokens up in the parsed doc.

Thanks! Finn

bdewilde commented 4 years ago

Hi @sunosnoc, the subject_verb_object_triples() function returns each triple as a tuple of three spaCy Span objects, which do keep track of their original indexes in the parsed doc. You'll probably want the .start and .end attributes, which are token indexes (with .end exclusive); see here for details: https://spacy.io/api/span#attributes
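For illustration, here is a minimal sketch of reading those indexes back from a triple. It uses only spaCy with a blank English pipeline (tokenizer only, no model download), and the triple is built by hand as a stand-in for one item produced by subject_verb_object_triples(); the sentence and the variable names are assumptions made for the example. It also assumes the triples are Span objects as described above, which may differ between textacy versions.

```python
import spacy

# Blank pipeline: tokenizer only, enough to demonstrate Span indexes.
nlp = spacy.blank("en")
doc = nlp("The cat chased the mice")

# Hand-built stand-in for one (subject, verb, object) triple of Spans.
svo = (doc[0:2], doc[2:3], doc[3:5])

# .start and .end are token indexes into the parent Doc (.end is exclusive).
indexes = [(span.start, span.end) for span in svo]
print(indexes)

# The indexes can be used to slice the original Doc again.
recovered = [doc[start:end].text for start, end in indexes]
print(recovered)
```

Each Span also exposes .start_char and .end_char if character offsets into the raw text are more convenient than token indexes.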

bdewilde commented 4 years ago

I should also note that if you just want the lemmatized versions of the spans, they also have a .lemma_ attribute.
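A small sketch of the .lemma_ route, under stated assumptions: it relies on spaCy v3, where Doc can be constructed with a lemmas argument, so the lemmas here are supplied by hand instead of coming from a trained pipeline. The example words, lemmas, and the hand-built triple are illustrative only; with a real parsed doc the lemmas would be whatever the loaded model assigns.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Assumption: spaCy v3, where Doc accepts per-token lemmas directly.
words = ["The", "cats", "chased", "the", "mice"]
lemmas = ["the", "cat", "chase", "the", "mouse"]
doc = Doc(Vocab(), words=words, lemmas=lemmas)

# Hand-built stand-in for one (subject, verb, object) triple of Spans.
svo = (doc[0:2], doc[2:3], doc[3:5])

# Span.lemma_ joins the lemmas of the tokens in the span.
svo_lemmas = tuple(span.lemma_ for span in svo)
print(svo_lemmas)

# Per-token lemmas are available too, if a list is more useful.
subject_token_lemmas = [token.lemma_ for token in svo[0]]
print(subject_token_lemmas)
```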

sunosnoc commented 4 years ago

Thanks so much! I was not aware that the span objects maintain indexes!