Return keyterm positions in original document when performing keyterm extraction

ChrisJBlake commented 3 years ago

context

I'm looking to get the original token positions of keyterms when performing keyterm extraction with e.g. TextRank, but this can apply to the other extractors. Example:

>>> doc = nlp("I survived because the fire inside me burned brighter than the fire around me.")
>>> textrank(doc, return_positions=True)
[("fire", 0.1, 4, 4)]

This would provide a mapping back to the original spaCy doc that can be used to find the keyterm regardless of how it was normalized during keyterm extraction.

proposed solution

The main solution I envision is to add a new keyword argument to the extractors, such as return_positions (defaulting to False so existing workflows don't break), that returns the original token indices alongside the term string and its score. Since textaCy has access to the Token positions while considering candidate terms (before they're normalized to strings), it would be a matter of passing these indices along when returning the set of candidate tuples. The main drawback is that this clutters the existing implementation with effectively two return types: the current List[Tuple[str, float]] if return_positions is False, or List[Tuple[str, float, int, int]] if True. Any preprocessing functions (such as _get_candidates for TextRank) would have to be modified to pass these indices along, and any code calling these functions would have to handle both return types.
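
A minimal sketch of the two return shapes, using plain Python lists rather than a spaCy Doc. The function name toy_extract and its frequency-based scoring are hypothetical stand-ins, not textaCy's API or TextRank itself; only the return_positions flag and the tuple layouts come from the proposal above.

```python
from typing import Callable, List, Tuple, Union

Scored = Tuple[str, float]
ScoredWithPos = Tuple[str, float, int, int]


def toy_extract(
    tokens: List[str],
    normalize: Callable[[str], str] = str.lower,
    return_positions: bool = False,
) -> Union[List[Scored], List[ScoredWithPos]]:
    """Hypothetical extractor: score each normalized token by relative frequency."""
    counts = {}
    first_seen = {}
    for i, tok in enumerate(tokens):
        term = normalize(tok)
        counts[term] = counts.get(term, 0) + 1
        # Remember the (start, end) token indices before normalization
        # throws the original form away. Only the first occurrence is kept.
        first_seen.setdefault(term, (i, i))

    total = len(tokens)
    # Highest count first; ties broken alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

    if not return_positions:
        return [(term, n / total) for term, n in ranked]
    return [(term, n / total, *first_seen[term]) for term, n in ranked]


tokens = "I survived because the fire inside me burned brighter than the fire around me".split()
without_pos = toy_extract(tokens)
with_pos = toy_extract(tokens, return_positions=True)
```

Callers have to branch on the flag to know which tuple length they get back, which is the clutter described above.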

alternative solutions?

One solution I considered (outside of the keyterm extraction functions) was to rescan the document for each keyterm, but that requires an additional pass through the Doc, whereas textaCy already has the original token positions, which would let a client immediately re-index into the Doc.

I have some proof-of-concept code adding this feature for TextRank in a fork, and would be willing to extend this feature to the other extractors if this sounds like a useful idea!

bdewilde commented 3 years ago

Hi @ChrisJBlake, I think I understand the use case, but it breaks down for a very common situation: when a keyterm appears more than once in a single document. Since frequency of occurrence is a good indicator of importance, higher-ranking terms are more likely to appear multiple times! If the positions were returned for such a keyterm, which appearance should they represent? And just for clarity: What are you doing with the returned positions?

ChrisJBlake commented 3 years ago

Hi @bdewilde, that's a very good point about frequent terms that I hadn't originally considered! Would it be reasonable to return a list of all positions for the given key phrase instead, to capture each of the original appearances? Though this may raise another issue after normalization, where terms with different contextual meanings map to the same normalized form, e.g. "The hotel was booked and it was full of books". A little contrived, but after lemmatization the extracted term "book" would map back to both "booked" and "books", despite their different meanings.
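
A rough sketch of the "list of all positions" idea, again on plain token lists rather than a spaCy Doc. TOY_LEMMAS and positions_by_term are hypothetical names made up for illustration; the tiny lookup table only stands in for a real lemmatizer.

```python
from collections import defaultdict

# Hypothetical lemma table standing in for a real lemmatizer.
TOY_LEMMAS = {"booked": "book", "books": "book"}


def toy_normalize(token):
    """Lowercase the token, then map it to its toy lemma if one is listed."""
    lowered = token.lower()
    return TOY_LEMMAS.get(lowered, lowered)


def positions_by_term(tokens, normalize):
    """Map each normalized term to every (start, end) token span it came from."""
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[normalize(tok)].append((i, i))
    return dict(positions)


tokens = "The hotel was booked and it was full of books".split()
spans = positions_by_term(tokens, toy_normalize)

# Both surface forms collapse onto the same normalized key.
book_spans = spans["book"]
book_forms = [tokens[start] for start, _ in book_spans]
```

Here book_spans holds one span for "booked" and one for "books", so the caller can highlight both but cannot tell from the normalized term alone that they mean different things.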

I'm using the returned positions to highlight each key phrase where it originally appears in the text, which is hard to do when the key phrases are normalized in a non-trivial way, such as lemmatization.
