Open ChrisJBlake opened 3 years ago
Hi @ChrisJBlake, I think I understand the use case, but it breaks down for a very common situation: when a keyterm appears more than once in a single document. Since frequency of occurrence is a good indicator of importance, higher-ranking terms are more likely to appear multiple times! If the positions were returned for such a keyterm, which appearance should they represent? And just for clarity: what are you doing with the returned positions?
Hi @bdewilde, that's a very good point about frequent terms that I hadn't originally considered! Would it be reasonable to return a list of all positions for the given key phrase instead, to capture each of the original appearances? Though this may also raise an issue after normalization, where terms with different contextual meanings map to the same normalized form, e.g. "The hotel was booked and it was full of books". It's a little contrived, but after lemmatization the extracted phrase "book" would map to both "booked" and "books", despite their different meanings.
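To make the "list of all positions" idea concrete, here is a minimal sketch in plain Python. It does not use spaCy or textaCy: `TOY_LEMMAS` is a hypothetical lookup table standing in for a real lemmatizer, and `positions_by_lemma` is an illustrative name, not an existing function.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical lemma lookup standing in for a real lemmatizer (assumption).
TOY_LEMMAS = {"booked": "book", "books": "book"}


def positions_by_lemma(tokens: List[str]) -> Dict[str, List[int]]:
    """Map each normalized form to every token index where it occurs."""
    positions: Dict[str, List[int]] = defaultdict(list)
    for i, tok in enumerate(tokens):
        lemma = TOY_LEMMAS.get(tok.lower(), tok.lower())
        positions[lemma].append(i)
    return dict(positions)


tokens = "The hotel was booked and it was full of books".split()
book_positions = positions_by_lemma(tokens)["book"]
# Both surface forms collapse onto the same normalized key.
surface_forms = [tokens[i] for i in book_positions]
```

Here the single key "book" collects the indices of both "booked" and "books", which is exactly the collision described above: the positions are all recoverable, but the caller has to decide whether the two senses belong together.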
I'm using the returned positions to highlight each key phrase where it originally appears in the text. That mapping is hard to recover when the key phrases have been normalized in a non-trivial way, such as lemmatization.
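As a rough illustration of the highlighting use case, the sketch below wraps half-open token spans in `**` markers. It works on a plain list of token strings rather than a spaCy `Doc`, assumes the spans do not overlap, and `highlight` is a hypothetical helper name.

```python
from typing import List, Tuple


def highlight(tokens: List[str], spans: List[Tuple[int, int]]) -> str:
    """Wrap each half-open token span [start, end) in ** markers.

    Assumes spans do not overlap (an assumption of this sketch).
    """
    starts = {start for start, _ in spans}
    ends = {end for _, end in spans}
    out = []
    for i, tok in enumerate(tokens):
        prefix = "**" if i in starts else ""
        suffix = "**" if i + 1 in ends else ""
        out.append(f"{prefix}{tok}{suffix}")
    return " ".join(out)


tokens = "The hotel was booked and it was full of books".split()
highlighted = highlight(tokens, [(3, 4), (9, 10)])
```

With token spans in hand, the original surface forms are highlighted directly, with no need to reverse the normalization.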
context
I'm looking to get the original token positions of keyterms when performing keyterm extraction with e.g. TextRank, but this can apply to the other extractors. Example:
This would provide a mapping back to the original spaCy `doc` that can be used to find the keyterm regardless of how it was normalized during keyterm extraction.
proposed solution
The main solution I envision would be to add a new keyword argument to the extractors such as `return_positions` (defaulting to `False` to not break existing workflows) that would add the original token indices in addition to the term as a string and its score. Since textaCy can access the `Token` positions while considering candidate terms (before they're normalized to strings), it would be a matter of passing these indices along when returning the set of candidate tuples. The main issues with this approach would be that it would clutter the existing implementation with effectively two return types: the current `List[Tuple[str, float]]` if `return_positions` is `False`, or `List[Tuple[str, float, int, int]]` if `True`. Any preprocessing functions (such as `_get_candidates` for TextRank) would have to be modified to pass these indices along, and any code calling these functions now has to handle different return types.
alternative solutions?
One solution I considered (outside of the keyterm extraction functions) was to rescan the document for each keyterm, but that ultimately requires scanning through the `Doc` an additional time, whereas textaCy already has the original token positions, allowing a client to immediately re-index into the `Doc`.
I have some proof-of-concept code adding this feature for TextRank in a fork, and would be willing to extend this feature to the other extractors if this sounds like a useful idea!
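For comparison, here is a minimal sketch of the two options discussed above: a `return_positions` flag on the extractor versus rescanning afterwards. It is not the proof-of-concept from the fork and not textaCy's API. `toy_keyterms` ranks by plain frequency rather than TextRank, `TOY_LEMMAS` and `STOP_WORDS` are hypothetical stand-ins, and the flag returns a list of `(start, end)` token spans per term (to cover repeated terms) instead of the single `int, int` pair in the proposed signature.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical stand-ins for lemmatization and stop-word filtering.
TOY_LEMMAS = {"booked": "book", "books": "book"}
STOP_WORDS = {"the", "was", "and", "it", "of"}

Span = Tuple[int, int]  # half-open token span [start, end)


def normalize(token: str) -> str:
    return TOY_LEMMAS.get(token.lower(), token.lower())


def toy_keyterms(tokens: List[str], topn: int = 3, return_positions: bool = False):
    """Rank normalized tokens by relative frequency (a toy, not TextRank)."""
    counts: Counter = Counter()
    positions: Dict[str, List[Span]] = {}
    for i, tok in enumerate(tokens):
        term = normalize(tok)
        if term in STOP_WORDS:
            continue
        counts[term] += 1
        # Positions are known here, before the term is reduced to a string.
        positions.setdefault(term, []).append((i, i + 1))
    total = sum(counts.values())
    ranked = [(term, n / total) for term, n in counts.most_common(topn)]
    if not return_positions:
        return ranked  # existing shape: (term, score)
    return [(term, score, positions[term]) for term, score in ranked]


def rescan(tokens: List[str], term: str) -> List[Span]:
    """Alternative: a second pass that re-normalizes every token."""
    return [(i, i + 1) for i, tok in enumerate(tokens) if normalize(tok) == term]


tokens = "The hotel was booked and it was full of books".split()
plain = toy_keyterms(tokens, topn=1)
with_positions = toy_keyterms(tokens, topn=1, return_positions=True)
rescanned = rescan(tokens, "book")
```

Both routes yield the same spans in this toy case. The difference is that `rescan` has to repeat the normalization over the whole token list for each term, while the flag reuses indices the extractor already had, at the cost of the two return shapes noted in the proposal.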