Problem ranking text containing abbreviation, such as U.S.A

DavidBelicza / TextRank

:wink: :cyclone: :strawberry: TextRank implementation in Golang with extendable features (summarization, phrase extraction) and multithreading (goroutine).

MIT License

204 stars, 22 forks

Hi, first of all thanks for this library, you are awesome 🚀

I'm having an issue ranking text that contains abbreviations such as U.S.A (short for United States of America) or No. 7 (short for Number 7), because the period (.) is currently used here https://github.com/DavidBelicza/TextRank/blob/master/parse/rule.go#L21 to set the bounds of words.

Do you currently have a way to get around this problem? Or should I simply create a new rule implementing the Rule interface that checks for known abbreviations?

Hi @xD0135, I faced this issue too a while ago. I left it as it is because any solution would be domain-specific. As you mentioned, implementing the Rule interface can be the solution.

If you create a whitelist of tokens (known abbreviations) that are exempt from the separator check and are kept as whole tokens, that could work. However, I think this would be too domain-specific for this repo.
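A minimal sketch of the whitelist idea, written against the standard library only and not against this package's Rule interface. The names `knownAbbreviations`, `endsSentence` and `splitSentences` are hypothetical, and the whitelist content is an assumption you would replace with your own domain's abbreviations:

```go
package main

import (
	"fmt"
	"strings"
)

// knownAbbreviations is a hypothetical, domain-specific whitelist.
// Keys are lower-cased tokens, written exactly as they appear in the text.
var knownAbbreviations = map[string]bool{
	"u.s.a": true,
	"no.":   true,
}

// endsSentence reports whether a whitespace-delimited token closes a sentence.
// Whitelisted tokens never close a sentence, even if they end with a period.
func endsSentence(token string, abbreviations map[string]bool) bool {
	if abbreviations[strings.ToLower(token)] {
		return false
	}
	return strings.HasSuffix(token, ".") ||
		strings.HasSuffix(token, "!") ||
		strings.HasSuffix(token, "?")
}

// splitSentences splits text into sentences, keeping whitelisted tokens intact.
// Tokens are split on whitespace only, so periods inside a token never split it.
func splitSentences(text string, abbreviations map[string]bool) []string {
	var sentences []string
	var current []string
	for _, token := range strings.Fields(text) {
		current = append(current, token)
		if endsSentence(token, abbreviations) {
			sentences = append(sentences, strings.Join(current, " "))
			current = nil
		}
	}
	// Keep any trailing text that has no closing punctuation.
	if len(current) > 0 {
		sentences = append(sentences, strings.Join(current, " "))
	}
	return sentences
}

func main() {
	text := "The U.S.A is large. See No. 7 for details."
	for _, s := range splitSentences(text, knownAbbreviations) {
		fmt.Printf("%q\n", s)
	}
	// Prints:
	// "The U.S.A is large."
	// "See No. 7 for details."
}
```

One limitation of this sketch: a whitelisted abbreviation that also ends a sentence (for example "I live in apartment No.") will not close that sentence, which is part of why the list has to be tuned per domain.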

Alternatively, the sentence separator list in the Rule could contain ". " or ".\n" instead of ".". But in that case not all texts would be parsed well, and I would need to know how this package is generally used. If the text usually originates from emails, forums, or chats, then changing the sentence separator could work. But if the text comes from parsed books, it could break the tokenization.
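To illustrate the trade-off of the ". " / ".\n" separator, here is a rough standalone sketch, again using only the standard library rather than this package's parser. The function name `splitOnPeriodWhitespace` is hypothetical, and for brevity it only looks at the period, not at "!" or "?":

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitOnPeriodWhitespace treats a period as a sentence boundary only when it
// is followed by whitespace (". " or ".\n") or when it is the last character.
func splitOnPeriodWhitespace(text string) []string {
	var sentences []string
	runes := []rune(text)
	start := 0
	for i, r := range runes {
		if r != '.' {
			continue
		}
		atEnd := i == len(runes)-1
		if atEnd || unicode.IsSpace(runes[i+1]) {
			if s := strings.TrimSpace(string(runes[start : i+1])); s != "" {
				sentences = append(sentences, s)
			}
			start = i + 1
		}
	}
	// Keep any trailing text that has no closing period.
	if s := strings.TrimSpace(string(runes[start:])); s != "" {
		sentences = append(sentences, s)
	}
	return sentences
}

func main() {
	text := "The U.S.A is large. It has 50 states.\nSee No. 7."
	for _, s := range splitOnPeriodWhitespace(text) {
		fmt.Printf("%q\n", s)
	}
	// Prints:
	// "The U.S.A is large."
	// "It has 50 states."
	// "See No."
	// "7."
}
```

With this input, "U.S.A" stays intact because its periods are not followed by whitespace, but "No. 7" is still split because its period is followed by a space. Text that has no whitespace after its periods would not be split at all, which is the kind of breakage to expect on some parsed sources.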

Problem ranking text containing abbreviation, such as U.S.A #14