chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Fix normalization for the keyword extractor YAKE #332

Closed mirkolenz closed 3 years ago

mirkolenz commented 3 years ago

Description

Add the normalization option 'normalize="norm"' for YAKE and change the behavior of the option 'normalize=None' so that it returns the token's 'orth' attribute.

Motivation and Context

The documentation says that setting 'normalize=None' for YAKE returns the terms as they appear in the original document. Currently, however, the token's 'norm' attribute is returned, which can differ from the original text (e.g., the token 'centres' would be extracted as 'centers'). Thus, I use the 'orth' attribute when 'normalize=None' is set. The TextRank algorithm uses the same attribute. Additionally, I added the option 'normalize="norm"' so that the current behavior remains available.
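
To illustrate the intended mapping from the 'normalize' option to token attributes, here is a minimal sketch. It does not reproduce textacy's actual code: the `Token` class is a hypothetical stand-in that mirrors only the `orth_`, `norm_`, and `lemma_` string attributes of a spaCy token, `term_text` is an illustrative helper name, and the attribute values are filled in by hand from the 'centres' example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a spaCy token (not the real class)."""

    orth_: str   # verbatim text as it appears in the document
    norm_: str   # normalized form, e.g. with spelling variants unified
    lemma_: str  # base form of the word


def term_text(token: Token, normalize: Optional[str]) -> str:
    """Return the string used to represent a token in an extracted term.

    Illustrative helper only: None maps to the original text, while
    "norm" keeps the behavior that None had before this change.
    """
    if normalize is None:
        return token.orth_
    if normalize == "norm":
        return token.norm_
    if normalize == "lemma":
        return token.lemma_
    if normalize == "lower":
        return token.orth_.lower()
    raise ValueError(f"unsupported normalize value: {normalize!r}")


# Hand-filled values for illustration; real values come from the spaCy pipeline.
token = Token(orth_="centres", norm_="centers", lemma_="centre")

for option in (None, "norm", "lemma", "lower"):
    print(option, "->", term_text(token, option))
```

With this mapping, 'normalize=None' yields 'centres' (matching the documentation), while 'normalize="norm"' still yields 'centers' for anyone relying on the previous output.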

How Has This Been Tested?

I added the corresponding tests in tests/extract/keyterms/test_yake.py.

mirkolenz commented 3 years ago

I just updated this PR to remove two assertions that would not universally hold (thus causing some tests to fail).

bdewilde commented 3 years ago

Hi @MirkoLenz, thanks very much for catching and fixing this! Everything looks good. Since it technically changes functionality, I'm going to point it at the develop branch (instead of master) and then merge it in. Thanks again.

Scratch that, GitHub gave me a strange warning about "losing commits" if I switch the base branch to develop, so we're just going to roll right into master. It'll probably be fine... 😅