Fix normalization for the keyword extractor YAKE

mirkolenz commented 3 years ago

Description

Add the normalization option 'normalize="norm"' for YAKE and change the behavior of the option 'normalize=None' so that it returns the token's 'orth' attribute.

Motivation and Context

The documentation says that setting 'normalize=None' for YAKE returns the terms as they appear in the original document. Currently, however, the token's 'norm' attribute is returned, which can differ from the original representation (e.g., the token 'centres' would be extracted as 'centers'). This PR therefore uses the 'orth' attribute when 'normalize=None' is set, which is the same attribute the TextRank algorithm uses. Additionally, I added the option 'normalize="norm"' so that the current behavior can still be used.
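
To illustrate the intended mapping, here is a minimal, self-contained sketch. It is not the code from this PR: `Token` is a hypothetical stand-in for a spaCy token, `term_text` is a hypothetical helper, and the attribute values (including the lemma) are made up for the 'centres' example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a spaCy token (illustration only)."""
    orth_: str   # verbatim text as it appears in the document
    norm_: str   # normalized form, may differ in spelling
    lower_: str  # lowercased verbatim text
    lemma_: str  # base form


def term_text(token: Token, normalize: Optional[str]) -> str:
    """Return the string form of a token for the given normalize option."""
    if normalize is None:
        # Proposed behavior: return the term exactly as written.
        return token.orth_
    if normalize in ("lemma", "lower", "norm"):
        # "norm" keeps the previous behavior of normalize=None available.
        return getattr(token, f"{normalize}_")
    raise ValueError(f"unsupported normalize value: {normalize!r}")


# Illustrative values only; real values depend on the spaCy pipeline.
tok = Token(orth_="centres", norm_="centers", lower_="centres", lemma_="centre")

original = term_text(tok, None)       # "centres"
normalized = term_text(tok, "norm")   # "centers"
```

With this mapping, 'normalize=None' matches the documented behavior, and callers who relied on the old output can opt in with 'normalize="norm"'.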

How Has This Been Tested?

I added the corresponding tests in tests/extract/keyterms/test_yake.py.

Screenshots (if appropriate):

Types of changes

[x] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

[x] My code follows the code style of this project.
[x] My change requires a change to the documentation, and I have updated it accordingly.

mirkolenz commented 3 years ago

I just updated this PR to remove two assertions that would not universally hold (thus causing some tests to fail).

bdewilde commented 3 years ago

Hi @MirkoLenz, thanks very much for catching and fixing this! Everything looks good. Since it technically changes functionality, I'm going to point it at the develop branch (instead of master) and then merge it in. Thanks again.

Scratch that, GitHub gave me a strange warning about "losing commits" if I switch the base branch to develop, so we're just going to roll right into master. It'll probably be fine... 😅

chartbeat-labs / textacy