This PR can be summarized in the following changelog entry:
Improves internal linking suggestions for English by normalizing relevant words with a stemming mechanism.
Removes 2-, 3-, 4-, and 5-word prominent word combinations.
Relevant technical choices:
Multi-word combinations (two to five words) were removed from the relevant words analysis because they were not effective for internal linking suggestions and took up a lot of computational resources.
Test instructions
This PR can be tested by following these steps:
Use the example
Make sure to specify the en_EN locale
Check that no 2-, 3-, 4-, or 5-word combinations are displayed among the relevant words
Check that words used in the paper attributes (title, keyphrase, synonyms, metadescription, subheadings) and in the main text of the copy get collapsed correctly. That is, there should be no duplicate words in the output table, and the number of occurrences for every word should be occurrences_in_text + 3*occurrences_in_attributes
Check if a stem is displayed next to every relevant word
Check that no two entries in the table share the same stem, i.e., that all same-stem words were collapsed correctly. To test this, you can add different forms of a word to the text and to the paper attributes
Check that stems that occur only once in the text do not end up in the list
Check that a stem used in the text in several different forms, even when some of those forms occur only once, does end up in the list and is counted correctly
Check that function words do not end up in the list
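The collapsing and counting rules checked in the steps above can be sketched roughly as follows. This is an illustration only, not the code from this PR: `toy_stem`, `FUNCTION_WORDS`, `relevant_words`, and the `minimum` threshold are hypothetical stand-ins, and applying the threshold to the weighted count is an assumption. Only the weight of 3 for attribute occurrences comes from the formula above.

```python
from __future__ import annotations

import re
from collections import Counter

# Hypothetical stand-ins: this is not the plugin's function word list.
FUNCTION_WORDS = {"a", "an", "the", "and", "of", "in", "to", "is", "are"}

# Taken from the test steps: occurrences_in_text + 3*occurrences_in_attributes.
ATTRIBUTE_WEIGHT = 3


def toy_stem(word: str) -> str:
    """Strip a few common English suffixes. A toy, not a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def relevant_words(text: str, attributes: list[str], minimum: int = 2) -> dict[str, int]:
    """Count occurrences per stem, weighting attribute occurrences by 3."""
    counts = Counter()
    for word in tokenize(text):
        if word not in FUNCTION_WORDS:
            counts[toy_stem(word)] += 1
    for attribute in attributes:
        for word in tokenize(attribute):
            if word not in FUNCTION_WORDS:
                counts[toy_stem(word)] += ATTRIBUTE_WEIGHT
    # Assumption: stems whose weighted count stays below `minimum` are dropped.
    return {stem: count for stem, count in counts.items() if count >= minimum}


result = relevant_words("Linking links linked. The link is a cat.", ["Links"])
print(result)  # {'link': 7}
```

Here the four text forms of "link" collapse into one entry with a count of 4, the single occurrence in the title adds 3, "cat" is dropped because it occurs only once, and the function words never enter the count.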
Important!
This PR changes the processing of relevant words, a concept that is used in two different features in the plugin: internal linking suggestions and insights. It was agreed that the changes implemented here affect the internal linking suggestions feature, on the assumption that the insights feature would be removed from the plugin entirely. According to recent research, however, our users do make use of the insights feature. We need to decide whether we still want to remove the insights functionality, or whether we want to decouple insights from internal linking suggestions completely and use the old mechanism of relevant words computation for insights and the new one (from this PR) for internal linking suggestions.
Please do not merge this PR before this has been decided ☝️
Fixes #2139