Add paper attributes to internal linking research

nataliashitova commented 5 years ago

Summary

This PR can be summarized in the following changelog entry:

Uses words from keyphrase, metadescription, title, and subheadings for internal linking suggestions

Relevant technical choices:

The current mechanism might generate duplicates of terms used for internal linking if words or phrases used in paper attributes (keyphrase, metadescription, title, and subheadings) also pass the density requirements in the text of the paper itself. This problem will be solved by the same functionality that will take care of different morphological forms (see #2139)

Test instructions

This PR can be tested by following these steps:

Use the example to check if words from paper attributes are added to the list of relevant words.
Try adding more tests to check if words from paper attributes end up in the relevant words combinations.

Fixes #2114

igorschoester commented 5 years ago

Acceptance 🚧

I have some trouble knowing what specifically needs to get tested here.

When I have a text of 20 lines with: hello world, I get 3 relevant word entries: hello, world and hello world. Just like before. However, when I set the keyphrase (this is the same with any other paper attributes -- except title, see below) to hello world I would've expected to count towards all of them. But it adds two more entries for only hello and world. You told me this will be added in the next PR. Still, I would expect it to count towards the whole keyphrase hello world too. Is this on purpose?
A mismatch between changelog and testing: The SEO title is not used to determine relevant words.
Adding <h2>hello world</h2> to the content adds a count to all the non-special entries. And includes the 2 special entries for 5 occurences. Therefor, it is counted twice. Should this be?
With the above situation the density seems influenced by the <h2></h2> part, since it is lower than just adding hello world as keyphrase for example. 0.119 instead of the expected 0.125. Thinking of this, the actual expected is 0.5 for the keyphrase and others, right?
Adding hello world to all the attributes does not alter the amount of occurrences. Is this on purpose?

nataliashitova commented 5 years ago

@igorschoester Thanks for your great questions!

When I have a text of 20 lines with: hello world, I get 3 relevant word entries: hello, world and hello world. Just like before. However, when I set the keyphrase (this is the same with any other paper attributes -- except title, see below) to hello world I would've expected to count towards all of them. But it adds two more entries for only hello and world. You told me this will be added in the next PR. Still, I would expect it to count towards the whole keyphrase hello world too. Is this on purpose?

Yes. It did not add any more-than-one-word-long word combinations from the paper attributes, because they will soon be deprecated anyway.

A mismatch between changelog and testing: The SEO title is not used to determine relevant words.

My bad 🤦‍♀️

Adding <h2>hello world</h2> to the content adds a count to all the non-special entries. And includes the 2 special entries for 5 occurences. Therefor, it is counted twice. Should this be?

Yes, for now. I will take care of this inconsistency in one of the later PRs that will merge same-word occurrences in the list of relevant words.

With the above situation the density seems influenced by the <h2></h2> part, since it is lower than just adding hello world as keyphrase for example. 0.119 instead of the expected 0.125. Thinking of this, the actual expected is 0.5 for the keyphrase and others, right?

Just as with the one above, I will solve this in the next PR. In general, density is of no importance per se, but it serves a proxy of importance. Therefore, as long as we make sure the right words are being saved in the database, the theoretical impeccability of our ways of calculating the density of relevant words is not important.

Adding hello world to all the attributes does not alter the amount of occurrences. Is this on purpose?

Right now I am saving all words from attributes with the number of occurrences = 5, which was chosen pretty much at random. The next PR with fine-tun the mechanism counting occurrences in text and attributes, as well as selecting good words to be saved in the database.

nataliashitova commented 5 years ago

@hansjovis I agree that we should not allow duplicates in the list of relevant words. It makes a lot of sense to take care of such potential duplicates together with collapsing relevant words based on the stemming mechanism (which was implemented in #2150). What I want to do is:

Take the list made of relevant words coming from the text as well as from the attributes (i.e., with potential duplicates)
Assign a stem to every word
Collapse the relevant words if they have the same stem while keeping number_of_occurrences (both in text and in attributes) valid.

Obviously, duplicates will have the same stem and would be collapsed by this stemming merging mechanism for free.

To reconcile both parties I created a feature branch for internal linking: This will allow to keep development in small chunks and not decrease the quality of the current suggestions. What do you think?

igorschoester commented 5 years ago

A mismatch between changelog and testing: The SEO title is not used to determine relevant words.

Acceptance on this 👍

Unfortunately, the devtool now compiles with the following warning:

./src/components/RelevantWords.js
  Line 75:  Function 'getRelevantWords' has a complexity of 7  complexity

Search for the keywords to learn more about each warning.
To ignore, add // eslint-disable-next-line to the line before.

After a small discussion, it seems best to disable this rule for the example.

igorschoester commented 5 years ago

Merging on the premise that the above issues will be taken into consideration in the feature branch.

Yoast / YoastSEO.js