Yoast / YoastSEO.js

Analyze content on a page and give SEO feedback as well as render a snippet preview.
GNU General Public License v3.0

Keyword is highlighted more times than stated in analysis #2155

Open dariaknl opened 5 years ago

dariaknl commented 5 years ago

Tested with WP 5.0.3 and 10.1 beta2.

How can we reproduce this behavior?

  1. Enter the following text

Maecenas apple a auctor mi. Etiam sapien nulla, eleifend quis convallis sed, sodales vel felis. Vestibulum ante psum primis in faucibus orci luctus et ultrices posuere cubilia Curae;

apple

Fusce malesuada lectus  bi sit ametapple cursus iaculis. Donec sit amet dignissim lectus. Aenean at ante quis quam eleifend volutpat vitae quis nulla. Vestibulum ultrices lectus nec lacus pulvappleinar lobortis. Pellentesque vehicula nulla eu interdum blandit. Curabitur eleifend nulla leo. Aliquam vehicula lacus id orci euismod, non rhoncus lacus cursus. Fusce vitae arcu quis felis ornare sagittis. Proin ultricies purus et molestie. Maecenas apple a auctor mi. Etiam sapien nulla, eleifend quis convallis sed, sodales vel felis. Vestibulum ante psum primis in faucibus orci luctus et ultrices posuere cubilia Curae;

  2. Fill in the keyword "apple".

  3. Check the assessment:

    screenshot 2019-02-20 at 13 41 23
  4. Toggle the eye marker:

    screenshot 2019-02-20 at 13 42 00

Expected: only the word "apple" is highlighted, 3 times.

nataliashitova commented 5 years ago

Interesting fact: the spurious highlights only occur when there is a heading "apple". If I remove the heading, or change it to "apple bla bla" or "Apple", only the correct instances of the keyphrase are highlighted.

nataliashitova commented 5 years ago

Mystery solved

The current mechanism of highlighting keyphrases is as follows.

  1. Search for all possible ways the words from the keyphrase occur in the text. In this example, this step will result in just apple.
  2. Take sentences one by one and hang highlight-tags on words from the keyphrase in them. In this example, this step will result in
    [
    "Maecenas <yoastmark class='yoast-text-mark'>apple</yoastmark> a auctor mi.", 
    "<yoastmark class='yoast-text-mark'>apple</yoastmark>",
    "Maecenas <yoastmark class='yoast-text-mark'>apple</yoastmark> a auctor mi."
    ]
  3. Mark any instances of these sentences in the text. This is the moment when it goes wrong: The word "ametapple" is considered to include the sentence "<yoastmark class='yoast-text-mark'>apple</yoastmark>".
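
To make the failure concrete, here is a minimal sketch of steps 2 and 3. It is not the actual YoastSEO.js code: `markSentence`, `markText` and the sample data are hypothetical names, and the substring replacement in `markText` is an assumption about how step 3 behaves, based on the description above.

```javascript
// Hypothetical simplification of the steps above, NOT the real YoastSEO.js code.
const MARK_OPEN = "<yoastmark class='yoast-text-mark'>";
const MARK_CLOSE = "</yoastmark>";

// Step 2: wrap whole-word occurrences of the keyword inside one sentence.
// Assumes the keyword contains no regex special characters.
function markSentence(sentence, keyword) {
  const wholeWord = new RegExp("\\b" + keyword + "\\b", "g");
  return sentence.replace(wholeWord, MARK_OPEN + keyword + MARK_CLOSE);
}

// Step 3 (assumed behaviour): replace every occurrence of each keyword-bearing
// sentence in the text with its marked version, using plain substring matching.
function markText(text, sentences, keyword) {
  let result = text;
  for (const sentence of sentences) {
    const marked = markSentence(sentence, keyword);
    if (marked === sentence) continue; // no keyword in this sentence
    result = result.split(sentence).join(marked);
  }
  return result;
}

// A heading that consists of the keyword only, followed by a sentence
// that contains the keyword inside a longer word.
const exampleText = "apple\nFusce sit ametapple cursus.";
const exampleSentences = ["apple", "Fusce sit ametapple cursus."];

const markedExample = markText(exampleText, exampleSentences, "apple");
const markCount = markedExample.split(MARK_OPEN).length - 1;

console.log(markedExample);
console.log(markCount); // 2 marks, although "apple" occurs only once as a word
```

Because the sentence "apple" is matched as a plain substring, the "apple" inside "ametapple" gets wrapped as well, which is the extra highlight seen in the screenshots.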

How bad is that?

The problem occurs if the following two conditions are fulfilled:

  1. There exists a sentence with keyphrase words in the text.
  2. There exists a word in the text that includes this sentence entirely (case-sensitive).
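
The two conditions can be checked with a short script. This is only a sketch under simplifying assumptions: `findRiskyWords` is a hypothetical helper, sentences are passed in already split, and words are obtained by splitting on anything that is not a letter or digit.

```javascript
// Hypothetical helper, not part of YoastSEO.js.
// Returns the words that entirely contain a keyword-bearing sentence
// (case-sensitive) without being equal to that sentence.
function findRiskyWords(text, sentences, keyword) {
  const nonWord = /[^\p{L}\p{N}]+/u;

  // Condition 1: sentences that contain the keyword as a separate word.
  const keywordSentences = sentences.filter((s) =>
    s.split(nonWord).includes(keyword)
  );

  // Condition 2: words that include one of those sentences entirely.
  const words = text.split(nonWord).filter((w) => w.length > 0);
  return words.filter((word) =>
    keywordSentences.some((s) => word !== s && word.includes(s))
  );
}

const riskText =
  "Maecenas apple a auctor mi.\napple\n" +
  "Fusce sit ametapple cursus. Lacus pulvappleinar lobortis. Apple pie.";
const riskSentences = [
  "Maecenas apple a auctor mi.",
  "apple",
  "Fusce sit ametapple cursus.",
  "Lacus pulvappleinar lobortis.",
  "Apple pie.",
];

const riskyWords = findRiskyWords(riskText, riskSentences, "apple");
console.log(riskyWords); // [ 'ametapple', 'pulvappleinar' ]
```

Only the one-word sentence "apple" can fit inside another word, and "Apple" is not flagged because the comparison is case-sensitive.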

This is unlikely to happen in regular text. However, if the keyphrase includes short (non-function) words and one or more of them is used alone as a sentence (for instance, as a heading), the problem demonstrated in this issue can occur.

We at Team Lingo believe it's an edge case, which is unlikely to bother a lot of users and which can wait for the new Tree Parser to solve it. @moorscode do you agree?