Yoast / YoastSEO.js

Analyze content on a page and give SEO feedback as well as render a snippet preview.
GNU General Public License v3.0

Add morphological analysis for YoastSEO.js assessments #1500

Open nataliashitova opened 6 years ago

nataliashitova commented 6 years ago

We could benefit from a new function that generates all possible word forms for a word. Right now, word forms are not lemmatized within YoastSEO.js. This means that if the user performs, for instance, keyword research, only exact (case-insensitive) matches of the keyword are found. The same holds for prominent words and internal linking suggestions.

Current result: Keyword: "text". Paper: "That is a short text, nothing special, just as many other texts." Keyword density: 0.083 (1 match in 12 words)

That is wrong: we should be able to recognize word forms in the text, at least as well as Google does.

Expected result: Keyword: "text". Paper: "That is a short text, nothing special, just as many other texts." Keyword density: 0.17 (2 matches in 12 words)
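To make the two numbers concrete, here is a minimal sketch of the calculation. The helper names `tokenize` and `keywordDensity` are made up for this illustration and are not part of YoastSEO.js; the tokenizer is deliberately simplified to ASCII letters.

```javascript
// Split text into lowercase word tokens (simplified: ASCII letters only).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z]+/g) || [];
}

// Share of tokens that equal one of the accepted word forms.
function keywordDensity(text, forms) {
  const words = tokenize(text);
  if (words.length === 0) return 0;
  const accepted = new Set(forms.map((form) => form.toLowerCase()));
  const matches = words.filter((word) => accepted.has(word)).length;
  return matches / words.length;
}

const paper = "That is a short text, nothing special, just as many other texts.";

// Exact matching only: 1 match in 12 words.
const exactOnly = keywordDensity(paper, ["text"]);
// With the plural recognized as a form of the keyword: 2 matches in 12 words.
const withForms = keywordDensity(paper, ["text", "texts"]);

console.log(exactOnly.toFixed(3)); // 0.083
console.log(withForms.toFixed(2)); // 0.17
```

The only difference between the two calls is the list of accepted forms, which is exactly what a morphological component would have to supply.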

We should add morphological analysis for the languages that are morphologically supported by Google (English, German, Russian, Dutch to begin with).

There are a few ways to achieve that:

  1. Ship a dictionary with all word forms to the user (based on the language they write in). This solution requires architectural changes (e.g., the ability to ship components on demand and store them locally, but outside the browser). It also appears challenging to obtain open-source morphological dictionaries for all the languages we are interested in. That said, this method will probably generate the most accurate results for all languages.

  2. Build all word forms within the plugin for the given keyword only. This solution will not require architectural changes, but it will require some development within the plugin code (e.g., regexes for regular forms, word lists for irregular ones). I expect that it will work well for languages with rather regular and poor morphology (English, Dutch), but it might be challenging (though not impossible) for languages with somewhat more irregular and richer morphology (e.g., Russian). There are multiple projects on Git that we could use to help development, predominantly MIT licensed (references will be added).

  3. Ask the user to fill in all the possible forms for a given keyword. This solution is probably the easiest to implement, but (1) it is error-prone; (2) it will only work for keywords, not for internal linking suggestions. The accuracy of the method is difficult to estimate: it would depend on how aware the user is of morphology and how well they can conjugate/inflect words of their native language.
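As a rough illustration of option 2, the sketch below builds forms for an English noun keyword from a few regex rules plus a list of irregulars. Everything in it is an assumption made for the example: `buildNounForms` and `IRREGULAR_NOUNS` are hypothetical names, the rules cover only plurals, and a real implementation would need far more rules and exceptions.

```javascript
// Hypothetical irregular plurals; a real list would be much longer.
const IRREGULAR_NOUNS = new Map([
  ["child", ["children"]],
  ["mouse", ["mice"]],
]);

// Build a small set of forms (singular plus plural) for an English noun.
function buildNounForms(keyword) {
  const word = keyword.toLowerCase();
  const forms = new Set([word]);

  if (IRREGULAR_NOUNS.has(word)) {
    // Irregular forms come from the word list, not from rules.
    IRREGULAR_NOUNS.get(word).forEach((form) => forms.add(form));
  } else if (/(s|x|z|ch|sh)$/.test(word)) {
    forms.add(word + "es"); // box -> boxes
  } else if (/[^aeiou]y$/.test(word)) {
    forms.add(word.slice(0, -1) + "ies"); // city -> cities
  } else {
    forms.add(word + "s"); // text -> texts
  }

  return [...forms];
}

console.log(buildNounForms("text"));  // [ 'text', 'texts' ]
console.log(buildNounForms("city"));  // [ 'city', 'cities' ]
console.log(buildNounForms("child")); // [ 'child', 'children' ]
```

The split between the rule branches and the irregular list mirrors the "regexes for regular forms, word lists for irregulars" idea; for a language like Russian, the rule part would grow considerably.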

Provisional conclusion: It might be worthwhile to try implementing different methods and see which one works best. Realistically, we will probably need different methods for different languages (i.e., regexes for English/Dutch; dictionaries for Russian/German).

jonoalderson commented 6 years ago

I'm not averse to #3, as it also allows for close synonyms, and we can usually assume Google is consolidating these into single concepts.

nataliashitova commented 6 years ago

@jono-alderson I agree. But with that option we would have to rely on the user's knowledge of what Google does and does not do, which also differs massively between languages. Maybe #3 will be good for synonyms alone, but I would not want to rely on the user's ability to build all possible word forms.

jonoalderson commented 6 years ago

Makes sense. Ok, so #3 is a nice-to-have for additional flavour, but the core solution really needs a silent/invisible process (which means #1 or #2).

hansjovis commented 6 years ago

This is closely related to the subject of stemming in Information Retrieval. Stemming is the process of reducing a word to its 'stem', which is similar to the concept of an infinitive form of a word. These stems are not necessarily words in themselves, and many stemming algorithms tend to be pretty blunt in this sense.

Examples:

| Word | Stem |
| --- | --- |
| conspiring | conspir |
| conspicuous | conspicu |
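The two examples above can be reproduced with a very blunt suffix stripper. This is only a sketch to show the idea: `naiveStem`, the suffix list, and the minimum stem length are all assumptions made for the illustration, and it is not one of the established stemming algorithms from the literature.

```javascript
// Suffixes to strip, checked in this order. Hypothetical and far from complete.
const SUFFIXES = ["ing", "ous", "ed", "es", "s"];
// Do not strip if fewer than this many letters would remain.
const MIN_STEM_LENGTH = 3;

// Remove the first matching suffix, if enough of the word is left over.
function naiveStem(word) {
  const lower = word.toLowerCase();
  for (const suffix of SUFFIXES) {
    const remaining = lower.length - suffix.length;
    if (lower.endsWith(suffix) && remaining >= MIN_STEM_LENGTH) {
      return lower.slice(0, remaining);
    }
  }
  return lower;
}

console.log(naiveStem("conspiring"));  // conspir
console.log(naiveStem("conspicuous")); // conspicu
console.log(naiveStem("texts"));       // text
```

Note that neither "conspir" nor "conspicu" is a real word, which is fine as long as the same function is applied to both the keyword and the text.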

Search engines do this on documents at index time as well as on queries during search. This reduces search time and memory usage, since the search engine can focus only on the stem. (Stemming can reduce index file sizes by up to 50%.)

I think we can definitely find some inspiration in these approaches. It could have these advantages:

  1. It increases the efficiency of the code, since it would only have to focus on the stem, instead of all the different inflections of a word.
  2. The number of exceptions would be reduced, since many of the exceptions seem to have the same stem (see the morphology data).
  3. It would be more similar to how search engines tackle it, although this should not be a goal in itself, I think.

However, stemming does significantly increase the risk of false positives.
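A small sketch of what such a false positive looks like when matching by stem. The stemmer here is deliberately crude, and both `crudeStem` and `countStemMatches` are hypothetical names invented for this example.

```javascript
// Deliberately crude stemmer (an assumption for this sketch): strips a
// trailing "ing" or "s" from words longer than three letters.
function crudeStem(word) {
  const lower = word.toLowerCase();
  return lower.length > 3 ? lower.replace(/(ing|s)$/, "") : lower;
}

// Count the tokens whose stem equals the stem of the keyword.
function countStemMatches(text, keyword) {
  const target = crudeStem(keyword);
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  return words.filter((word) => crudeStem(word) === target).length;
}

// Intended behaviour: "texts" is counted for the keyword "text".
console.log(countStemMatches("Short texts beat a long text.", "text")); // 2

// False positive: "news" is reduced to "new", so it is counted for "new".
console.log(countStemMatches("The news covered a new phone.", "new")); // 2
```

The second call reports two matches although the keyword "new" occurs only once, because the blunt rule treats "news" as a plural.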

All in all, there is a lot of literature on the subject, some of it dating back to the 1960s, and a variety of ways to tackle it, as @nataliashitova rightfully mentioned above.