Open nataliashitova opened 6 years ago
I'm not averse to #3, as that also allows for close synonyms - and we can usually assume Google is consolidating these into single concepts.
@jono-alderson I agree. But with that option we would have to rely on the user's knowledge of what Google does and does not do, which also differs massively between languages. Maybe #3 will be good for synonyms alone, but I would not want to rely on the user's ability to build all possible word forms.
Makes sense. Ok, so #3 is a nice-to-have for additional flavour, but the core solution really needs a silent/invisible process (which means #1 or #2).
This is closely related to the subject of stemming in Information Retrieval: the process of reducing a word to its 'stem', which is similar to the concept of the infinitive form of a word. These stems are not necessarily words in themselves, and many stemming algorithms tend to be pretty blunt in this sense.
Examples:
Word | Stem |
---|---|
conspiring | conspir |
conspicuous | conspicu |
Search engines do this on documents at index time as well as on queries during search. This reduces search time and memory usage, since the search engine can focus only on the stem. (Stemming can reduce index file sizes by up to 50%.)
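To make the idea concrete, here is a minimal sketch of stemming at index time and at query time. The names `naiveStem`, `buildIndex` and `search` and the suffix list are made up for illustration; this is a deliberately blunt suffix stripper, not the Porter algorithm and not what any search engine actually ships.

```javascript
// Hypothetical suffix list, chosen only for this illustration.
const SUFFIXES = ["ing", "ous", "ed", "es", "s"];

// Strip the first matching suffix; the result is not necessarily a real word.
function naiveStem(word) {
  const lower = word.toLowerCase();
  for (const suffix of SUFFIXES) {
    // Keep at least three characters so very short words are left alone.
    if (lower.endsWith(suffix) && lower.length - suffix.length >= 3) {
      return lower.slice(0, -suffix.length);
    }
  }
  return lower;
}

// "Index time": map each stem to the word positions where it occurs.
function buildIndex(text) {
  const index = new Map();
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  words.forEach((word, position) => {
    const stem = naiveStem(word);
    if (!index.has(stem)) {
      index.set(stem, []);
    }
    index.get(stem).push(position);
  });
  return index;
}

// "Query time": stem the query the same way, then look up the stem.
function search(index, query) {
  return index.get(naiveStem(query)) || [];
}
```

With this sketch, `naiveStem("conspiring")` gives `conspir` and `naiveStem("conspicuous")` gives `conspicu`, as in the table. It also shows how blunt such rules are: `naiveStem("news")` gives `new`, so two unrelated words end up under the same stem.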
I think we can definitely find some inspiration in these approaches. It could have these advantages:
This does significantly increase the risk of false positives.
All in all there is a lot of literature on the subject, some dating back to the 1960s, and a variety of different ways to tackle it, as @nataliashitova rightfully mentioned above.
We could benefit from a new function that generates all possible word forms for a word. Right now word forms are not lemmatized within YoastSEO.js. This means that if the user performs, for instance, keyword research, only exact matches of the keyword (case-insensitive) will be obtained. The same holds for prominent words/internal linking suggestions.
That is wrong and we should be able to recognize word forms in the text. At least just as well as Google does.
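To illustrate the limitation, here is a small sketch of exact, case-insensitive matching. `escapeRegExp` and `countExactMatches` are hypothetical helpers written for this comment; they are not the actual YoastSEO.js matcher.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Count whole-word, case-insensitive occurrences of the keyword.
// Note: \b only knows ASCII word characters, so this sketch is English-only.
function countExactMatches(text, keyword) {
  const pattern = new RegExp("\\b" + escapeRegExp(keyword) + "\\b", "gi");
  return (text.match(pattern) || []).length;
}

const sampleText = "I walk daily. She walks too, and we walked yesterday.";
console.log(countExactMatches(sampleText, "walk")); // prints 1
```

Only the literal `walk` is counted; `walks` and `walked` are missed, even though a reader (and presumably Google) would treat all three as the same keyword.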
We should add morphological analysis for the languages that are morphologically supported by Google (English, German, Russian, Dutch to begin with).
There are a few ways to achieve that:
1. Ship a dictionary with all word forms to the user (based on the language (s)he writes in). This solution requires architectural changes (e.g., the ability to ship components on demand and store them locally, but outside the browser). It also appears challenging to obtain open-source morphological dictionaries for all languages we are interested in. Otherwise, this method will probably generate the most accurate result for all languages.
2. Build all word forms within the plugin for the given keyword only. This solution will not require architectural changes, only some development within the plugin code (e.g., regexes for regular forms, word lists for irregulars). I expect that it will work well for languages with rather regular and poor morphology (English, Dutch), but it might be challenging (though not impossible) to achieve for languages with somewhat more irregular and richer morphology (e.g., Russian). There are multiple projects on Git which we could use to help development, predominantly MIT licensed (references will be added).
3. Ask the user to fill in all the possible forms for a given keyword. This solution is probably the easiest to implement, but (1) it is error-prone; (2) it will only work for keywords, not for internal linking suggestions. The accuracy of the method is difficult to estimate: it would depend on how aware the user is of morphology and how well (s)he can conjugate/inflect words of his/her native language.
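As a rough sketch of the option of building word forms within the plugin (regexes for regular forms, a word list for irregulars), something like the following could work for single English words. The function names, the rules and the two-entry irregular list are assumptions made for illustration only; a real implementation would need far more rules and data.

```javascript
// Hypothetical, tiny list of irregular forms; a real list would be much longer.
const IRREGULAR_FORMS = new Map([
  ["child", ["child", "children"]],
  ["go", ["go", "goes", "going", "went", "gone"]],
]);

// Build candidate forms for one English word from a few regular patterns.
function buildWordForms(word) {
  const base = word.toLowerCase();
  if (IRREGULAR_FORMS.has(base)) {
    return [...IRREGULAR_FORMS.get(base)];
  }
  const forms = new Set([base]);
  if (/[^aeiou]y$/.test(base)) {
    // try -> tries, tried, trying
    const stem = base.slice(0, -1);
    forms.add(stem + "ies").add(stem + "ied").add(base + "ing");
  } else if (/(s|x|z|ch|sh)$/.test(base)) {
    // watch -> watches, watched, watching
    forms.add(base + "es").add(base + "ed").add(base + "ing");
  } else if (/e$/.test(base)) {
    // bake -> bakes, baked, baking
    forms.add(base + "s").add(base + "d").add(base.slice(0, -1) + "ing");
  } else {
    // walk -> walks, walked, walking
    forms.add(base + "s").add(base + "ed").add(base + "ing");
  }
  return [...forms];
}

// Count words in the text that equal any generated form of the keyword.
function countFormMatches(text, keyword) {
  const forms = new Set(buildWordForms(keyword));
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  return words.filter((candidate) => forms.has(candidate)).length;
}
```

For the text "I walk daily. She walks too, and we walked yesterday." and the keyword `walk`, `countFormMatches` finds three occurrences. The sketch ignores consonant doubling (`stop` would yield `stoped`, `stoping`) and multi-word keyphrases, and it blindly generates verb-like forms for nouns, which gives a feel for how many rules and exceptions even English would need.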
Provisional conclusion: It might be meaningful to try implementing different methods and see which one works best. A realistic prospect is that we will need different methods for different languages (i.e., regexes for English/Dutch; dictionaries for Russian/German).
in progress
Refactor keyword-based assessments to accommodate morphology #1558
`keyword` are changed to `keyphrase` in YoastSEO.js (team plugin will make sure the same happens in Free, Premium & Components)