Open rikuson opened 6 years ago
The stemmer function for Japanese looks like the following:
lunr.jp.stemmer = (function() {
  /* TODO japanese stemmer */
  return function(word) {
    return word;
  }
})();
Could this be the cause of the empty array?
My work project also requires Japanese search capability, and the results are less than ideal.
For Japanese, lunr behaves like an exact-match search instead of an index search.
I found that use(lunr.multiLanguage('ja')); is worse than use(lunr.ja).
Here is a slightly modified version of @rikuson's example: https://jsbin.com/hanacul/edit?html,output
It seems like the Japanese segmenter is not running on the search string. If you search for more than one Japanese word without a space between the words (which is how Japanese is normally written, right?), no results are returned.
To clarify, when you index the sentence, この文章は日本語で書かれています, the segmenter breaks it down into: い て ます れ 文章 日本語 書か
But if you search "書かれ" without a space between the phrases, nothing is found.
This happens because Japanese implements a custom tokenizer in addition to the standard pipeline functions. The multilang plugin only adds the pipeline functions for each language and doesn't touch the tokenizer. The multilang plugin should be modified to compose all of the tokenizers in the language list.
For the record this is how to work around the issue:
var lunr = require('./lib/lunr.js');
require('./lunr.stemmer.support.js')(lunr);
require('./lunr.ru.js')(lunr);
require('./lunr.multi.js')(lunr);
var idx = lunr(function () {
  // the reason "en" does not appear above is that "en" is built in into lunr js
  this.use(lunr.multiLanguage('en', 'ru'));
  // Compose the japanese tokenizer with the built-in tokenizer
  this.tokenizer = function(x) {
    lunr.ja.tokenizer(x).concat(lunr.tokenizer(x));
  };
  // then, the normal lunr index initialization
  // ...
});
I've attempted this fix, but it fails. It's unclear why the example uses 'ru', where I would think we want 'ja' throughout. It failed when I tried using 'ru' as indicated; it also fails when using 'ja' where it would seem more appropriate.
This is the error I get when generating the index:
/Users/simonfbate/node_modules/lunr/lunr.js:673
for (var j = 0; j < tokens.length; j++) {
^
TypeError: Cannot read properties of undefined (reading 'length')
at lunr.Pipeline.run (/Users/simonfbate/node_modules/lunr/lunr.js:673:32)
at lunr.Builder.add (/Users/simonfbate/node_modules/lunr/lunr.js:2482:31)
at lunr.Builder.
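A minimal sketch of why a trace like this can appear, assuming the workaround was copied as written: a tokenizer function that has no return statement evaluates to undefined, and any later read of tokens.length then throws. The names tokenizerA, tokenizerB, brokenTokenizer, fixedTokenizer and countTokens are hypothetical stand-ins, not lunr's real functions.

```javascript
// Hypothetical stand-in tokenizers (NOT lunr.tokenizer / lunr.ja.tokenizer).
function tokenizerA(text) {
  return text.split(' ');
}
function tokenizerB(text) {
  return [text];
}

// Same shape as the workaround: the combined array is built but never returned,
// so calling this function yields undefined.
function brokenTokenizer(x) {
  tokenizerA(x).concat(tokenizerB(x));
}

// Adding "return" hands the combined array back to the caller.
function fixedTokenizer(x) {
  return tokenizerA(x).concat(tokenizerB(x));
}

// Simplified stand-in for a consumer that reads tokens.length.
// It throws a TypeError when the tokenizer returned undefined.
function countTokens(tokenizer, text) {
  var tokens = tokenizer(text);
  return tokens.length;
}
```

Calling countTokens(brokenTokenizer, 'a b') throws a TypeError, while countTokens(fixedTokenizer, 'a b') returns a number.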
@simonbate Sorry, the ru in the example is probably a typo. I think I also forgot to add a return statement to the function (I normally write Clojure and Ruby, which don't need returns, so that often trips me up).
var lunr = require('./lib/lunr.js');
require('./lunr.stemmer.support.js')(lunr);
require('./lunr.ja.js')(lunr);
require('./lunr.multi.js')(lunr);
var idx = lunr(function () {
  // "en" is not required above because it is built into lunr.js
  this.use(lunr.multiLanguage('en', 'ja'));
  // Compose the Japanese tokenizer with the built-in tokenizer
  this.tokenizer = function(x) {
    return lunr.tokenizer(x).concat(lunr.ja.tokenizer(x));
  };
  // then, the normal lunr index initialization
  // ...
});
THANK YOU! Works great now.
Simon
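The suggestion that the multilang plugin should compose all of the tokenizers could be sketched roughly as below. composeTokenizers is a hypothetical helper, not part of lunr or lunr-languages. It assumes every tokenizer takes the same input and returns an array, and it treats a missing result as an empty list so that a forgotten return cannot crash the pipeline.

```javascript
// Hypothetical helper (not a lunr API): build one tokenizer out of several.
function composeTokenizers() {
  var tokenizers = Array.prototype.slice.call(arguments);
  return function (input) {
    // Run every tokenizer on the same input and concatenate the results in order.
    return tokenizers.reduce(function (tokens, tokenize) {
      // "|| []" guards against a tokenizer that returns undefined.
      return tokens.concat(tokenize(input) || []);
    }, []);
  };
}
```

Inside the lunr configuration function from the thread, this would be used as this.tokenizer = composeTokenizers(lunr.tokenizer, lunr.ja.tokenizer);, which behaves like the hand-written version for two tokenizers.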
Hi, I think the "multiLanguage" method doesn't work with Japanese.
The console showed this.
I guess "5" should return Array [ {…} ].
I've tried the demo and it worked. The console showed this.
Thank you.