MihaiValentin / lunr-languages

A collection of languages stemmers and stopwords for Lunr Javascript library

"multiLanguage" method doesn't work with Japanese. #45

Open rikuson opened 6 years ago

rikuson commented 6 years ago

Hi, I think the "multiLanguage" method doesn't work with Japanese. Here is a test page:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8" />
    <title>Lunr multi-language demo</title>
    <script src="./src/lunr/lunr.js"></script>
    <script src="./src/lunr-languages/lunr.stemmer.support.js"></script>
    <script src="./src/lunr-languages/tinyseg.js"></script>
    <script src="./src/lunr-languages/lunr.ja.js"></script>
    <script src="./src/lunr-languages/lunr.multi.js"></script>
  </head>
  <body>
    <p>Open developer tools and observe the results in the Console tab. View source for code.</p>
    <script>
      /* init lunr */
      var idxEn = lunr(function () {
        this.field('body')
        this.add({"body": "この文章は日本語で書かれています。", "id": 1})
        this.add({"body": "This text is written in the English language.", "id": 2})
      });
      var idxJp = lunr(function () {
        this.use(lunr.ja);
        this.field('body')
        this.add({"body": "この文章は日本語で書かれています。", "id": 1})
        this.add({"body": "This text is written in the English language.", "id": 2})
      });
      var idxMulti = lunr(function () {
        this.use(lunr.multiLanguage('en', 'ja'));
        this.field('body')
        this.add({"body": "この文章は日本語で書かれています。", "id": 1})
        this.add({"body": "This text is written in the English language.", "id": 2})
      });
      console.log('Search for `日本語` (English pipeline): ', idxEn.search('日本語'));
      console.log('Search for `languages` (English pipeline): ', idxEn.search('languages'));
      console.log('Search for `日本語` (Japanese pipeline): ', idxJp.search('日本語'));
      console.log('Search for `languages` (Japanese pipeline): ', idxJp.search('languages'));
      console.log('Search for `日本語` (Jp + En pipeline): ', idxMulti.search('日本語'));
      console.log('Search for `languages` (Jp + En pipeline): ', idxMulti.search('languages'));
    </script>
  </body>
</html>

The console showed this:

1 Search for 日本語 (English pipeline): Array []
2 Search for languages (English pipeline): Array [ {…} ]
3 Search for 日本語 (Japanese pipeline): Array [ {…} ]
4 Search for languages (Japanese pipeline): Array []
5 Search for 日本語 (Jp + En pipeline): Array []
6 Search for languages (Jp + En pipeline): Array [ {…} ]

I would expect "5" to return Array [ {…} ].

I also tried the demo, and it worked. The console showed this:

1 Search for Русских (English pipeline): Array []
2 Search for languages (English pipeline): Array [ {…} ]
3 Search for Русских (Russian pipeline): Array [ {…} ]
4 Search for languages (Russian pipeline): Array []
5 Search for Русских (Ru + En pipeline): Array [ {…} ]
6 Search for languages (Ru + En pipeline): Array [ {…} ]

Thank you.

railsstudent commented 5 years ago

The stemmer function for Japanese looks like the following:

    lunr.jp.stemmer = (function() {

        /* TODO japanese stemmer  */
        return function(word) {
            return word;
        }
    })();

Could this be the cause of the empty array?

My work project also requires Japanese search capability, and the results are less than ideal.

railsstudent commented 5 years ago

For Japanese, lunr behaves like an exact-match search instead of an index search.

skoji commented 4 years ago

I found that use(lunr.multiLanguage('ja')); is worse than use(lunr.ja)

Here is a slightly modified version of @rikuson's example: https://jsbin.com/hanacul/edit?html,output

biosocket commented 4 years ago

It seems like the Japanese segmenter is not running on the search string. If you search for more than one Japanese word without a space between the words (which is how the Japanese write, right?), no results are returned.

To clarify, when you index the sentence, この文章は日本語で書かれています, the segmenter breaks it down into: い て ます れ 文章 日本語 書か

But if you search "書かれ" without a space between the phrases, nothing is found.
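
A minimal sketch of why the unsegmented query finds nothing, assuming (as described above) that the query string is split only on whitespace while the indexed text was segmented. The `segment` helper is a hypothetical longest-match stand-in for the real segmenter, not part of lunr or lunr-languages, and the token set is the one listed above.

```javascript
// Tokens produced at index time for この文章は日本語で書かれています (from the list above).
var indexedTokens = new Set(['い', 'て', 'ます', 'れ', '文章', '日本語', '書か']);

// Whitespace-only splitting, standing in for a query path that skips the segmenter.
function splitOnWhitespace(query) {
  return query.split(/\s+/).filter(Boolean);
}

// Hypothetical stand-in for a segmenter: greedy longest match against known tokens.
// Characters that match no known token are emitted one at a time.
function segment(query, vocabulary) {
  var tokens = [];
  var i = 0;
  while (i < query.length) {
    var match = query[i];
    for (var end = query.length; end > i + 1; end--) {
      var candidate = query.slice(i, end);
      if (vocabulary.has(candidate)) { match = candidate; break; }
    }
    tokens.push(match);
    i += match.length;
  }
  return tokens;
}

// A term counts as a hit only if it is exactly one of the indexed tokens.
function hits(terms) {
  return terms.filter(function (t) { return indexedTokens.has(t); });
}

var unsegmented = hits(splitOnWhitespace('書かれ'));     // query stays one term
var segmented = hits(segment('書かれ', indexedTokens)); // query is split first

console.log(unsegmented);
console.log(segmented);
```

In this sketch the whole string 書かれ is never an indexed token, so only the segmented query can match.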

knubie commented 3 years ago

This is because Japanese implements a custom tokenizer in addition to the standard pipeline functions. The multilang plugin only adds the pipeline functions for each language and doesn't touch the tokenizer. The multilang plugin should be modified to compose all of the tokenizers in the language list.

For the record this is how to work around the issue:

var lunr = require('./lib/lunr.js');
require('./lunr.stemmer.support.js')(lunr);
require('./lunr.ru.js')(lunr);
require('./lunr.multi.js')(lunr);

var idx = lunr(function () {
  // the reason "en" does not appear above is that "en" is built in into lunr js
  this.use(lunr.multiLanguage('en', 'ru'));
  // Compose the japanese tokenizer with the built-in tokenizer
  this.tokenizer = function(x) {
    lunr.ja.tokenizer(x).concat(lunr.tokenizer(x));
  };
  // then, the normal lunr index initialization
  // ...
});

simonbate commented 2 years ago

I've attempted this fix, but it fails. It's unclear why the example uses 'ru', where I would think we want 'ja' throughout. It failed when I tried using 'ru' as indicated; it also fails when using 'ja' where it would seem more appropriate.

This is the error I get when generating the index:

/Users/simonfbate/node_modules/lunr/lunr.js:673
    for (var j = 0; j < tokens.length; j++) {
                               ^

TypeError: Cannot read properties of undefined (reading 'length')
    at lunr.Pipeline.run (/Users/simonfbate/node_modules/lunr/lunr.js:673:32)
    at lunr.Builder.add (/Users/simonfbate/node_modules/lunr/lunr.js:2482:31)
    at lunr.Builder.<anonymous> (/Users/simonfbate/Documents/!SimonWork/DITA-OT/out/XXX_jp/ui/js/build_index.js:53:12)
    at Array.forEach (<anonymous>)
    at lunr.Builder.<anonymous> (/Users/simonfbate/Documents/!SimonWork/DITA-OT/out/XXX_jp/ui/js/build_index.js:52:15)
    at lunr (/Users/simonfbate/node_modules/lunr/lunr.js:53:10)
    at Socket.<anonymous> (/Users/simonfbate/Documents/!SimonWork/DITA-OT/out/XXX_jp/ui/js/build_index.js:36:13)
    at Socket.emit (node:events:532:35)
    at endReadableNT (node:internal/streams/readable:1346:12)
    at processTicksAndRejections (node:internal/process/task_queues:83:21)

knubie commented 2 years ago

@simonbate Sorry, the ru in the example is probably a typo. I think I also forgot to add a return statement to the function (I normally write Clojure and Ruby, which don't need explicit returns, so that often trips me up).
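
The effect of the missing return can be reproduced without lunr. In this sketch, `baseTokenizer` and `runPipeline` are hypothetical stand-ins: `runPipeline` reads `tokens.length` like the loop shown in the stack trace, but it is not lunr's actual code.

```javascript
// Hypothetical stand-in tokenizer: splits on whitespace.
function baseTokenizer(text) {
  return text.split(/\s+/).filter(Boolean);
}

// The body computes a value but never returns it, so callers receive undefined.
var withoutReturn = function (x) {
  baseTokenizer(x).concat(baseTokenizer(x));
};

// Same body with the return statement added.
var withReturn = function (x) {
  return baseTokenizer(x).concat(baseTokenizer(x));
};

// Hypothetical stand-in for a pipeline step that iterates over the tokens.
function runPipeline(tokens) {
  var out = [];
  for (var j = 0; j < tokens.length; j++) {
    out.push(tokens[j].toLowerCase());
  }
  return out;
}

var failure = null;
try {
  runPipeline(withoutReturn('Hello world'));
} catch (e) {
  failure = e; // reading .length of undefined throws a TypeError
}

console.log(failure instanceof TypeError);
console.log(runPipeline(withReturn('Hello world')));
```

The corrected version of the workaround: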

var lunr = require('./lib/lunr.js');
require('./lunr.stemmer.support.js')(lunr);
require('./lunr.ja.js')(lunr);
require('./lunr.multi.js')(lunr);

var idx = lunr(function () {
  // "en" is not required above because it is built into lunr.js
  this.use(lunr.multiLanguage('en', 'ja'));
  // Compose the japanese tokenizer with the built-in tokenizer
  this.tokenizer = function(x) {
    return lunr.tokenizer(x).concat(lunr.ja.tokenizer(x));
  };
  // then, the normal lunr index initialization
  // ...
});
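
The same idea extends to any number of languages. A minimal sketch, assuming each tokenizer is a function from a string to an array of tokens (plain strings here, for brevity); `composeTokenizers` and the two stand-in tokenizers are hypothetical names for illustration, not part of lunr or lunr-languages.

```javascript
// Build one tokenizer that concatenates the output of several tokenizers.
function composeTokenizers(tokenizers) {
  return function (text) {
    return tokenizers.reduce(function (all, tokenize) {
      return all.concat(tokenize(text));
    }, []);
  };
}

// Hypothetical stand-in for a whitespace-based tokenizer.
function whitespaceTokenizer(text) {
  return text.split(/\s+/).filter(Boolean);
}

// Hypothetical stand-in for a script-aware tokenizer: emits runs of kanji,
// hiragana, or katakana and ignores everything else.
function kanaKanjiRunTokenizer(text) {
  return text.match(/[\u4e00-\u9fff]+|[\u3040-\u309f]+|[\u30a0-\u30ff]+/g) || [];
}

var tokenizer = composeTokenizers([whitespaceTokenizer, kanaKanjiRunTokenizer]);
console.log(tokenizer('English 日本語で'));
```

Concatenation can emit the same token more than once when two tokenizers agree on it; whether that matters depends on how the index counts term frequency.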

simonbate commented 2 years ago

THANK YOU! Works great now.

Simon