MihaiValentin / lunr-languages

A collection of language stemmers and stopwords for the Lunr JavaScript library

Support for position metadata in Japanese #38

Closed danjarvis closed 7 years ago

danjarvis commented 7 years ago

Starting with Lunr 2.0.0 you can opt in to use additional search result metadata. If you take a look at the latest demo app you can see how to leverage the new match position metadata.
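For readers unfamiliar with the feature, here is a minimal sketch of consuming position metadata. It assumes the Lunr 2.x result shape, where each result carries `matchData.metadata[term][field].position` as an array of `[start, length]` pairs once `position` has been whitelisted on the builder. The helper names `collectPositions` and `highlight` are hypothetical and are not part of Lunr.

```javascript
// Opting in happens when building the index (Lunr 2.x), for example:
//   lunr(function () { this.metadataWhitelist = ['position']; ... })
//
// Assumed result shape: result.matchData.metadata[term][field].position
// is an array of [start, length] pairs. Both helpers below are
// hypothetical illustrations, not Lunr functions.

// Gather every [start, length] pair recorded for one field of a result.
function collectPositions(result, field) {
  const metadata = (result.matchData && result.matchData.metadata) || {};
  const positions = [];
  for (const term of Object.keys(metadata)) {
    const fieldData = metadata[term][field];
    // A tokenizer that records no positions leaves this null or missing.
    if (fieldData && Array.isArray(fieldData.position)) {
      positions.push(...fieldData.position);
    }
  }
  return positions.sort((a, b) => a[0] - b[0]);
}

// Wrap each matched slice of the original text in <mark> tags.
function highlight(text, positions) {
  let out = '';
  let cursor = 0;
  for (const [start, length] of positions) {
    if (start < cursor) continue; // skip overlapping matches
    out += text.slice(cursor, start);
    out += '<mark>' + text.slice(start, start + length) + '</mark>';
    cursor = start + length;
  }
  return out + text.slice(cursor);
}
```

When a tokenizer records no positions, `collectPositions` returns an empty array and `highlight` returns the text unchanged.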

I am using this feature for English search; however, the position metadata is returned as null when searching against the Japanese index. I haven't had a chance to look into the details and try to fix it myself, so I am submitting this issue here for visibility. I'll update it when I have more information.

Thank you!

olivernn commented 7 years ago

I think the problem is that the Japanese language plugin must provide its own tokenizer, due to the extra work required for correctly tokenising Japanese text. This means it isn't recording the same metadata that the built-in tokeniser does.

Here is how it is done in Lunr: https://github.com/olivernn/lunr.js/blob/master/lib/tokenizer.js#L41-L46
And here is the tokeniser in the Japanese adaptor: https://github.com/MihaiValentin/lunr-languages/blob/master/lunr.ja.js#L105

So you would have to work out how to get positions out of the tool being used to segment the Japanese text. I'm not familiar with that library, so I can't say how easy that would be, but adding them to the index is just a matter of passing the details to the lunr.Token constructor.
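As a rough sketch of that idea, and not the plugin's actual code: assuming the segmenter returns substrings of the input in order, each segment's start offset can be recovered by scanning forward from the end of the previous one. The function `segmentsToTokens` and its `makeToken` parameter are hypothetical; with Lunr loaded, `makeToken` could be `(str, meta) => new lunr.Token(str, meta)`.

```javascript
// Hypothetical helper: turn an ordered list of segments into tokens that
// carry the same metadata keys the built-in tokenizer records
// ("position" as [start, length] and "index").
// Assumption: every segment appears verbatim in `text`, in order.
function segmentsToTokens(text, segments, makeToken) {
  // Default factory returns plain objects so the sketch runs without Lunr.
  const create = makeToken || ((str, metadata) => ({ str, metadata }));
  const tokens = [];
  let cursor = 0;

  for (const segment of segments) {
    const start = text.indexOf(segment, cursor);
    if (start === -1) continue;           // segment was altered; cannot locate it
    cursor = start + segment.length;
    if (segment.trim() === '') continue;  // ignore whitespace-only segments

    tokens.push(create(segment, {
      position: [start, segment.length],
      index: tokens.length
    }));
  }
  return tokens;
}

// Example with a hand-written segmentation (a real segmenter may differ).
const tokens = segmentsToTokens('私は 猫です', ['私', 'は', '猫', 'です']);
console.log(tokens.map(t => [t.str, t.metadata.position]));
```

If the plugin normalises the text before segmenting (for example by lowercasing or trimming), the offsets would need to be computed against the same string that is later displayed; that detail is left open here.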

danjarvis commented 7 years ago

Hey @olivernn!

I was just looking through the JA plugin for lunr-languages, and after comparing it with the base tokenizer code for EN, it seems pretty straightforward to attach the position metadata for Japanese. I'm going to try to work through it tomorrow and I'll update this issue accordingly.

Thank you for all your work on lunr!