baygeldin / tantiny

Tiny full-text search for Ruby powered by Tantivy
MIT License

Custom tokenizer #17

Open morygonzalez opened 2 years ago

morygonzalez commented 2 years ago

I want to use Tantiny with Japanese text. There are several Tantivy tokenizers for the Japanese language. I'm currently considering lindera-tantivy, which supports not only Japanese but also Chinese and Korean. Is it possible to use such custom tokenizers with Tantivy via Tantiny?

baygeldin commented 2 years ago

Hey @morygonzalez, currently Tantiny does not support custom tokenizers. I had some ideas about how to implement it, but it's a complex issue to tackle because it requires extending behaviour at runtime, which is not easy to do in Rust (let alone its interaction with Ruby).

However, it seems that lindera is quite a useful project, and it might make sense to just add a new tokenizer type to Tantiny that uses it. This is much easier than dealing with custom tokenizers. What do you think?
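
To illustrate the difference between the two approaches, here is a minimal, purely hypothetical Ruby sketch. It is not Tantiny's actual code or API: the `TokenizerSketch` module, the `BUILT_IN` table, and the `:cjk_chars` type are names invented for this example. The idea is that a fixed set of tokenizer types is selected by a symbol, so supporting a new language means adding one more entry to the table rather than accepting arbitrary user code at runtime.

```ruby
# Hypothetical sketch, not Tantiny's real implementation.
module TokenizerSketch
  # Closed set of tokenizer types, fixed when the library is built.
  BUILT_IN = {
    # Lowercases the text and splits it on non-alphanumeric characters.
    simple: ->(text) { text.downcase.scan(/[[:alnum:]]+/) },
    # Stand-in for a CJK-aware tokenizer (assumption: a real one would
    # delegate to a morphological analyzer such as lindera). Here it only
    # emits single CJK characters to keep the sketch self-contained.
    cjk_chars: ->(text) { text.scan(/[\p{Han}\p{Hiragana}\p{Katakana}]/) }
  }.freeze

  # Looks up a tokenizer by type; unknown types are rejected because
  # user-supplied tokenizers are not supported in this model.
  def self.build(type)
    BUILT_IN.fetch(type) do
      raise ArgumentError, "unknown tokenizer type: #{type.inspect}"
    end
  end
end

tokenizer = TokenizerSketch.build(:simple)
tokenizer.call("Hello, World!") # splits into lowercase word terms
```

In this model, a request for a type outside the table fails with an `ArgumentError`, which mirrors the trade-off discussed above: less flexibility than fully custom tokenizers, but no need to extend behaviour at runtime.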

morygonzalez commented 2 years ago

@baygeldin Thank you! That's cool. I'm happy with your suggestion!!

baygeldin commented 2 years ago

Okay, I'll see what I can do, but probably after I deal with aggregations (or you can make a PR yourself if you want).

morygonzalez commented 2 years ago

I see. I'll try to make a pull request, though I'm quite new to Rust, so it'll take some time.