zhshch2002 closed this issue 2 years ago
Welcome, and thanks for the kind words.
So, to be honest, this part of the analysis pipeline isn't really used for much. The original thinking was that when tokenizing text, it is occasionally helpful for a downstream component to know what "type" of token it is looking at. But in practice this is used for very few things. One example I could find is the CJK bigram filter. It outputs pairs of characters, but needs to handle the case where ideographic tokens are mixed with alphanumeric ones: https://github.com/blugelabs/bluge/blob/master/analysis/lang/cjk/cjk_bigram.go#L42
You can see how the regular expression tokenizer assigns token types here: https://github.com/blugelabs/bluge/blob/master/analysis/tokenizer/regexp.go#L55
But, you can also see that this behavior is just ad-hoc. The unicode tokenizer will produce ideographic tokens, but the regular expression tokenizer never does.
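If you do want your own tokenizer to set the type consistently, one possible approach is to classify each term by its Unicode script. The following is a minimal sketch under stated assumptions: `TokenType` and its constants are redefined locally so the example runs on its own (they are not imported from bluge), `classify` is a hypothetical helper, and treating Han, Hiragana, Katakana, and Hangul as "ideographic" is my own choice, not a rule taken from the project.

```go
package main

import (
	"fmt"
	"unicode"
)

// TokenType is a local, hypothetical stand-in defined only so this
// sketch is self-contained.
type TokenType int

const (
	AlphaNumeric TokenType = iota
	Ideographic
	Numeric
)

func (t TokenType) String() string {
	switch t {
	case Ideographic:
		return "Ideographic"
	case Numeric:
		return "Numeric"
	default:
		return "AlphaNumeric"
	}
}

// classify returns Ideographic if any rune belongs to a CJK script,
// Numeric if every rune is a digit, and AlphaNumeric otherwise.
// The script list is an assumption of this sketch.
func classify(term string) TokenType {
	if term == "" {
		return AlphaNumeric
	}
	allDigits := true
	for _, r := range term {
		if unicode.In(r, unicode.Han, unicode.Hiragana, unicode.Katakana, unicode.Hangul) {
			return Ideographic
		}
		if !unicode.IsDigit(r) {
			allDigits = false
		}
	}
	if allDigits {
		return Numeric
	}
	return AlphaNumeric
}

func main() {
	for _, term := range []string{"bluge", "2024", "搜索"} {
		fmt.Printf("%s -> %v\n", term, classify(term))
	}
}
```

A tokenizer could call something like `classify` on each term before emitting the token, so that downstream filters see the same types regardless of which tokenizer produced them.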
Since you mentioned you are building your own analyzer, I don't think it is that important. But if you think it matters for your analyzer and you have ideas to improve it, we should discuss it further.
Thank you for your reply, I get it.
I am planning to build a lightweight Chinese search component, which is why I am curious about this. I am currently still working on feasibility and the technology stack. This project gives me a lot of confidence and allows me to make great progress.
I guess I could close this issue for now and I'll come back if there are any new questions.
If I want to write an analyzer, I would like to know the definition of TokenType. Is there a specification for it, or is it defined by this project?
from https://github.com/blugelabs/bluge/blob/6208d09eaf0ea1cc821a6d848b7b9ac1977729dc/analysis/type.go#L27
btw, I think this is an amazing project