blugelabs / bluge

indexing library for Go
Apache License 2.0

Where can I find the definition of TokenType? #108

Closed zhshch2002 closed 2 years ago

zhshch2002 commented 2 years ago

I want to write an analyzer, and I would like to know the definition of TokenType. Is this some specification, or is it defined by this project?

const (
    AlphaNumeric TokenType = iota
    Ideographic
    Numeric
    DateTime
    Shingle
    Single
    Double
    Boolean
)

from https://github.com/blugelabs/bluge/blob/6208d09eaf0ea1cc821a6d848b7b9ac1977729dc/analysis/type.go#L27

btw, I think this is an amazing project

mschoch commented 2 years ago

Welcome, and thanks for the kind words.

So, to be honest, this part of the analysis pipeline isn't really used for much. The original thinking was that when tokenizing text, it is occasionally helpful for some downstream component to know the "type" of a token. But in practice this is used for very few things. One example I could find is the CJK bi-gram filter. It outputs pairs of characters, but needs to handle the case where ideographic tokens are mixed with alphanumeric ones: https://github.com/blugelabs/bluge/blob/master/analysis/lang/cjk/cjk_bigram.go#L42
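To make that concrete, a custom token filter that cares about the type would look roughly like this. This is just a sketch (the filter itself is made up for illustration), assuming the Token, TokenStream, and TokenFilter shapes from analysis/type.go:

import "github.com/blugelabs/bluge/analysis"

// markIdeographicFilter is an illustrative filter that tags the term of every
// ideographic token and leaves alphanumeric tokens untouched, similar in
// spirit to how the CJK bi-gram filter branches on token type.
type markIdeographicFilter struct{}

func (f markIdeographicFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
    for _, token := range input {
        if token.Type == analysis.Ideographic {
            token.Term = append(token.Term, []byte("_ideo")...)
        }
    }
    return input
}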

You can see how the regular expression tokenizer does this here: https://github.com/blugelabs/bluge/blob/master/analysis/tokenizer/regexp.go#L55

But you can also see that this behavior is just ad hoc: the unicode tokenizer will produce ideographic tokens, but the regular expression tokenizer never does.
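If the type does matter for your analyzer, your own tokenizer can assign it when emitting tokens. A rough sketch (the one-token-per-rune approach is purely illustrative, and the field names follow my reading of analysis/type.go, so double-check against that file):

import (
    "unicode"
    "unicode/utf8"

    "github.com/blugelabs/bluge/analysis"
)

// hanRuneTokenizer is an illustrative tokenizer that emits one token per rune,
// marking Han characters as Ideographic and everything else as AlphaNumeric.
type hanRuneTokenizer struct{}

func (t hanRuneTokenizer) Tokenize(input []byte) analysis.TokenStream {
    var stream analysis.TokenStream
    for i := 0; i < len(input); {
        r, size := utf8.DecodeRune(input[i:])
        tokenType := analysis.AlphaNumeric
        if unicode.Is(unicode.Han, r) {
            tokenType = analysis.Ideographic
        }
        stream = append(stream, &analysis.Token{
            Term:         input[i : i+size],
            Start:        i,
            End:          i + size,
            PositionIncr: 1,
            Type:         tokenType,
        })
        i += size
    }
    return stream
}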

Since you mentioned you are building your own analyzer, I don't think it is that important. But if you think this is important to your analyzer, and you have ideas to improve it, we should discuss it further.

zhshch2002 commented 2 years ago

Thank you for your reply, I get it now.

I am planning to build a lightweight Chinese search component, which is why I am curious about this. I am still working out the feasibility and the technology stack. This project gives me a lot of confidence and allows me to make great progress.

I'll close this issue for now and come back if there are any new questions.