atilika / kuromoji

Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
Apache License 2.0

How to enable discardPunctuation in Kuromoji Java #134

Open yanghanxy opened 3 years ago

yanghanxy commented 3 years ago

Hi,

I can remove punctuation in the Kuromoji-ES plugin by setting "discard_punctuation": "true".

How can I get the same result with Kuromoji-Java?

For example, with that setting Kuromoji-ES tokenizes 「浅草」駅 as

{
  "tokens" : [
    {
      "token" : "浅草",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "駅",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    }
  ]
}

Is there an equivalent option or function in Kuromoji-Java?
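
If the Java tokenizer turns out not to expose an equivalent switch, one possible workaround is to post-filter the tokens yourself. The sketch below is only an illustration under stated assumptions: `PunctuationFilter`, `isPunctuation`, and `discardPunctuation` are hypothetical helper names, not part of Kuromoji, and the choice of Unicode categories that count as punctuation is a guess that may not match what the ES plugin discards. The input list is a hand-written stand-in for the surface forms you would collect by calling `getSurface()` on each token; the real segmentation may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PunctuationFilter {

    // Hypothetical helper: true when the code point falls in one of the
    // Unicode punctuation categories (an assumption about what to discard).
    static boolean isPunctuationCodePoint(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // A surface form counts as punctuation only if it is non-empty and
    // every code point in it is punctuation.
    static boolean isPunctuation(String surface) {
        return !surface.isEmpty()
                && surface.codePoints().allMatch(PunctuationFilter::isPunctuationCodePoint);
    }

    // Returns a new list containing only the non-punctuation surface forms,
    // in their original order.
    static List<String> discardPunctuation(List<String> surfaces) {
        List<String> kept = new ArrayList<>();
        for (String surface : surfaces) {
            if (!isPunctuation(surface)) {
                kept.add(surface);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in for the surface forms of the tokens produced for 「浅草」駅.
        List<String> surfaces = Arrays.asList("「", "浅草", "」", "駅");
        System.out.println(discardPunctuation(surfaces)); // prints [浅草, 駅]
    }
}
```

Filtering on the surface string keeps the sketch independent of any particular dictionary. If you need the start and end offsets as well, you would have to filter the token objects themselves rather than their surface strings.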