Stratio / cassandra-lucene-index

Lucene based secondary indexes for Cassandra
Apache License 2.0

complex analyzer builder analyzer without requiring a custom jar addition #306

Open rmannibucau opened 7 years ago

rmannibucau commented 7 years ago

The idea is to allow more advanced, open-ended analyzer configuration through the JSON payload passed when creating the index/analyzers. Here is a sample:

{
  "type": "complex",
  "tokenizer": {
    "class": "ngram",
    "parameters": [
      "1",
      "2"
    ]
  },
  "token_streams": [
    {
      "class": "stop",
      "parameters": [
        null,
        "a,an,and,are,as,at,be,but,by,for,if,in,into,is,it,no,not,of,on,or,such,that,the,their,then,there,these,they,this,to,was,will,with"
      ]
    },
    {
      "class": "org.apache.lucene.analysis.core.LowerCaseFilter",
      "parameters": [
        null
      ]
    },
    {
      "class": "org.apache.lucene.analysis.standard.StandardFilter",
      "parameters": [
        null
      ]
    }
  ]
}

And here is the equivalent Java code:

public class MyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(final String field) {
        final Tokenizer source = new NGramTokenizer(1, 2);
        final TokenStream result = new StopFilter(
                new LowerCaseFilter(new StandardFilter(source)),
                new CharArraySet(asList(/*list of stop words*/), true));
        return new TokenStreamComponents(source, result);
    }
}
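
To illustrate how the "class" and "parameters" entries of the JSON sample might be turned into instances, here is a minimal sketch using only JDK reflection. It is not the code from this pull request: the `ComponentFactory` class, its alias table, and the string coercion rules are all hypothetical, and the Lucene class names in the alias table assume a Lucene 5.x-style package layout. The sketch only handles string, int, and boolean parameters, so it does not cover the `null` entries or the stop-word list conversion shown in the sample.

```java
import java.lang.reflect.Constructor;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that resolves "class" names and builds instances reflectively. */
final class ComponentFactory {

    // Assumed shortcut table; the actual shortcut names and targets are not given in this thread.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("ngram", "org.apache.lucene.analysis.ngram.NGramTokenizer");
        ALIASES.put("stop", "org.apache.lucene.analysis.core.StopFilter");
    }

    /** Returns the fully qualified name for a known shortcut, or the input unchanged. */
    static String resolveClassName(String nameOrAlias) {
        return ALIASES.getOrDefault(nameOrAlias, nameOrAlias);
    }

    /** Converts one JSON string parameter to the type a constructor expects. */
    static Object coerce(String raw, Class<?> target) {
        if (raw == null) {
            return null;
        }
        if (target == int.class || target == Integer.class) {
            return Integer.valueOf(raw);
        }
        if (target == boolean.class || target == Boolean.class) {
            return Boolean.valueOf(raw);
        }
        if (target == String.class || target == CharSequence.class || target == Object.class) {
            return raw;
        }
        throw new IllegalArgumentException("Unsupported parameter type: " + target.getName());
    }

    /** Builds an instance through the first public constructor whose arity and types fit. */
    static Object instantiate(String nameOrAlias, List<String> params) {
        try {
            Class<?> type = Class.forName(resolveClassName(nameOrAlias));
            for (Constructor<?> ctor : type.getConstructors()) {
                Class<?>[] types = ctor.getParameterTypes();
                if (types.length != params.size()) {
                    continue;
                }
                try {
                    Object[] args = new Object[types.length];
                    for (int i = 0; i < types.length; i++) {
                        args[i] = coerce(params.get(i), types[i]);
                    }
                    return ctor.newInstance(args);
                } catch (IllegalArgumentException e) {
                    // Parameters do not fit this constructor; try the next one.
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + nameOrAlias, e);
        }
        throw new IllegalArgumentException("No matching constructor for " + nameOrAlias);
    }
}
```

With this sketch, a value that is not a known shortcut is treated as a fully qualified class name, so custom classes on the classpath need no registration. Note that when several constructors accept the same parameter strings, the one picked is not deterministic, which is one reason a real implementation would likely need stricter rules.
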

ealonsodb commented 7 years ago

Hi @rmannibucau: Please excuse my late answer. Grab a coffee, this is going to be long.

We have to admit that this is an awesome feature. It gives users the ability to create custom analyzers, just like Elasticsearch. We definitely want this feature included in the project.

There are some things we want you to change before merging this PR:

I have attached two pictures that explain this better:

[Image: abstract_tokenizer_builder]

[Image: child_tokenizer_builder]

I have started developing my own version of this feature (what the images show) in the branch feature/build_custom_analyzer. I will upload it soon.

I can continue with this, but it is your decision. Do you want to change your code to meet our requirements with me as the reviewer, or do you want me to develop this with you as the reviewer?

Looking forward to your answer.

rmannibucau commented 7 years ago

Hmm, I have to admit I don't know how you can achieve both JSON validation and openness of the instantiated instance. One of the goals was to ensure we have shortcuts for well-known instances/types, where it fully makes sense to use inheritance, but also to support custom or evolving ones, which you can't validate anyway.

What would be the plan for such support?

Also, I'm not sure I get the versioning (that's surely why I used master): why 3.0 and not 3.10?

If you are already working on that, I guess you'll be faster than me, so I'm happy to close this. Otherwise, I can probably give it a try in a few weeks (I can't at the moment).