Stratio / cassandra-lucene-index

Lucene based secondary indexes for Cassandra
Apache License 2.0

Different algorithm for fuzzy search #271

Closed smiklosovic closed 7 years ago

smiklosovic commented 7 years ago

We are using Jaro-Winkler instead of Damerau-Levenshtein.

Is it possible to switch between these two?

ealonsodb commented 7 years ago

Hi

As stated in the documentation, fuzzy queries can only use Damerau–Levenshtein or Levenshtein for distance calculation.
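To illustrate the difference between the two distances, here is a minimal Java sketch. It assumes the restricted (optimal string alignment) form of Damerau-Levenshtein, in which swapping two adjacent characters counts as a single edit. The class and method names are made up for this example, and this is not how Lucene computes distances internally.

```java
final class EditDistances {

    /** Classic Levenshtein distance: insertions, deletions and substitutions. */
    static int levenshtein(String a, String b) {
        return distance(a, b, false);
    }

    /**
     * Restricted Damerau-Levenshtein distance (optimal string alignment):
     * like Levenshtein, but an adjacent transposition costs one edit.
     */
    static int damerauLevenshtein(String a, String b) {
        return distance(a, b, true);
    }

    private static int distance(String a, String b, boolean transpositions) {
        // d[i][j] = distance between the first i chars of a and the first j chars of b
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all i characters
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all j characters
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
                // Optional extra case: the last two characters are swapped.
                if (transpositions && i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("lucene", "lucnee"));
        System.out.println(damerauLevenshtein("lucene", "lucnee"));
    }
}
```

With this sketch, a swapped pair such as "lucene" versus "lucnee" costs two edits under Levenshtein and one under Damerau-Levenshtein.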

smiklosovic commented 7 years ago

What do I have to do in order to support that? I don't mind implementing it myself in a fork.

Where is it specified which one will be used?

ealonsodb commented 7 years ago

Hi @smiklosovic :

As you probably know, the search library used in this project is Apache Lucene.

There are four steps:

First, you should implement a custom JaroWinklerQuery in Lucene, as recommended in this thread on its user list. Please use the Lucene version you are using.

Second, you need to code your own JaroWinklerCondition in cassandra-lucene-index/plugin, along with its builder, just like the fuzzy condition. You also need to choose a custom condition type in ConditionBuilder to identify the type of the query. You could add a unit test like FuzzyConditionTest.

Third, you should include the JaroWinklerCondition builder in cassandra-lucene-index/builder, just like FuzzyCondition, and add a method to Builder so that the query can be created with the builder submodule library. There are unit tests in builder too.

As a last step, you could add an acceptance test like FuzzySearchIT in cassandra-lucene-index/testsAT.

The first step is hard to achieve; the others are easy.
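Conceptually, the query from the first step has to match every indexed term whose similarity to the search term reaches some threshold. A naive sketch of that behaviour in plain Java, without any Lucene classes, could look like the following. The class name, the method name and the linear scan over all terms are assumptions made for illustration only; a real Lucene query would have to work against the index's term dictionary instead of an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

final class SimilarityTermScan {

    /**
     * Returns the terms whose similarity to queryTerm is at least minSimilarity.
     * The similarity function is passed in, so any measure returning a score
     * (for example a Jaro-Winkler implementation) can be plugged in.
     */
    static List<String> matchingTerms(Iterable<String> indexedTerms,
                                      String queryTerm,
                                      ToDoubleBiFunction<String, String> similarity,
                                      double minSimilarity) {
        List<String> hits = new ArrayList<>();
        for (String term : indexedTerms) {
            if (similarity.applyAsDouble(queryTerm, term) >= minSimilarity) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Toy similarity for demonstration only: 1.0 for equal strings, else 0.0.
        ToDoubleBiFunction<String, String> exact = (a, b) -> a.equals(b) ? 1.0 : 0.0;
        List<String> terms = Arrays.asList("martha", "marhta", "dixon");
        System.out.println(matchingTerms(terms, "martha", exact, 0.9));
    }
}
```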

Hope this helps

smiklosovic commented 7 years ago

Thanks a lot for your detailed answer.

What I do not understand is how the implementation in this plugin enables me to search with it. If I create a JaroWinklerQuery, where exactly would that query be stored, and how would it be used from this plugin? Any hints, please?

Has anybody ever written Jaro-Winkler for Lucene so I can just more or less embed it?

I know there is this (1). Could that be helpful? What connection would it have to JaroWinklerQuery?

And finally, where did you specify that Damerau-Levenshtein will be used automatically?

(1) http://lucene.apache.org/core/6_4_0/suggest/org/apache/lucene/search/spell/JaroWinklerDistance.html

adelapena commented 7 years ago

Hi @smiklosovic,

This plugin allows you to translate Cassandra data and queries into Apache Lucene queries. Apache Lucene is a separate project, and it does not support Jaro-Winkler distance-based queries. So what @ealonsodb suggests is to create your own Lucene query implementation and write the plugin classes that wrap it.

Regarding where the query will be stored: the new query type would be a piece of code added to the existing code of the plugin, and it would be part of the Java JAR file that the plugin is compiled into.

As for whether someone has written a Lucene query using Jaro-Winkler, you can take a look at the Lucene documentation and address any questions to the Apache Lucene user community. Once you have the query working in Lucene itself, we can help with the integration with Cassandra, which is within the scope of this project.

We don't specify anywhere that Damerau-Levenshtein will be used; we just use Lucene's FuzzyQuery implementation, which only implements this kind of distance function. You can take a look at Lucene's code and try to modify their implementation.

As for the linked Jaro-Winkler implementation, I think it serves a very different purpose, being related to spell checking rather than to fuzzy queries. Is a fuzzy query using Jaro-Winkler distance what you want?
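For reference, the Jaro-Winkler similarity itself can be sketched in a few lines of Java. This is a minimal sketch assuming the common formulation with a prefix scale of 0.1 and a common prefix of at most four characters, with the prefix bonus applied unconditionally. The class and method names are hypothetical, and this is not the code of Lucene's JaroWinklerDistance.

```java
final class JaroWinkler {

    private static final double PREFIX_SCALE = 0.1; // assumed standard value
    private static final int MAX_PREFIX = 4;        // assumed standard value

    /** Jaro similarity in [0, 1]; 1.0 means the strings are equal. */
    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) {
            return 1.0;
        }
        int len1 = s1.length();
        int len2 = s2.length();
        if (len1 == 0 || len2 == 0) {
            return 0.0;
        }
        // Characters match only if they are equal and at most 'window' positions apart.
        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(len2, i + window + 1);
            for (int j = from; j < to; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Count matched characters that appear in a different order in the two strings.
        int outOfOrder = 0;
        int j = 0;
        for (int i = 0; i < len1; i++) {
            if (!matched1[i]) {
                continue;
            }
            while (!matched2[j]) {
                j++;
            }
            if (s1.charAt(i) != s2.charAt(j)) {
                outOfOrder++;
            }
            j++;
        }
        double m = matches;
        double t = outOfOrder / 2.0; // each transposition involves two characters
        return (m / len1 + m / len2 + (m - t) / m) / 3.0;
    }

    /** Jaro similarity boosted by the length of the common prefix. */
    static double jaroWinkler(String s1, String s2) {
        double jaro = jaro(s1, s2);
        int limit = Math.min(MAX_PREFIX, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < limit && s1.charAt(prefix) == s2.charAt(prefix)) {
            prefix++;
        }
        return jaro + prefix * PREFIX_SCALE * (1.0 - jaro);
    }

    public static void main(String[] args) {
        System.out.println(jaroWinkler("MARTHA", "MARHTA"));
    }
}
```

Note that this is a similarity score between 0 and 1 rather than an edit count, so a query built on it would need a minimum-similarity threshold instead of a maximum number of edits.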