Closed smiklosovic closed 7 years ago
Hi
As stated in the docs, fuzzy queries can only use Damerau–Levenshtein or Levenshtein for distance calculation.
What do I have to do in order to support that? I don't mind implementing it myself and forking this.
Where is it specified which one will be used?
Hi @smiklosovic :
As you probably know, the search library used in this project is Apache Lucene.
There are four steps:
First, you should implement a custom JaroWinklerQuery in Lucene, as recommended in this thread on its user list. Please target the Lucene version you are already using.
Second, you need to code your own JaroWinklerCondition in cassandra-lucene-index/plugin, just like the fuzzy condition, with its builder. You also need to choose a custom condition type in ConditionBuilder, just to determine the type of the query. You could add a unit test like FuzzyConditionTest.
Third, you should include the JaroWinklerCondition builder in cassandra-lucene-index/builder, just like FuzzyCondition, and add a method to Builder so that the query can be created with the builder submodule library. There are unit tests in builder too.
And as a last step, you could add an acceptance test like FuzzySearchIT in cassandra-lucene-index/testsAT.
The first step is hard to achieve; the others are easy.
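To make the first step a bit more concrete, here is a minimal sketch of the Jaro-Winkler similarity itself in plain Java, with no Lucene dependency. The class name JaroWinkler and its methods are hypothetical and are not part of Lucene or of this plugin. The prefix scale of 0.1 and the 4-character prefix cap are the commonly cited defaults and are assumed here; some implementations also apply the prefix boost only above a threshold, which this sketch does not do. A real JaroWinklerQuery would still have to enumerate the indexed terms and decide which ones to accept.

```java
// Hypothetical helper, not a Lucene or plugin class.
final class JaroWinkler {
    private JaroWinkler() {}

    /** Plain Jaro similarity in [0, 1]; 1 means identical strings. */
    static double jaro(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        if (a.isEmpty() || b.isEmpty()) return 0.0;
        // Characters only count as matching within this distance of each other.
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] aMatched = new boolean[a.length()];
        boolean[] bMatched = new boolean[b.length()];
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(b.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!bMatched[j] && a.charAt(i) == b.charAt(j)) {
                    aMatched[i] = true;
                    bMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order.
        int outOfOrder = 0;
        int j = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!aMatched[i]) continue;
            while (!bMatched[j]) j++;
            if (a.charAt(i) != b.charAt(j)) outOfOrder++;
            j++;
        }
        double m = matches;
        double t = outOfOrder / 2.0; // each transposition is counted twice above
        return (m / a.length() + m / b.length() + (m - t) / m) / 3.0;
    }

    /** Jaro similarity boosted by the length of the common prefix (at most 4). */
    static double similarity(String a, String b) {
        double jaro = jaro(a, b);
        int limit = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) prefix++;
        return jaro + prefix * 0.1 * (1.0 - jaro); // 0.1 is the assumed scaling factor
    }
}
```

Note that this is a similarity (higher is closer), whereas the existing fuzzy condition works with a maximum number of edits, so the new condition would probably need a minimum-similarity parameter instead.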
Hope this helps
Thanks a lot for your detailed answer.
What I do not understand is how the implementation in this plugin enables me to search with it. I mean, once I create JaroWinklerQuery, where exactly would that query be saved, and how would it be used from this plugin? Any hints, please?
Has anybody ever written Jaro-Winkler for Lucene so I can just more or less embed it?
I know there is this (1). Could that be helpful? What connection would it have to JaroWinklerQuery?
And finally, where did you specify that Damerau-Levenshtein will be used automatically?
(1) http://lucene.apache.org/core/6_4_0/suggest/org/apache/lucene/search/spell/JaroWinklerDistance.html
Hi @smiklosovic,
This plugin allows you to translate Cassandra data and queries into Apache Lucene queries. Apache Lucene is a separate project, and it does not support Jaro-Winkler distance-based queries. So what @ealonsodb suggests is that you create your own Lucene query implementation and write the plugin classes wrapping it.
Regarding where the query will be stored: the new query type would be a piece of code added to the existing code of the plugin, and it will be part of the Java JAR file that the plugin's code is compiled into.
As for whether someone has already written a Lucene query using Jaro-Winkler, you can take a look at the Lucene documentation and address any questions to the Apache Lucene user community. Once you have the query working in Lucene itself, we can help with the integration with Cassandra, which is within the scope of this project.
We don't specify anywhere that Damerau-Levenshtein will be used; we just use Lucene's FuzzyQuery implementation, which only implements this kind of distance function. You can take a look at Lucene's code and try to modify their implementation.
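For reference, the difference between the two edit distances mentioned in this thread can be sketched in a few lines of plain Java. This is not Lucene's code (FuzzyQuery is built on automata, not on a table like this); the class name EditDistance is hypothetical, and the boolean flag is only meant to mirror the transpositions option that, as far as I know, FuzzyQuery exposes to choose between classic Levenshtein and the optimal string alignment variant of Damerau-Levenshtein.

```java
// Hypothetical helper for illustration only, not a Lucene or plugin class.
final class EditDistance {
    private EditDistance() {}

    /**
     * Edit distance between a and b. With transpositions = false this is
     * classic Levenshtein; with true, swapping two adjacent characters
     * costs a single edit (optimal string alignment).
     */
    static int distance(String a, String b, boolean transpositions) {
        int n = a.length();
        int m = b.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i; // delete all of a's prefix
        for (int j = 0; j <= m; j++) d[0][j] = j; // insert all of b's prefix
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1);
                best = Math.min(best, d[i - 1][j - 1] + cost);
                if (transpositions && i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    // Adjacent swap counted as one edit.
                    best = Math.min(best, d[i - 2][j - 2] + 1);
                }
                d[i][j] = best;
            }
        }
        return d[n][m];
    }
}
```

With this sketch, "lucene" and "lucnee" are one edit apart when transpositions are allowed and two edits apart otherwise, which is the whole difference between the two options.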
As for the linked Jaro-Winkler implementation, I think it serves a very different purpose, being related to spell checking and not to fuzzy queries. Is a fuzzy query using Jaro-Winkler distance what you want?
We are using Jaro-Winkler instead of Damerau-Levenshtein.
Is it possible to switch between these two?