apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.68k stars 1.04k forks source link

NumericRange support for new query parser [LUCENE-1768] #2842

Closed asfimport closed 13 years ago

asfimport commented 15 years ago

It would be good to specify some type of "schema" for the query parser in future, to automatically create NumericRangeQuery for different numeric types? It would then be possible to index a numeric value (double,float,long,int) using NumericField and then the query parser knows, which type of field this is and so it correctly creates a NumericRangeQuery for strings like "[1.567..*]" or "(1.787..19.5]".

There is currently no way to extract if a field is numeric from the index, so the user will have to configure the FieldConfig objects in the ConfigHandler. But if this is done, it will not be that difficult to implement the rest.

The only difference between the current handling of RangeQuery is then the instantiation of the correct Query type and conversion of the entered numeric values (simple Number.valueOf(...) cast of the user entered numbers). Evenerything else is identical, NumericRangeQuery also supports the MTQ rewrite modes (as it is a MTQ).

Another thing is a change in Date semantics. There are some strange flags in the current parser that tells it how to handle dates.


Migrated from LUCENE-1768 by Uwe Schindler (@uschindler), resolved Sep 08 2011 Attachments: TestNumericQueryParser-fix.patch (versions: 4), week1.patch, week11-13_for_lucene_3x.patch (versions: 2), week-14.patch, week15_for_lucene_3x.patch, week15_for_trunk.patch, week2.patch, week3.patch, week4.patch, week5-6.patch, week-7.patch, week-8.patch Linked issues:

asfimport commented 13 years ago

Vinicius Barros (migrated from JIRA)

Sorry, I have been in silence for so long time, trying to get some results, but many doubts showed up, I need to guidance here.

First, I started working on implementing >=, <=, <, > and = operator for standard query parser. Then I later realized someone had submitted a patch for that already and I stopped working on it.

Then I decided to implement support for date resolutions for numeric queries in query parser. I started by changing NumberDateFormat to receive a resolution parameter (DAY, SECONDS, MINUTES, etc) and this new parameter is taken into account when doing the date conversion. For that, I added a new method to do the date rounding that takes TimeZone into account, since the current round methods in DateTools do not support timezone. I was able to make it work up to this part.

After that, I started to work on date compression, as you suggested before Uwe. For example, if the user wants a DAY resolution, NumberDateFormat should only return the number of days for the given date since 1970, not the miliseconds. For SECOND resolution, it's easy, just divide the miliseconds by 1000. For minutes the same, divide the miliseconds by 1000 * 60 and so on. However, when I got to month, I have no easy way to compress the miliseconds, I mean, no easy way to truncate the days and only get the month count since 1970. The only good solution I found was to get the number of years since 1970 and multiply by 12 plus the current month number.

I am still wondering if we can always assume that dividing the miliseconds by 1000 (sec), 60 (minute), 60 (hour) and 24 (day) will actually be precise. What about the leap second? Not sure if the miliseconds -> (defined_resolution) and (defined_resolution) -> miliseconds will always be correct. Maybe I am missing something or overcomplicating.

asfimport commented 13 years ago

Vinicius Barros (migrated from JIRA)

I also started working on applying the numeric support to 3x. However I am not sure about backwards compatibility there.

The problem is that in trunk I had renamed RangeQueryNode to TermRangeQueryNode. Also, TermRangeQueryNode no longer extends ParametricRangeQueryNode, it now extends AbstractRangeQueryNode. Because of that, ParametricRangeQueryNodeProcessor returns a TermRangeQueryNode instead of RangeQueryNode. As you can see, many things the user might expect to still work the same is working completely different. I see some classes in Lucene use Version, but I don't know exactly how that works and why standard query parser do not use it. Should it?

Not sure how I should proceed now. Should I ignore the backward compatibility and go ahead and change how everything behaves or try to make everything backward compatible (not sure how I could do that without the use of Version).

Do you have any comments on that Uwe?

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I see some classes in Lucene use Version, but I don't know exactly how that works and why standard query parser do not use it. Should it?

Version is inteneded to be used for behavioural changes to keep index compatibility, so people can use new Lucene versions without reindexing. It does not help for API changes (it could sometimes, but only for those cases where the API changes are something like: If versionA call method a else method b, if method a or b trigger different APIs).

Typical examples for Version are changes in tokenization (so most analyzers use it): When a bugfix in the analyzer produces different tokens than before the version flag is used to be able to enable the "buggy" behaviour, so querying your index with the "wrong" tokens still works. The core queryparser also uses it to change the behaviour of creating phrase queries (the flexible query parser is, as far as I know, still missing this).

I am away this weekend, I will come back to you on Monday for the other questions.

asfimport commented 13 years ago

Adriano Crestani (migrated from JIRA)

Good point, should flexible query parser be using version? The changes below might break users' code.

The question is: should we support versioning for the above changes or should we only document the changes and let the user be aware of such change?

I think we should be very careful and only use Version when it's really required (like things that imply in index changes). So I vote for only documenting the change and also try to avoid changes that will change the code behavior.

I see the classic QP only uses version to change the default value for certain instance attributes between 2.9/3.0 and 3.x versions. I think it's only there because 2.9 and 3.0 were required to be the same, except removed deprecations. I don't think 4.0 will have the same requirement, will it?

asfimport commented 13 years ago

Adriano Crestani (migrated from JIRA)

I took a look at the code and I have some suggestions to avoid doing to many changes:

I think it's a start, I haven't tried the changes, so I will let you(Vinicius) play with it. Let us know if it helped or if you hit any new problems ;)

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

for 3x, do not rename RangeQueryNode to TermRangeQueryNode, just deprecate it and document it saying it will change name in future (Uwe, can you confirm this is the right procedure for class renaming?!).

I can confirm this, we do it generally a little bit different, there are two possibilities (depending on if the user will create instances of this class and pass it to the API or if the API returns the instance):

I agree with the rest of Adriano's comments.

In general this is a contrib module and contrib modules don't have the strict backwards requirements like core classes. Because of this I would only provide minimal backwards layers (no sophisticated ones) and document all changes. Backwards compatibility sometimes get a pain, so we even document and break backwards in core (sometimes). We list all those breaks in the backwards section. Backwards breaks that can be fixed by recompilation are very minor and should only be documented (as drop-in-JAR replacements no longer work), so they are no issue at all.

I see the classic QP only uses version to change the default value for certain instance attributes between 2.9/3.0 and 3.x versions. I think it's only there because 2.9 and 3.0 were required to be the same, except removed deprecations. I don't think 4.0 will have the same requirement, will it?

4.0 breaks backwards completely with no deprecation layers. The core API of Lucene changed hard (new enum types, new structures, totally new API, change from char[] to byte[],... the list is very long). We have a MIGRATION.txt that explains all changes.

We only add deprecation layers in 3.x core for cases where very high-level user classes are affected (e.g. IndexSearcher.search() methods, so transition is easy). All other places in Lucene that are more expert will change as described before.

asfimport commented 13 years ago

Vinicius Barros (migrated from JIRA)

Hi Uwe and Adriano,

Thanks for the comments.

I have been trying to follow Adriano's instructions to avoid major changes to API and behavior. However, ParametricRangeQueryNode and AbstractRangeQueryNode do not share the same methods (as Adriano said above). ParametricRangeQueryNode has lower and upper bound, which are ParametricQueryNodes, and these last ones hold the inclusive/exclusive information (they use CompareOperator for that). AbtractRangeQueryNode have lower and upper bounds (these ones defined by the template, so it could be a ParametricQueryNode, which is compatible with ParametricRangeQueryNode), however it holds the inclusive/exclusive information differently, through isLowerInclusive and isUpperInclusive methods.

I just don't understand why ParametricQueryNode hold this CompareOperator value, I think this should be part of the ParametricRangeQueryNode and ParametricQueryNode should only be a simple value (FieldQueryNode). Any suggestions here?

asfimport commented 13 years ago

Adriano Crestani (migrated from JIRA)

I just don't understand why ParametricQueryNode hold this CompareOperator value, I think this should be part of the ParametricRangeQueryNode and ParametricQueryNode should only be a simple value (FieldQueryNode). Any suggestions here?

That's a looong story. And you are right, they are not compatible (my fault) and CompareOperator does not make much sense. Today, if you want, you can set a ParametricRangeQueryNode with CompareOperator.LE set in the two bounds :S. Lets take the opportunity and try to redesign it.

I also agree that ParametricRangeQueryNode could only have FieldQueryNode as its bounds, so I think we can get rid of ParametricQueryNode (for 4.0). For now, I will suggest the following change:

I think this way we can get rid of ParametricQueryNode in future. I don't know the implications of these changes, since I haven't tried it. So let me know if you find any new conflict.

asfimport commented 13 years ago

Vinicius Barros (migrated from JIRA)

This patches backports the numeric support to 3x.

Uwe and Adriano, please, review it and let me know if the backports looks OK.

Thanks!

asfimport commented 13 years ago

Adriano Crestani (migrated from JIRA)

Hi Vinicius,

Good work, the patch looks good. I saw you named the new interface RangeQueryNode as well. I don't see any problem, since they are in different packages.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I have no problems with the patch, too. We have unfortunately no real backwards tests for contribs, but I see no major problems.

I will commit this soon and record merges from trunk, so the commits are related by mergeprops. What changes are still missing from trunk that will never be backported?

Adriano: Can you summarize all your changes in both issues into a changes.txt entry for both branches?

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

In 3.x we depend on Java 5, I found two interface @Override in your patch, that are not detected by Java6's javac (it's a bug there).

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Attached a patch that fixes the interface @Override, including all merge-props needed.

asfimport commented 13 years ago

Adriano Crestani (migrated from JIRA)

Adriano: Can you summarize all your changes in both issues into a changes.txt entry for both branches?

Hi Uwe, yes, I can do it, I will need to review the entire code again (3x and trunk). I plan to do this during the weekend. But if you want, commit Vinicius's code and I commit the changes to changes.txt later ;)

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

OK. Of course, also Vinicius can help :-)

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Committed 3.x revision: 1159411

I keep this issue open until CHANGES.txt for trunk and 3.x is finished (including upgrade guidelines for backwards breaks).

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Adriano: Can you take this issue, because I will fly to California on Sunday and have no time next week? Of course, I will fill out the GSoC evaluation form once it is available online.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I had to fix Javadoc warnings in 3.x revision: 1159420

asfimport commented 13 years ago

Vinicius Barros (migrated from JIRA)

Thanks for committing the patch Uwe.

I tried to summarize everything I have changed so far for this JIRA. I hope this helps:

asfimport commented 13 years ago

Vinicius Barros (migrated from JIRA)

Right now I am reviewing the API changes I did in 3x and trying to replicate it back to trunk (removing deprecated classes, etc). I should have a new patch soon.

asfimport commented 13 years ago

Adriano Crestani (migrated from JIRA)

Hi Vinicius,

Thanks for the info, but maybe it is too much details for the changes.txt. Let me try to simplify it.

(3.x)

New features: LUCENE-1768: added support for numeric ranges in contrib query parser; added support for simple numeric queries, such as <age:4>, in contrib query parser (Vinicius Barros via Uwe Schindler)

Changes in runtime behavior: LUCENE-1768: StandardQueryConfigHandler now uses NumericFieldConfigListener to set a NumericConfig to its corresponding FieldConfig; StandardQueryTreeBuilder now uses DummyQueryNodeBuilder for NumericQueryNodes and uses NumericRangeQueryNodeBuilder for NumericRangeQueryNodes; StandardQueryNodeProcessorPipeline now executes NumericQueryNodeProcessor followed by NumericRangeQueryNodeProcessor right after LowercaseExpandedTermsQueryNodeProcessor; (Vinicius Barros via Uwe Schindler)

API changes: LUCENE-1768: setNumericConfigMap and getNumericConfigMap were added to StandardQueryParser; ParametricRangeQueryNode and oal.queryParser.standard.nodes.RangeQueryNode now implement oal.queryParser.core.nodes.RangeQueryNode; oal.queryParser.core.nodes.RangeQueryNode was deprecated and now extends TermRangeQueryNode, which extends AbstractRangeQueryNode; ParametricQueryNode was deprecated; FieldQueryNode now implements the new FieldValueQueryNode<CharSequence>, which this last one implements FieldableQueryNode and thew new ValueQueryNode; (Vinicius Barros via Uwe Schindler)

(trunk)

Changes in runtime behavior: LUCENE-1768: StandardQueryTreeBuilder uses RangeQueryNodeBuilder for RangeQueryNodes, since theses two classes were removed; ParametricRangeQueryNodeProcessor now creates TermRangeQueryNode, instead of RangeQueryNode, from ParametricRangeQueryNode (Vinicius Barros via Uwe Schindler)

API Changes: LUCENE-1768: ParametricRangeQueryNode now implements RangeQueryNode<FieldQueryNode> instead of RangeQueryNode<ParametricQueryNode> (Vinicius Barros via Uwe Schindler)

Vinicius, please, review the summary I wrote above, I hope I could simplify correctly what you summarized. The API Changes for trunk I am assuming you are going to remove ParametricQueryNode(deprecated) and replace by FieldQueryNode in that statement.

Uwe, I hope this helps :)

asfimport commented 13 years ago

Vinicius Barros (migrated from JIRA)

I have changed again trunk to reflect the latest changes on 3x.

I also realized that ParametricRangeQueryNode and TermRangeQueryNode are basically the same, so I decided to removed ParametricRangeQueryNode and only use TermRangeQueryNode. Everything seems to be working fine. Let me know if I should not have done that for any reason.

I have some few changes on 3x to submit yet, but I was wondering: is it necessary to deprecate a class in 3x if it's ONLY going to be removed in 4.0? Not sure if I understand how these things work yet.

asfimport commented 13 years ago

Vinicius Barros (migrated from JIRA)

Thanks Adriano, your summary looks better!

To trunk behavior changes, I would add that "StandardSyntaxParser" no longer outputs ParametricRangeQueryNodes for range queries, instead, it outputs TermRangeQueryNodes.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Vinicius, any news for trunk? We want to release Lucene 3.4.0, just to confirm: Are we finished with the 3.x branch? I will just quickly add the changes entry as suggested by Adriano.

asfimport commented 13 years ago

Vinicius Barros (migrated from JIRA)

Sorry Uwe, college is taking my time again.

Here are the latest changes, for trunk and 3x.

Tests are passing for both.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Adriano,

thanks. For applying the patch you have to first do a svn rename: modules/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/processors/ParametricRangeQueryNodeProcessor.java -> modules/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/processors/TermRangeQueryNodeProcessor.java

After doing this, the patch applies. I will verify the changes and respond back later. I will apply the 3.x javadoc/code changes to the 3.x and 3.4 branches.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

The cleanups in trunk look fine, I think we can commit them. The tests did not change so I am confident that they break nothing.

I think we can resolve this issue after those. Nice work!

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Committed final changes to trunk revision: 1166789 Committed Javadocs/deprecation changes + changes.txt to 3.x branch revision: 1166536; 3.4 release branch revision: 1166538

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

If you have any more comments, please reopen, I think this is resolved now.

Thanks Vinicius, good work!

asfimport commented 13 years ago

Vinicius Barros (migrated from JIRA)

Hi Uwe,

That's it. No, I don't have any other changes in my workspace.

It was nice working with you, thanks!

asfimport commented 9 years ago

Michael Osipov (@michael-o) (migrated from JIRA)

Though this is old and Lucene 3.x is not supported anymore, We have found a bug in our old indexing service cause by the RangeQueryNode. It has been incorrectly implemented back in 3.4. The logic is reversed compared to the fix in 4.0 thus making it unusable.

That is the logic in 3.x:

super(lower, upper, lower.getOperator() == CompareOperator.LE, upper
        .getOperator() == CompareOperator.GE);

and the logic in 4.x:

return new TermRangeQueryNode(lower, upper,
          lower.getOperator() == CompareOperator.GE,
          upper.getOperator() == CompareOperator.LE);