NumericRange support for new query parser [LUCENE-1768]

asfimport commented 15 years ago

It would be good to be able to specify some type of "schema" for the query parser in the future, so that it can automatically create a NumericRangeQuery for the different numeric types. It would then be possible to index a numeric value (double, float, long, int) using NumericField, and the query parser would know which type the field has and correctly create a NumericRangeQuery for strings like "[1.567..*]" or "(1.787..19.5]".

There is currently no way to determine from the index whether a field is numeric, so the user will have to configure the FieldConfig objects in the ConfigHandler. Once that is done, the rest should not be difficult to implement.

The only difference from the current handling of RangeQuery is then the instantiation of the correct Query type and the conversion of the entered numeric values (a simple Number.valueOf(...) conversion of the user-entered numbers). Everything else is identical; NumericRangeQuery also supports the MTQ rewrite modes (as it is an MTQ).

Another thing is a change in Date semantics. There are some strange flags in the current parser that tell it how to handle dates.

Migrated from LUCENE-1768 by Uwe Schindler (@uschindler), resolved Sep 08 2011

Attachments: TestNumericQueryParser-fix.patch (versions: 4), week1.patch, week11-13_for_lucene_3x.patch (versions: 2), week-14.patch, week15_for_lucene_3x.patch, week15_for_trunk.patch, week2.patch, week3.patch, week4.patch, week5-6.patch, week-7.patch, week-8.patch

Linked issues:

- 2641
- 2898
- 4411
- 4425

asfimport commented 15 years ago

Luis Alves (migrated from JIRA)

Uwe,

Thanks for creating the jira issue.

Can you add some simple query examples? What would the Lucene Query objects for those queries be, if they were produced by a QP that supported this feature?

Also, please elaborate on the current expected behavior for those queries.

If you can write a JUnit test with one or two indexed docs and a Lucene Query that retrieves just one of those docs and not the other, without using the query parser, that would be helpful.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Luis, I will post an example of queries and the constructed Query objects when I am back from holidays (Thursday+). In principle the syntax would be the same as for normal range queries, except that the min/max arguments may be double, float, int, long or dates. You would create instances of NumericRangeQuery from them using one of the static factories for each data type (for dates, a conversion to long using Date.getTime() would be done). The data type must be predefined somehow for the field names, using some type of per-field schema. Open ends use "*", and the [], (), {} would define whether the bounds are inclusive. NumericRangeQuery is a subclass of MultiTermQuery, so the rewrite method also applies to this query. For NRQ there is also a config parameter precisionStep, whose default value is 4, but it should also be configurable per field together with the data type.
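To make the bracket and open-end rules above concrete, here is a minimal plain-Java sketch that splits a range expression into bounds and inclusive flags. It assumes the classic "lower TO upper" form; the class ParsedRange and its methods are hypothetical names for illustration, not part of Lucene or of either query parser.

```java
/** Hypothetical holder for a parsed range expression; not part of Lucene. */
final class ParsedRange {
    final String lower;          // null means an open lower end ("*")
    final String upper;          // null means an open upper end ("*")
    final boolean includeLower;  // true for '[', false for '(' or '{'
    final boolean includeUpper;  // true for ']', false for ')' or '}'

    ParsedRange(String lower, String upper, boolean includeLower, boolean includeUpper) {
        this.lower = lower;
        this.upper = upper;
        this.includeLower = includeLower;
        this.includeUpper = includeUpper;
    }

    /** Parses expressions such as "[10 TO 20]", "{10 TO 20]" or "[10 TO *]". */
    static ParsedRange parse(String expr) {
        String s = expr.trim();
        if (s.length() < 2) {
            throw new IllegalArgumentException("not a range: " + expr);
        }
        char open = s.charAt(0);
        char close = s.charAt(s.length() - 1);
        if ("[({".indexOf(open) < 0 || "])}".indexOf(close) < 0) {
            throw new IllegalArgumentException("missing range brackets: " + expr);
        }
        String[] bounds = s.substring(1, s.length() - 1).trim().split("\\s+TO\\s+");
        if (bounds.length != 2) {
            throw new IllegalArgumentException("expected 'lower TO upper': " + expr);
        }
        return new ParsedRange(openEnd(bounds[0]), openEnd(bounds[1]), open == '[', close == ']');
    }

    /** Maps the "*" wildcard to null, which stands for an unbounded end. */
    private static String openEnd(String bound) {
        String b = bound.trim();
        return "*".equals(b) ? null : b;
    }
}
```

The bounds stay as strings here on purpose: converting them to int, long, float, double or a date is the per-field step that would need the schema discussed in this issue.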

Example code for creating the NRQ is in the JavaDocs, and there are 2 JUnit tests in trunk (TestNumericRangeQuery*) showing how it is used. Also, the new LIA2 contains a chapter about it.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I think this should be in 2.9. Any chance to do this? In my opinion, it should not be so hard. I will prepare something tomorrow.

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

I think, this should be in 2.9.

The standard way in the past was for the app to simply override getRangeQuery() to handle different fields differently. This still seems the most flexible.

asfimport commented 15 years ago

Luis Alves (migrated from JIRA)

You could still do something similar by simply overriding RangeQueryNodeBuilder.build(QueryNode queryNode), but this is not clean (it is kind of a hack).

A clean implementation would allow the user to configure the field types (which the "new flexible queryparser" does). I'm new to NumericRangeQuery and range queries in general, but here is what I think it should look like.

Here is a pseudo-Java example:

    final String defaultField = "default";
    final String monthField = "month";
    final String hourField = "hour";
    final String distanceField = "distance";
    final String moneyField = "money";

    Map<CharSequence, RangeTools.Type> rangeTypes =  new HashMap<CharSequence, RangeTools.Type>();

    // set a field specific range type per field
    rangeTypes.put(monthField, new RangeTools.Type(RangeUtils.DATE, DateTools.Resolution.MONTH) );
    rangeTypes.put(hourField, new RangeUtils.Type(RangeUtils.DATE,  DateTools.Resolution.HOUR) );
    rangeTypes.put(distanceField, RangeUtils.getType(RangeUtils.NUMERIC,  RangeUtils.NumericType.LONG, NumericUtils.PRECISION_STEP_DEFAULT) );
    rangeTypes.put(moneyField, RangeUtils.getType(RangeUtils.NUMERIC,  RangeUtils.NumericType.Type.FLOAT, NumericUtils.PRECISION_STEP_DEFAULT) );

    StandardQueryParser qp = new StandardQueryParser();

    // set default range type to Int default precision
    qp.setDefaultRangeType(RangeUtils.getType(RangeUtils.NUMERIC,  RangeUtils.NumericType.INT, NumericUtils.PRECISION_STEP_DEFAULT));

    // set field range types
    qp.setRangeTypes(rangeTypes);

    Query q = qp.parse("month:[01/01/2004 TO 01/01/2005] distance:[1000 TO 2000] money:[23.50 TO 50.99]", defaultField);

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

It feels like going that route would add much code and complexity.

If the user already knows how to create a range query in code, it's much more straightforward to just do

if ("money".equals(field)) return new NumericRangeQuery(field,...)
else return super.getRangeQuery(field,...)

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

You could still do something similar by simply overriding RangeQueryNodeBuilder.build(QueryNode queryNode), but this is not clean (it is kind of a hack).

What's the cleaner way to do this? EG could I make my own ParametricRangeQueryNodeProcessor, subclassing the current one in the "standard.processors" package, that overrides postProcessNode to do its own conversion?

asfimport commented 15 years ago

Luis Alves (migrated from JIRA)

Hi Yonik,

As I said before you can do that in the RangeQueryNodeBuilder.build(QueryNode queryNode), but it's ugly and this is not what we intended when using the "new flexible query parser".

The "new flexible query parser" does not follow the concept of method overriding as the old one does. So solutions that worked in the old query parser by overriding a method have to be implemented programmatically instead.

Your approach requires creating a new class and overriding a method. You still need to create an instance of your QueryParser, and it is not reusable.

Here is a sample of what your approach is:

class YonikQueryParser extends QueryParser {

  Query getRangeQuery(field,...) {
    if ("money".equals(field)) return new NumericRangeQuery(field,...);
    else return super.getRangeQuery(field,...);
  }
}

...
QueryParser yqp = new YonikQueryParser(...);
yqp.parse(query);

vs

What I am proposing:

    Map<CharSequence, RangeTools.Type> rangeTypes =  new HashMap<CharSequence, RangeTools.Type>();

    rangeTypes.put("money", RangeUtils.getType(RangeUtils.NUMERIC,  RangeUtils.NumericType.Type.FLOAT, NumericUtils.PRECISION_STEP_DEFAULT) );

    StandardQueryParser qp = new StandardQueryParser();
    qp.setRangeTypes(rangeTypes);

    qp.parse(query, null);

The second approach is programmatic: it does not require a new class or overriding a method, it is reusable by other users, and it's backward compatible, meaning we can integrate it into the current "flexible query parser" and deliver this feature in 2.9 without affecting any current use case.

Your approach is not compatible, requires a new class, and is not programmatic. It's not reusable by other users (we can't commit your code to lucene), since the fields are hard-coded.

Also, the approach I am proposing is very similar to setFieldsBoost and setDateResolution, which are already available in the old QP and the new flexible query parser.

I also want to say that extending the old QP and extending the "new flexible query parser" are never going to be similar; they are completely different implementations.

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

It's not reusable by other users (we can't commit your code to lucene)

Neither is your version with rangeTypes.put("money", RangeUtils.getType(RangeUtils.NUMERIC... That's the application specific configuration code and doesn't need (or want) to be committed.

Directly instantiating the query you want is simple, ultimately configurable, and avoids adding a ton of unnecessary classes or methods that need to be kept in sync with everything that a user may want to do.

Is there a simple way to provide a custom QueryBuilder for range queries (or any other query type?) I'm sure there must be, but there are so many classes in the new QP, I'm having a little difficulty finding my way around.

asfimport commented 15 years ago

Luis Alves (migrated from JIRA)

What's the cleaner way to do this? EG could I make my own ParametricRangeQueryNodeProcessor, subclassing the current one in the "standard.processors" package, that overrides postProcessNode to do its own conversion?

For Yonik's simple requirement, you could do one of the following.

Option 1 (more flexible):

- make your own ParametricRangeQueryNodeProcessor, subclassing the current one, that returns NumericQueryNodes where needed
- create a NumericQueryNode that extends RangeQueryNode (no extra code needed)
- create a NumericQueryNodeBuilder that handles NumericQueryNodes, and register it in the StandardQueryTreeBuilder map, e.g. setBuilder(NumericQueryNode.class, new NumericQueryNodeBuilder()). RangeQueryNodes will still be handled normally by the RangeQueryNodeBuilder.

Option 2 (less flexible):

- make your own RangeQueryNodeBuilder by subclassing the current one (e.g. NumericQueryNodeBuilder), and register it in the StandardQueryTreeBuilder map, e.g. setBuilder(RangeQueryNode.class, new NumericQueryNodeBuilder())

Option 1 is the correct usage of the APIs. It's more flexible, and the "dirty work" is done in the processor pipeline. Option 2 is not the intended use of the APIs; it requires less code and it will work, but the builder will be performing the tasks the processor should be doing.
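The division of labor in option 1 can be sketched in plain Java as follows. This is only an illustration under stated assumptions: all Sketch* classes are hypothetical stand-ins, the "query" is a string rather than a Lucene Query, and the numeric type is fixed to long for brevity.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a range node in the syntax tree. */
class SketchRangeNode {
    final String field;
    final String lower;
    final String upper;

    SketchRangeNode(String field, String lower, String upper) {
        this.field = field;
        this.lower = lower;
        this.upper = upper;
    }
}

/** Hypothetical numeric node; the bounds are already converted to numbers. */
class SketchNumericRangeNode extends SketchRangeNode {
    final long lowerValue;
    final long upperValue;

    SketchNumericRangeNode(String field, String lower, String upper) {
        super(field, lower, upper);
        this.lowerValue = Long.parseLong(lower);
        this.upperValue = Long.parseLong(upper);
    }
}

/** Processor step: decides the node type from configuration (the "dirty work"). */
class SketchRangeProcessor {
    private final Set<String> numericFields;

    SketchRangeProcessor(String... numericFields) {
        this.numericFields = new HashSet<String>(Arrays.asList(numericFields));
    }

    SketchRangeNode process(SketchRangeNode node) {
        if (node instanceof SketchNumericRangeNode || !numericFields.contains(node.field)) {
            return node; // term ranges and already converted nodes pass through
        }
        return new SketchNumericRangeNode(node.field, node.lower, node.upper);
    }
}

/** Builder step: a plain translation with no per-field decisions left to make. */
class SketchRangeBuilder {
    String build(SketchRangeNode node) {
        if (node instanceof SketchNumericRangeNode) {
            SketchNumericRangeNode n = (SketchNumericRangeNode) node;
            return "numeric(" + n.field + ": " + n.lowerValue + " TO " + n.upperValue + ")";
        }
        return "term(" + node.field + ": " + node.lower + " TO " + node.upper + ")";
    }
}
```

In option 2 the field check would move into the builder instead, which is shorter but mixes configuration decisions into the translation step.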

asfimport commented 15 years ago

Luis Alves (migrated from JIRA)

Neither is your version with rangeTypes.put("money", RangeUtils.getType(RangeUtils.NUMERIC... That's the application specific configuration code and doesn't need (or want) to be committed.

You are correct; I was describing the use case from the user's perspective. That code was an example of how to use the APIs if we implement them in the future; those APIs are not currently available.

Directly instantiating the query you want is simple, ultimately configurable, and avoids adding a ton of unnecessary classes or methods that need to be kept in sync with everything that a user may want to do.

I'm not sure what to say here, so I'll point to the documentation that we currently have. You can read https://issues.apache.org/jira/secure/attachment/12410046/QueryParser_restructure_meetup_june2009_v2.pdf and the javadocs for package org.apache.lucene.queryParser.core and class org.apache.lucene.queryParser.standard.StandardQueryParser.

You can also look at the TestSpanQueryParserSimpleSample JUnit test for another example of how the APIs can be used in a completely different way.

The new QueryParser was designed to be extensible, to allow the implementation of language extensions or different languages, and to have reusable components like the processors and builders.

We use SyntaxParsers, Processors and Builders; all are replaceable components at runtime. Any user can build their own pipeline and create new processors, builders and query nodes, and integrate them with the existing ones to create the features they require.

Some of the features are:

- Syntax Tree optimization
- Syntax Tree expansion
- Syntax Tree validation and error reporting
- Tokenization and normalization of the query
- Makes it easy to create extensions
- Support for translation of error messages
- Allows users to plug and play processors and builders, without having to modify lucene code.
- Allows lucene users to implement features much faster
- Allows users to change default behavior in an easy way without having to modify lucene code.

Is there a simple way to provide a custom QueryBuilder for range queries (or any other query type?) I'm sure there must be, but there are so many classes in the new QP, I'm having a little difficulty finding my way around.

Below is the Java code for option 2. It's not the recommended way to use the new query parser, but it is the shortest way to do what you want.

  class NumericQueryNodeBuilder extends RangeQueryNodeBuilder {
    // NOTE: sketch only. NumericRangeQuery is not a TermRangeQuery, so the
    // declared return type would have to be widened for this to compile.
    public TermRangeQuery build(QueryNode queryNode) throws QueryNodeException {
      RangeQueryNode rangeNode = (RangeQueryNode) queryNode;
      String field = rangeNode.getField().toString();

      if ("money".equals(field)) {
        // do whatever you need here with rangeNode
        return new NumericRangeQuery(field,...);
      }
      else {
        return super.build(queryNode);
      }
    }
  }

  public void testNewRangeQueryBuilder() throws Exception {    
    StandardQueryParser qp = new StandardQueryParser();
    QueryTreeBuilder builder = (QueryTreeBuilder)qp.getQueryBuilder();
    builder.setBuilder(RangeQueryNode.class, new NumericQueryNodeBuilder());

    String startDate = getLocalizedDate(2002, 1, 1, false);
    String endDate = getLocalizedDate(2002, 1, 4, false);    

    StandardAnalyzer oneStopAnalyzer = new StandardAnalyzer();
    qp.setAnalyzer(oneStopAnalyzer);

    Query a = qp.parse("date:[" + startDate + " TO " + endDate + "]", null);
    System.out.print(a);
  }

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

To go back to the idea why I opened the issue (and I think, this is also Mike's intention):

From what you see on java-user, where users ask questions about how to use Lucene: most users are not aware of the fact that they can create Query classes themselves. Most example code on the list is just: "I have such a query string and I pass it to Lucene and it does not work as expected." It is hard to explain to them that they should simply not use a query parser for their queries and just instantiate the query classes directly. For such users it is even harder to customize this query parser.

My intention is this: make the RangeQueryNodeBuilder configurable, as Luis proposed, so that you can set the type of a field (something we do not currently have in Lucene). If the type is undefined or explicitly set to "string/term", create a TermRangeQuery. If it is set to any numeric type, create a NumericRangeQuery.newXxxRange(field,....).

The same can currently be done with the original Lucene query parser, but only for dates (and it is really a hack using this DateField class). I simply want to extend it so that you can say "this field is of type 'int'" and have the correct range query created automatically. Because the old query parser is now "deprecated", I want to do it for the new one. This would also be an incentive for new users to throw away the old parser and use the new one, because it can easily be configured to create numeric ranges in addition to term ranges.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Given the complexity of customizing the new QueryParser, and given that numeric fields will likely be commonly used, I think this is an important issue. I think we should try to have the new QueryParser cleanly produce NumericRangeQuery, in 2.9.

EG expecting a user to do "option 1" (the "clean", more flexible option) is a tall order. Simple things should be simple...

The proposed RangeTools seems like a good approach, and I like how it cleanly absorbs the Date precisions that the old queryParser also supports.

But we better get cracking here since 2.9 is real close....!

Here's one side question about back-compat promises for the new QueryParser: we are suggesting that users can start from all the building blocks in StandardQueryParser and override the processors, create new nodes, builders, etc. with their own. But this is potentially dangerous, in that the next version of Lucene might change things such that your custom code doesn't work anymore. It's a lot like a core class being subclassed externally, where a change to the core class then breaks those external subclasses.

EG say we had not handled numerics for 2.9, and users go and do "option 2" (the quick & dirty, but simplest, way to get NumericRangeQueries out). Then, say in 3.1 we implement the proposed fix here ("option 1"). Suddenly, we've altered what nodes come out of the processor pipeline, because we've created a new NumericRangeQuery node, and so the builders that users had added for the RangeQuery node will no longer be invoked. How are we going to handle back-compat here?

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Personally, I don't think we should deprecate the standard QueryParser yet, and the new one should carry no back-compat policy. It needs to be fleshed out in a release before we tell users to move to it, IMO. Not enough committers have enough experience with it to promise back compat at this point, I think.

asfimport commented 15 years ago

Adriano Crestani (migrated from JIRA)

The proposed RangeTools seems like a good approach, and I like how it cleanly absorbs the Date precisions that the old queryParser also supports.

You meant DateTools, right?! I don't see much difference between this approach and "option 1". You have a map from field name to the DateTools.Resolution used for that field. That is the same feature we want to implement in this JIRA: something you could configure to control how the value defined in a range query is resolved, based on the field name. The only difference is that we are expanding the options the user will have to resolve the values: RangeUtils.NUMERIC, RangeUtils.DATE, RangeUtils.FLOAT, etc. Let me know if I missed or misunderstood something here.

Here's one side question about back-compat promises for the new QueryParser: we are suggesting that users can start from all the building blocks in StandardQueryParser and override the processors, create new nodes, builders, etc. with their own. But this is potentially dangerous, in that the next version of Lucene might change things such that your custom code doesn't work anymore. It's a lot like a core class being subclassed externally, where a change to the core class then breaks those external subclasses.

EG say we had not handled numerics for 2.9, and users go and do "option 2" (the quick & dirty, but simplest, way to get NumericRangeQueries out). Then, say in 3.1 we implement the proposed fix here ("option 1"). Suddenly, we've altered what nodes come out of the processor pipeline, because we've created a new NumericRangeQuery node, and so the builders that users had added for the RangeQuery node will no longer be invoked. How are we going to handle back-compat here?

I think it's already happening with the "old" QP. It used to output RangeQuery objects and now it outputs TermRangeQuery objects. How is it going to be handled by users expecting RangeQuery objects?

The "new" QP builder delegates a query node, based on its class, to a builder. If there is no builder that knows how to build an object from that class, it keeps looking up the class hierarchy until it finds a builder that does. Query nodes are supposed to be conceptual objects: they just represent some concept X, and ideally anything that fits this concept should inherit from it. This way users can create their own specific query nodes with no need to change how they are built (if there is no need for that). What I'm trying to say is that if I create a node Y which extends X, I don't need to specify a new YBuilder for it; the XBuilder will be used. So, ideally, NumericRangeQueryNode should extend RangeQueryNode. The problem is that we also need to specify a builder for NumericRangeNode, and if the user sets a builder for RangeNode it will never be invoked for NumericRangeNode objects. Maybe it shouldn't be at all, because if a new builder was specified for NumericRangeNode, it means a new kind of object should be built from it, something the user probably doesn't know about yet since it's a new kind of node, and their custom code needs to be updated anyway to support it.
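The lookup rule described above (try the node's own class, then walk up its superclasses) can be sketched like this. It is a simplified plain-Java model, not the actual QueryTreeBuilder: XNode, YNode and SketchBuilderRegistry are hypothetical names, builders return strings instead of Query objects, and only superclasses are consulted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical node for "concept X". */
class XNode {}

/** Hypothetical node Y which extends X. */
class YNode extends XNode {}

/** Simplified model of a builder map keyed by node class. */
class SketchBuilderRegistry {
    private final Map<Class<?>, Function<XNode, String>> builders =
            new HashMap<Class<?>, Function<XNode, String>>();

    void setBuilder(Class<? extends XNode> nodeClass, Function<XNode, String> builder) {
        builders.put(nodeClass, builder);
    }

    /** Uses the builder of the most specific registered class in the hierarchy. */
    String build(XNode node) {
        for (Class<?> c = node.getClass(); c != null; c = c.getSuperclass()) {
            Function<XNode, String> builder = builders.get(c);
            if (builder != null) {
                return builder.apply(node);
            }
        }
        throw new IllegalStateException("no builder registered for " + node.getClass().getName());
    }
}
```

With only an X builder registered, a YNode falls back to it. Once a Y builder is registered, it shadows the X builder for YNode objects, which is exactly the back-compat concern raised above.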

However, there is a solution for this kind of back-compat problem (which I don't think it is). In a future release, if a new XRangeQueryNode is created, instead of setting:

luceneBuilderMap.setBuilder(RangeQueryNode.class, new RangeQueryNodeBuilder());
luceneBuilderMap.setBuilder(XRangeQueryNode.class, new XRangeQueryNodeBuilder());

We could do:

rangeBuilderMap.setBuilder(RangeQueryNode.class, new RangeQueryNodeBuilder());
rangeBuilderMap.setBuilder(XRangeQueryNode.class, new XRangeQueryNodeBuilder());

// then

luceneBuilderMap.setBuilder(RangeQueryNode.class, rangeBuilderMap);

This way, if users replace the RangeQueryNode builder with their own builder, it will still be called for both XRangeQueryNode and RangeQueryNode objects.

Let me know if there is any question about what I just described.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I would propose to absorb the RangeTools/Utils and DateTools/Utils (what is the correct name???) into one configuration class (just a bigger enumeration with a good name, not Utils/Tools, e.g. RangeQueryDataType). With that you can simply define the type of a range query: term, numeric-int, numeric-float, numeric-double, date-precision-xxx,... Based on this enumeration, the upper/lower terms are parsed differently and different query objects are created. We just need to list all possible combinations of data types the user could create. We could make this class extensible if it is a Lucene Parameter class that also supports the parsing and building: one could simply create a new constant for a specific range type and supply methods to parse and build the query in the constant's implementation (so each constant also contains code to parse/build). I am not sure how to do this with the new parser. I am thinking of something like the MTQRewriteMethod (final static singletons in MTQ that do the rewrite and can be passed as a parameter).
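As a rough illustration of "each constant also contains code to parse", here is a plain-Java sketch of such an enumeration together with a per-field configuration map. The name RangeQueryDataType is taken from the suggestion above, but the constants, the parse method and the FieldTypeConfig class are assumptions made for this example; date precisions and the query-building half are left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical data type enumeration; each constant carries its own parsing code. */
enum RangeQueryDataType {
    TERM {
        @Override Object parse(String text) { return text; }
    },
    INT {
        @Override Object parse(String text) { return Integer.valueOf(text); }
    },
    LONG {
        @Override Object parse(String text) { return Long.valueOf(text); }
    },
    FLOAT {
        @Override Object parse(String text) { return Float.valueOf(text); }
    },
    DOUBLE {
        @Override Object parse(String text) { return Double.valueOf(text); }
    };

    /** Converts one user-entered range bound to the value type of this constant. */
    abstract Object parse(String text);
}

/** Hypothetical per-field "schema": field name to data type, defaulting to TERM. */
class FieldTypeConfig {
    private final Map<String, RangeQueryDataType> types =
            new HashMap<String, RangeQueryDataType>();

    void setType(String field, RangeQueryDataType type) {
        types.put(field, type);
    }

    Object parseBound(String field, String text) {
        RangeQueryDataType type = types.get(field);
        return (type == null ? RangeQueryDataType.TERM : type).parse(text);
    }
}
```

A Java enum is closed, so users could not add constants of their own; the extensibility asked for above would need a class with static singleton instances instead, as in the MTQ rewrite methods mentioned in the comment.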

Maybe we can also use this to upgrade the old query parser if it does not get deprecated.

I think it's already happening with the "old" QP. It used to output RangeQuery objects and now it outputs TermRangeQuery objects. How is it going to be handled by users expecting RangeQuery objects?

I was thinking about that, too. But here the API clearly defines that getRangeQuery() returns a Query object without further specification. So the change was correct from the API/BW side. The change that another object is returned is documented in CHANGES.txt (as far as I know). We have the same problem here: you change the inner class implementations, but the abstract QueryParser's API is stable. The general contract when doing such things is that you use instanceof checks before you try to cast some abstract return type to something specific that is not documented.

You have the same in various factories in the very bw-oriented JDK, too: XML factories create things like SAXParser and so on. If you cast the returned objects to some special implementation class, it's your problem, because you remove the abstraction and work with implementations. This happened e.g. in the change from Java 1.4 to 1.5, when the internal SAX parsers were exchanged and their class names changed. A lot of programs broke because of that, because the developers cast the objects returned from factories without instanceof checks.
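The instanceof contract described above looks roughly like this in caller code. The Sketch* classes are hypothetical stand-ins for an abstract return type and two concrete implementations; they are not Lucene or JDK classes.

```java
/** Hypothetical abstract return type promised by a factory method. */
class SketchQuery {}

/** Hypothetical implementation that older releases happened to return. */
class SketchTermRange extends SketchQuery {
    String lowerTerm() { return "a"; }
}

/** Hypothetical implementation introduced by a later release. */
class SketchNumericRange extends SketchQuery {}

class SketchCaller {
    /** Inspects a factory result without assuming its concrete class. */
    static String describe(SketchQuery query) {
        if (query instanceof SketchTermRange) {
            SketchTermRange termRange = (SketchTermRange) query;
            return "term range starting at " + termRange.lowerTerm();
        }
        // Unknown or newer implementation: fall back to the abstract contract.
        return "other query type: " + query.getClass().getSimpleName();
    }
}
```

An unconditional cast to SketchTermRange would throw a ClassCastException as soon as the factory starts returning a different implementation, which is the kind of breakage described for the SAX parser change.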

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I would propose to absorb the RangeTools/Utils and DateTools/Utils (what is the correct name???) into one configuration class

+1

However, there is a solution for this kind of back-compat problem (which I don't think it is).

Actually, on reading your explanation I agree it's not really a back compat break, since the user's custom builder for RangeQueryNode would still be invoked, and the core's builder for NumericRangeQuery would handle the newly added numeric range support. I think this is reasonable.

asfimport commented 15 years ago

Adriano Crestani (migrated from JIRA)

I would propose to absorb the RangeTools/Utils and DateTools/Utils (what is the correct name???) into one configuration class

+1, this way it is easier for the user to configure

I was thinking about that, too. But here the API clearly defines that getRangeQuery() returns a Query object without further specification. So the change was correct from the API/BW side. The change that another object is returned is documented in CHANGES.txt (as far as I know). We have the same problem here: you change the inner class implementations, but the abstract QueryParser's API is stable. The general contract when doing such things is that you use instanceof checks before you try to cast some abstract return type to something specific that is not documented.

Agreed, I also think it's fine as long as it's documented

asfimport commented 15 years ago

Luis Alves (migrated from JIRA)

I would propose to absorb the RangeTools/Utils and DateTools/Utils (what is the correct name???) into one configuration class

+1

I am not sure how to do this with the new parser. I am thinking of something like the MTQRewriteMethod (final static singletons in MTQ that do the rewrite and can be passed as a parameter).

I think we should probably have a TermRangeQueryNode, a NumericRangeQueryNode and two builder classes that match them, and change ParametricRangeQueryNodeProcessor to do the dirty work and create the new TermRangeQueryNode and NumericRangeQueryNode in the correct places, based on a map of [field name, RangeTools.TYPE] or something similar. The builders should be simple and just convert each type to the correct Lucene object.

- we should rename RangeQueryNode to TermRangeQueryNode (to match lucene name)
- create the new NumericRangeQueryNode that extends from TermRangeQueryNode
- change the ParametricRangeQueryNodeProcessor to read the configuration passed by the user and create the correct QueryNode objects.
- create a new NumericRangeQueryNodeBuilder and add it to the StandardQueryTreeBuilder mapping.

I hope this helps

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

If the existing query parser is not being deprecated, should this issue be pushed out to 3.0 or 3.1 to give it more time? In the meantime, people can use the existing override getRangeQuery() method. 2.9 is looking really close.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Finally read through this whole issue.

If the existing query parser is not being deprecated, should this issue be pushed out to 3.0 or 3.1 to give it more time? In the meantime, people can use the existing override getRangeQuery() method. 2.9 is looking really close.

+1 on pushing this. getRangeQuery() will still be first class.

It does seem like we should at least do this though:

we should rename RangeQueryNode to TermRangeQueryNode (to match lucene name)

asfimport commented 15 years ago

Luis Alves (migrated from JIRA)

If the existing query parser is not being deprecated, should this issue be pushed out to 3.0 or 3.1 to give it more time? In the meantime, people can use the existing override getRangeQuery() method. 2.9 is looking really close.

+1

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

we should rename RangeQueryNode to TermRangeQueryNode (to match lucene name)

I would not do this. RangeQueryNode is in the syntax tree, and the syntax of numeric and term ranges is identical, so the query parser cannot know what type of query it is. When this issue is fixed in 3.1, this node will use the configuration of data types for field names (date, numeric, term) to create the correct range query.

+1 on pushing this. getRangeQuery() will still be first class.

As noted in my comment on java-dev: we should add a comment in the Javadocs that the old (and also the new) query parser does not work automatically with NumericRangeQuery, and that you should override getRangeQuery() and do a case switch on the field name. I will do this later today.

asfimport commented 15 years ago

Adriano Crestani (migrated from JIRA)

we should rename RangeQueryNode to TermRangeQueryNode (to match lucene name)

I would not do this. RangeQueryNode is in the syntax tree, and the syntax of numeric and term ranges is identical, so the query parser cannot know what type of query it is. When this issue is fixed in 3.1, this node will use the configuration of data types for field names (date, numeric, term) to create the correct range query.

I think it's OK to rename. As far as I know, the standard.parser.SyntaxParser generates a ParametricRangeQueryNode from a range query, which has two ParametricQueryNodes as children. So the range processor will need to convert the two ParametricQueryNodes to the respective type, based on the user config: TermRangeQueryNode (renamed from RangeQueryNode) or NumericRangeQueryNode.

asfimport commented 13 years ago

Adriano Crestani (migrated from JIRA)

I think this is also a good candidate for GSoC 2011. I will add the labels to it.

Any comments?

asfimport commented 13 years ago

Adriano Crestani (migrated from JIRA)

Hi Uwe,

Are you willing to mentor this project in GSoC? If you are, I will keep it assigned to you; otherwise let me know so I can assign it to myself ;)

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I can help!

asfimport commented 13 years ago

Vinicius Barros (migrated from JIRA)

Hi Uwe and Adriano,

I read the description, the javadocs about the numeric support and the new query parser, and all the comments here. Let me try to summarize what needs to be done:

- figure out a way to configure in the query parser which fields are numeric; for each of these fields the numeric type must also be defined. From the comments, it seems the best way to do this is a map from field to NumericType

- create a NumericQueryNodeProcessor that converts ParametricRangeQueryNode to NumericRangeQueryNode when its field is a numeric field. This processor should also convert the range string values to numeric values based on the NumericType

- create a NumericRangeQueryNodeBuilder, which will build NumericRangeQuery objects from NumericRangeQueryNode objects

- rename RangeQueryNode to TermRangeQueryNode, as it will only be used for strings

- create a NumericRangeQueryNode, which will be used for any non-string range query

- merge DateTools with a new NumericTools class

Does that make sense? I am not sure if I got everything right here.

Some questions below:

Luis: create the new NumericRangeQueryNode that extends from TermRangeQueryNode

- should NumericRangeQueryNode extend TermRangeQueryNode? I don't see any reason for that, since one will hold Number values and the other String values

- I remember the old date query, using strings, used to allow not only range queries but also term queries (date:2010/10/10), is that correct? Do numeric fields also support this kind of query? I could only find NumericRangeQuery, but no NumericQuery. If the user enters (age:19) in the query, and "age" is a numeric field, should the query parser throw an error saying it's not supported?

I am planning to create a GSoC proposal for this project. It looks interesting, and this new numeric support in Lucene is very cool; I missed it the first time I used Lucene, maybe because I was used to regular databases. Also, the query parser uses some design patterns I have been reading about lately, such as builders and processors.

asfimport commented 13 years ago

Adriano Crestani (migrated from JIRA)

Hi Vinicius,

Nice summary! There is a formatting problem, but I think people can understand it.

I had to re-read all the comments, which took some time; it seems you got all the main points in the summary.

-should NumericRangeQueryNode extend TermRangeQueryNode? I don't see any reason for that, since one will hold Number values and the other String values

You are right, it does not make sense for one to extend the other. However, I think they should have a common parent (e.g. <interface> RangeQueryNode) with common methods like QueryNode getLowRange().
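A rough sketch of that common-parent idea, assuming a generic interface parameterized by the bound type. All class and method names below are hypothetical, and the bounds are held as raw values rather than as QueryNode children to keep the example short.

```java
// Hypothetical common parent for range nodes, parameterized by the bound type.
interface RangeQueryNode<T> {
  CharSequence getField();
  T getLowerBound();
  T getUpperBound();
  boolean isLowerInclusive();
  boolean isUpperInclusive();
}

// Shared state and accessors, so the two concrete nodes do not extend each other.
abstract class AbstractRangeNode<T> implements RangeQueryNode<T> {
  private final CharSequence field;
  private final T lower;
  private final T upper;
  private final boolean lowerInclusive;
  private final boolean upperInclusive;

  AbstractRangeNode(CharSequence field, T lower, T upper,
      boolean lowerInclusive, boolean upperInclusive) {
    this.field = field;
    this.lower = lower;
    this.upper = upper;
    this.lowerInclusive = lowerInclusive;
    this.upperInclusive = upperInclusive;
  }

  public CharSequence getField() { return field; }
  public T getLowerBound() { return lower; }
  public T getUpperBound() { return upper; }
  public boolean isLowerInclusive() { return lowerInclusive; }
  public boolean isUpperInclusive() { return upperInclusive; }
}

// Range over String bounds.
final class TermRangeNode extends AbstractRangeNode<String> {
  TermRangeNode(CharSequence field, String lower, String upper,
      boolean lowerInclusive, boolean upperInclusive) {
    super(field, lower, upper, lowerInclusive, upperInclusive);
  }
}

// Range over Number bounds.
final class NumericRangeNode extends AbstractRangeNode<Number> {
  NumericRangeNode(CharSequence field, Number lower, Number upper,
      boolean lowerInclusive, boolean upperInclusive) {
    super(field, lower, upper, lowerInclusive, upperInclusive);
  }
}
```

With this shape, code that only cares about the field or inclusiveness can work against RangeQueryNode<?>, while a builder can depend on the concrete bound type.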

I will let Uwe answer the other questions, I am curious to know the answers too :)

The student proposal period has started, so go ahead and start drafting it ;)

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I remember the old date query, using strings, used to not only allow range queries, but also term queries (date:2010/10/10), is that correct? Do numeric fields also support this kind of query? I could only find NumericRangeQuery, but no NumericQuery. If the user enters (age:19) in the query, and "age" is a numeric field, should the query parser throw an error saying it's not supported?

To create something like a NumericQuery (age:19), the correct and most performant way is to use a NumericRangeQuery with (includeLower==includeUpper)==true and (lower==upper)==value. Since Lucene 2.9, this query is always rewritten in the most optimal way (internally it uses a ConstantScore TermQuery on the prefix-encoded term). This should also be noted in the JavaDocs?

This is also how Solr's QP works.
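In Lucene itself this would presumably be a call such as NumericRangeQuery.newIntRange("age", 19, 19, true, true). The semantics of that degenerate range can be illustrated with a small self-contained model; IntRange and its methods are hypothetical and only mirror the inclusive/exclusive bound logic described above, not Lucene's implementation or its rewrite behaviour.

```java
// Hypothetical model of an integer range with inclusive or exclusive bounds.
final class IntRange {
  private final int lower;
  private final int upper;
  private final boolean includeLower;
  private final boolean includeUpper;

  IntRange(int lower, int upper, boolean includeLower, boolean includeUpper) {
    this.lower = lower;
    this.upper = upper;
    this.includeLower = includeLower;
    this.includeUpper = includeUpper;
  }

  // A single-value query such as age:19 becomes the closed range [19..19].
  static IntRange exact(int value) {
    return new IntRange(value, value, true, true);
  }

  boolean matches(int v) {
    boolean aboveLower = includeLower ? v >= lower : v > lower;
    boolean belowUpper = includeUpper ? v <= upper : v < upper;
    return aboveLower && belowUpper;
  }
}
```

Both bounds have to be inclusive: with lower==upper and either bound exclusive, the range matches nothing.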

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

-merge DateTools with a new NumericTools class. Does that make sense? I am not sure if I got everything correctly here.

Do you mean Lucene Core's org.apache.lucene.document.DateTools? This class is somehow deprecated (not officially), and none of those tool classes should be used together with NRQ. Or does the new QP have its own tools classes?

asfimport commented 13 years ago

Nikola Tankovic (migrated from JIRA)

Hi folks,

I'm a PhD student from Croatia, hoping to participate in GSoC this year. I work at a Croatian firm called Superius Ltd on software modelling based on a graph database (Neo4J, to be concrete). This issue sounds like a nice addition to Lucene that would also help us run the queries we need over our business data contained in the graph (indexed by Lucene). I was wondering whether this project is still open for GSoC or already assigned to someone?

Thank you!

asfimport commented 13 years ago

Vinicius Barros (migrated from JIRA)

Hi,

Thanks Uwe and Adriano,

I finally finished and submitted my proposal to this project, please, take a look and tell me if I need to change something. My linkid is viniciusbarros

Sorry for taking so long to submit it, but I only got some free time this weekend; college stuff is keeping me busy.

Uwe: I added to my proposal the idea of enabling the user to enter a query that searches for a single numeric value, example, age:19.

About DateTools, I think this can be decided later; in the end it's just a class with some format options the user may choose. Anyway, numeric fields already have a pre-defined way to format numbers as strings before indexing, right? Take a look at what I have defined in my proposal, where I allow the user to specify a Format object, which the query parser uses to parse the string value entered by the user into a Number. Please let me know if I am not getting something here.

asfimport commented 13 years ago

Adriano Crestani (migrated from JIRA)

Hi Nikola,

It's great that you are interested in submitting a proposal for gsoc this year. Proposal submissions are still open until the 8th (check the GSOC timeline at the website). It means you can submit a proposal to any project or even suggest your own project (in this case the community will need to accept it as a gsoc project).

There is already a proposal for this project, but feel free to submit another one. To raise your chances of getting into gsoc this year, I would suggest applying to a project with no candidates yet; check the Lucene projects here: https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=labels+%3D+lucene-gsoc-11

Good luck! ;)

asfimport commented 13 years ago

Adriano Crestani (migrated from JIRA)

Hi Vinicius,

Your proposal looks good. It details everything you intend to do, and the proposed solutions look good to me and include what the community has previously discussed.

+1 for adding support for simple numeric queries such as age:19

One thing I would suggest changing: make it clearer that the query parser you intend to change is the contrib query parser, and more specifically its standard implementation. You mention it only once in the entire proposal!

asfimport commented 13 years ago

Vinicius Barros (migrated from JIRA)

Thanks for reviewing it Adriano. I updated the proposal to clarify it's the contrib query parser.

asfimport commented 13 years ago

Vinicius Barros (migrated from JIRA)

This patch includes the work I did this first week. I started with one of the project's objectives: restructure RangeQueryNode and its related classes to support number and text range queries.

I created some querynode interfaces, such as ValueQueryNode, which abstracts the value a leaf node may hold, since leaf nodes no longer hold only text, but also number values.

Let me know if you have any questions or any suggestions about the code.

I hope I created the patch correctly, as it's the first time I have played with Subversion

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Vinicius,

thanks for your update! The patch looks correct, I assume it's for Lucene trunk? You should produce them against the top-level directory (below trunk/), not the lucene sub-directory (since Lucene was merged with Solr last year).

I have not looked closely at the code (came back from California recently), but your refactoring as a first step looks fine. I would only suggest never depending on the default locale (Robert Muir will tell you the same thing), so it should respect the locale given to the query parser.

I will report back when I have had time to look into it, but it looks really fine - also from the Generics Policeman Point of View g
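To illustrate why the default locale matters for numeric input, here is a minimal standalone sketch using only java.text.NumberFormat. LocaleAwareNumbers is a hypothetical helper, not part of the patch; it takes the locale as an explicit argument instead of falling back to Locale.getDefault().

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

// Hypothetical helper: parses user input with an explicitly supplied locale.
final class LocaleAwareNumbers {
  // Returns the parsed number, or null unless the whole string is a number
  // in the given locale.
  static Number parse(String text, Locale locale) {
    NumberFormat format = NumberFormat.getNumberInstance(locale);
    ParsePosition pos = new ParsePosition(0);
    Number n = format.parse(text, pos);
    return pos.getIndex() == text.length() ? n : null;
  }
}
```

With Locale.GERMANY the input "1,5" is read as one and a half, because the comma is the decimal separator there; with Locale.US the same string is not read as 1.5. A parser that silently used the JVM default would therefore build different range bounds on different machines.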

asfimport commented 13 years ago

Vinicius Barros (migrated from JIRA)

Hi Uwe,

Thanks for quickly reviewing the patch. Yes, I am using trunk's code. I will make the changes you suggested and include them in the next patch.

asfimport commented 13 years ago

Vinicius Barros (migrated from JIRA)

This is the patch for my second week of work.

asfimport commented 13 years ago

Vinicius Barros (migrated from JIRA)

The second patch includes the week1 changes plus: implemented classes to support numeric configuration in query parser

I was not sure what to do about the locale. The locale is required by EscapeQuerySyntax.escape, which seems to escape characters so they don't mix with the query parser's operators. I copied the code that uses the locale from FieldQueryNode, which uses the escaper and passes the default locale; however, other nodes such as RegexpQueryNode ignore the escaper and just return the plain text. I was not sure what to do, so I am forcing the locale to ENGLISH for now.

I also took a long time to figure out how to implement the numeric configuration; it seemed to me the best approach was to copy the way FieldBoostAttribute is configured. It's complex, but it's the only way I found without doing any ugly workaround.

Please take a look at the code and give me some suggestions in case you think I need to change something.

PS: the patch is now created from the trunk folder, as Uwe suggested

Thanks!

asfimport commented 13 years ago

Vinicius Barros (migrated from JIRA)

ah, one more thing. Uwe, what is "Generics Policeman Point of View"?

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

ah, one more thing. Uwe, what is "Generics Policeman Point of View"?

That's just my nickname, because I always watch correct usage of Java 5 Generics :-) I just wanted to confirm, that your generification of some classes looked fine.

I will revisit your patch tomorrow together with others here at Berlin Buzzwords (a conference about Lucene and other NoSQL related stuff).

Uwe

asfimport commented 13 years ago

Adriano Crestani (migrated from JIRA)

Vinicius: you are right, contrib QP configuration is very complicated, that's why Phillip is working on another GSOC project to make it simpler. So don't worry much about the best way to use the config API, since it will change, just make it work for now using the old API ;). You have a good point when you mentioned the escaper problem with Locale. I should think more about it...

Uwe: Are you intending to commit the patch only at the end of gsoc? Just wondering, since Vinicius is not selecting the ASF checkbox when submitting the patch, which means the current patches will not be able to be committed.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Vinicius,

if you want the code to be committed later, you should check the license box ("Grant license to ASF for inclusion in ASF works (as per the Apache License §5)"), else we will not be able to commit it to the main repository.

If you want us to commit the patch only at the end of GSOC, it's enough to check this box in your final submission, but it should be noted that we may commit minor parts of the work even before (once you are at a state where it is 'useable' and passes existing tests). A second commit could e.g. add sophisticated tests, and so on.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

One small thing I have seen after applying your patch: The code guidelines of Lucene require no TABS but two spaces to indent. We have a code style available for Eclipse and IDEA in the dev-tools folder (below trunk). You only have to install it.

Also you are using Java 6 interface overrides, so the code does not compile with Java 5 (unfortunately this is a bug in Java 6's javac, as it does not complain when in "-source 1.5" mode). In Java 5 compatible code it is not allowed to add @Override to methods implemented for interfaces:

common.compile-core:
    [mkdir] Created dir: C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr2\lucene\build\contrib\queryparser\classes\java
    [javac] Compiling 175 source files to C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr2\lucene\build\contrib\queryparser\classes\java
    [javac] C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr2\lucene\contrib\queryparser\src\java\org\apache\lucene\queryParser\core\nodes\FieldQueryNode.java:182: method does not override a method from its superclass
    [javac]     `@Override`
    [javac]          ^
    [javac] C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr2\lucene\contrib\queryparser\src\java\org\apache\lucene\queryParser\core\nodes\FieldQueryNode.java:187: method does not override a method from its superclass
    [javac]     `@Override`
    [javac]          ^
    [javac] C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr2\lucene\contrib\queryparser\src\java\org\apache\lucene\queryParser\standard\config\NumericFieldConfigListener.java:21: method does not override a method from its superclass
    [javac]     `@Override`
    [javac]          ^
    [javac] C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr2\lucene\contrib\queryparser\src\java\org\apache\lucene\queryParser\standard\nodes\AbstractRangeQueryNode.java:17: method does not override a method from its superclass
    [javac]     `@Override`
    [javac]          ^
    [javac] C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr2\lucene\contrib\queryparser\src\java\org\apache\lucene\queryParser\standard\nodes\AbstractRangeQueryNode.java:32: method does not override a method from its superclass
    [javac]     `@Override`
    [javac]          ^
    [javac] C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr2\lucene\contrib\queryparser\src\java\org\apache\lucene\queryParser\standard\nodes\AbstractRangeQueryNode.java:79: method does not override a method from its superclass
    [javac]     `@Override`
    [javac]          ^
    [javac] C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr2\lucene\contrib\queryparser\src\java\org\apache\lucene\queryParser\standard\nodes\NumericQueryNode.java:20: method does not override a method from its superclass
    [javac]     `@Override`
    [javac]          ^
    [javac] C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr2\lucene\contrib\queryparser\src\java\org\apache\lucene\queryParser\standard\nodes\NumericQueryNode.java:25: method does not override a method from its superclass
    [javac]     `@Override`
    [javac]          ^
    [javac] C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr2\lucene\contrib\queryparser\src\java\org\apache\lucene\queryParser\standard\nodes\NumericQueryNode.java:35: method does not override a method from its superclass
    [javac]     `@Override`
    [javac]          ^
    [javac] C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr2\lucene\contrib\queryparser\src\java\org\apache\lucene\queryParser\standard\nodes\NumericQueryNode.java:52: method does not override a method from its superclass
    [javac]     `@Override`
    [javac]          ^
    [javac] C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr2\lucene\contrib\queryparser\src\java\org\apache\lucene\queryParser\standard\nodes\NumericQueryNode.java:57: method does not override a method from its superclass
    [javac]     `@Override`
    [javac]          ^
    [javac] C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr2\lucene\contrib\queryparser\src\java\org\apache\lucene\queryParser\standard\parser\JavaCharStream.java:367: warning: [dep-ann] deprecated name isnt annotated with `@Deprecated`
    [javac]   public int getEndColumn() {
    [javac]              ^
    [javac] C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr2\lucene\contrib\queryparser\src\java\org\apache\lucene\queryParser\surround\parser\CharStream.java:34: warning: [dep-ann] deprecated name isnt annotated with `@Deprecated`
    [javac]   int getColumn();
    [javac]       ^
    [javac] C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr2\lucene\contrib\queryparser\src\java\org\apache\lucene\queryParser\surround\parser\CharStream.java:41: warning: [dep-ann] deprecated name isnt annotated with `@Deprecated`
    [javac]   int getLine();
    [javac]       ^
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] 11 errors
    [javac] 3 warnings

With Java 6 the code compiles, but some tests fail. I assume it's simply because of the work-in-progress.
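A minimal illustration of the @Override rule behind those errors, using hypothetical names (ValueHolder, TextHolder) rather than the classes from the patch:

```java
// Hypothetical interface standing in for a query node interface.
interface ValueHolder {
  Object getValue();
}

class TextHolder implements ValueHolder {
  private final String text;

  TextHolder(String text) {
    this.text = text;
  }

  // No @Override here: this method implements an interface method, and a
  // Java 5 compiler rejects the annotation in that position.
  public Object getValue() {
    return text;
  }

  // @Override is fine here even under Java 5, because toString() overrides
  // a method of the superclass (Object).
  @Override
  public String toString() {
    return "TextHolder(" + text + ")";
  }
}
```

Written this way, the class compiles the same under Java 5 and Java 6.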

asfimport commented 13 years ago

Vinicius Barros (migrated from JIRA)

Hi Uwe,

Thanks for reviewing the patch again. I will fix the problems you mentioned.

I do not think the code is ready to be committed, I am just sending the patches so you can keep track of my progress. I hope to have something useable soon, then you can commit, probably before the end of gsoc.

asfimport commented 13 years ago

Adriano Crestani (migrated from JIRA)

Hi Vinicius,

Assuming you are using Eclipse, you can find the code style used for Lucene code at the bottom of this page: http://wiki.apache.org/lucene-java/HowToContribute. It will fix the indentation problem Uwe mentioned.

After reviewing the code, I want to remind you that all files must have the ASF header. Take a look at the other Java classes in the Lucene repository for an example.

The way you are organizing the code looks good to me. Just make sure that whenever you add a new class to the contrib query parser, you place it under the right package: "core", if the class is generic and might be used by other queryparser implementations; "standard", if the class is specific to the Lucene standard query parser implementation.

asfimport commented 13 years ago

Adriano Crestani (migrated from JIRA)

One more thing I forgot to mention: when creating new QueryNodes, try to make the constructor force the user to pass the required arguments. For example, NumericQueryNode does not declare any constructor; I would suggest changing it to NumericQueryNode(CharSequence field, Number number, NumberFormat format).
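A simplified sketch of such a constructor follows. This NumericQueryNode is a hypothetical standalone class: the real node would extend the query parser's node types, and the getters and toQueryString here are assumptions made only for illustration.

```java
import java.text.NumberFormat;

// Hypothetical, simplified node that cannot be created without its required state.
final class NumericQueryNode {
  private final CharSequence field;
  private final Number value;
  private final NumberFormat format;

  // The only constructor: callers must supply field, value and format.
  NumericQueryNode(CharSequence field, Number value, NumberFormat format) {
    if (field == null || value == null || format == null) {
      throw new IllegalArgumentException("field, value and format are required");
    }
    this.field = field;
    this.value = value;
    this.format = format;
  }

  CharSequence getField() {
    return field;
  }

  Number getValue() {
    return value;
  }

  // Renders the node roughly as it might appear in a query string.
  String toQueryString() {
    return field + ":" + format.format(value);
  }
}
```

Because there is no no-argument constructor, a node in a half-initialized state cannot exist, so processors and builders downstream do not need null checks for these fields.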

asfimport commented 13 years ago

Vinicius Barros (migrated from JIRA)

This patch includes:

-changes suggested by Adriano and Uwe: removed @Override, applied the Lucene code style, created a constructor for NumericQueryNode, added the ASF header to new classes

-implemented the new NumericQueryNodeProcessor and NumericRangeQueryNodeProcessor
