apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Add State To QueryVisitor [LUCENE-8882] #9925

Open asfimport opened 5 years ago

asfimport commented 5 years ago

QueryVisitor has no state passed in either up or down recursion. This limits the range of decisions that can be made while visiting a query tree. For example, #9924 needs a way to specify whether the visitor is a rewriter visitor.

This Jira proposes adding a property bag model to QueryVisitor, which the Query instance being visited can then consult.
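A minimal sketch of what such a property bag could look like. Everything here is hypothetical and self-contained: `StatefulVisitor`, `SketchQuery`, `SketchLeafQuery`, and the `IS_REWRITER` property name are illustrative stand-ins, not Lucene's `QueryVisitor` API or anything proposed in a patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for a QueryVisitor that carries state; not Lucene's API.
class StatefulVisitor {
    private final Map<String, Object> properties = new HashMap<>();

    // Store a property and return this visitor so calls can be chained.
    StatefulVisitor setProperty(String key, Object value) {
        properties.put(key, value);
        return this;
    }

    // Look up a property; empty if it was never set.
    Optional<Object> getProperty(String key) {
        return Optional.ofNullable(properties.get(key));
    }

    // Convenience check for boolean flags; absent counts as false.
    boolean isTrue(String key) {
        return Boolean.TRUE.equals(properties.get(key));
    }
}

// Hypothetical query node that consults the visitor's state when visited.
interface SketchQuery {
    String visit(StatefulVisitor visitor);
}

class SketchLeafQuery implements SketchQuery {
    // Assumed property name, chosen only for this illustration.
    static final String IS_REWRITER = "IS_REWRITER";

    private final String term;

    SketchLeafQuery(String term) {
        this.term = term;
    }

    @Override
    public String visit(StatefulVisitor visitor) {
        // The leaf changes its behaviour based on state carried by the visitor.
        return visitor.isTrue(IS_REWRITER)
            ? "rewrite(" + term + ")"
            : "inspect(" + term + ")";
    }
}
```

With this shape, the same leaf could behave differently depending on whether the caller created the visitor with `setProperty(SketchLeafQuery.IS_REWRITER, true)` or left the property unset.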


Migrated from LUCENE-8882 by Atri Sharma (@atris), updated Jul 02 2019

asfimport commented 5 years ago

Atri Sharma (@atris) (migrated from JIRA)

I think this is useful even outside #9924. It allows upper queries to collect metadata about the lower, leaf-level queries and make decisions based on it (motivated by the excellent recent work that uses the properties of a sorted index to perform binary searches on docIDs). For example, a property such as INDEX_SORTED could be populated at some query and be visible to the entire query tree; a query then looks at the property and decides to use a specific type of query. This could even be folded into the cost of the query, but in a localised form, so that not all heuristics are crammed into one specialized query (IndexOrDocValues?).

Objections/Thoughts/Comments?

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Can you elaborate more on how this would help replace IndexOrDocValues?

asfimport commented 5 years ago

Atri Sharma (@atris) (migrated from JIRA)

My idea was not to replace IndexOrDocValues, but to allow it to be more generally applicable.

For example, take the optimized query that applies only in the limited cases where the index is sorted: ideally we would use that query over point values (even though it is a docvalues-based implementation). However, the query is too specialized for IndexOrDocValues to factor in.

What I was envisioning is that, at the start of the query, IndexSearcher creates a QueryVisitor, sees that the index is sorted by key X, and populates a property in the QueryVisitor's metadata (INDEX_SORTED_KEY=X).

IndexOrDocValuesQuery then, instead of immediately deciding whether to use Points or DocValues, passes the visitor on to both branches. Further down the line, the sorted index query type sees the metadata in the visitor and volunteers itself by adding another property to the visitor's metadata (SORTED_PLAN_AVAILABLE=true or something).

In the end, IndexOrDocValues will perform an evaluation that combines the cost estimation it does today with the metadata gathered from both branches, and then choose the branch to execute. This will allow new query types for specific use cases to be added easily (just add a new property type and a listener query for it), and let the engine make better decisions about which queries to execute when, which can potentially lead to better query performance.
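The flow described above could be sketched roughly as follows. This is a self-contained illustration, not Lucene code: `PlanContext`, `PlanBranch`, `PointsBranch`, `SortedFieldBranch`, and `BranchChooser` are hypothetical names, the costs are made-up constants, and the rule that a volunteering branch wins over plain cost comparison is an assumption made only for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical metadata bag handed down the query tree; not Lucene's API.
class PlanContext {
    static final String INDEX_SORTED_KEY = "INDEX_SORTED_KEY";
    static final String SORTED_PLAN_AVAILABLE = "SORTED_PLAN_AVAILABLE";

    final Map<String, Object> properties = new HashMap<>();

    PlanContext() {}

    // Copy constructor so each branch records its own state.
    PlanContext(PlanContext other) {
        properties.putAll(other.properties);
    }

    boolean isTrue(String key) {
        return Boolean.TRUE.equals(properties.get(key));
    }
}

// Hypothetical branch: reports a cost and may record metadata while visited.
interface PlanBranch {
    long cost();
    void visit(PlanContext context);
}

class PointsBranch implements PlanBranch {
    public long cost() { return 100; } // made-up cost
    public void visit(PlanContext context) { /* nothing to volunteer */ }
}

class SortedFieldBranch implements PlanBranch {
    private final String field;

    SortedFieldBranch(String field) {
        this.field = field;
    }

    public long cost() { return 500; } // made-up cost

    public void visit(PlanContext context) {
        // Volunteer only when the index is sorted on the field this branch queries.
        if (field.equals(context.properties.get(PlanContext.INDEX_SORTED_KEY))) {
            context.properties.put(PlanContext.SORTED_PLAN_AVAILABLE, true);
        }
    }
}

class BranchChooser {
    // Visit both branches, then combine gathered metadata with cost.
    static PlanBranch choose(PlanBranch first, PlanBranch second, PlanContext topLevel) {
        PlanContext firstState = new PlanContext(topLevel);
        PlanContext secondState = new PlanContext(topLevel);
        first.visit(firstState);
        second.visit(secondState);

        boolean firstSorted = firstState.isTrue(PlanContext.SORTED_PLAN_AVAILABLE);
        boolean secondSorted = secondState.isTrue(PlanContext.SORTED_PLAN_AVAILABLE);

        // Assumed rule: a branch that volunteered a sorted plan wins outright.
        if (firstSorted != secondSorted) {
            return firstSorted ? first : second;
        }
        // Otherwise fall back to the cheaper branch.
        return first.cost() <= second.cost() ? first : second;
    }
}
```

In this sketch the top-level caller plays the role of IndexSearcher by putting `INDEX_SORTED_KEY` into the context before calling `BranchChooser.choose`. Giving each branch its own copy of the context is one way to let the chooser tell which branch volunteered; a real design might weigh the metadata into the cost instead of letting it override cost.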

 

Thoughts?