apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.69k stars 1.04k forks source link

Make all query classes serializable, and provide a query parser to consume them [LUCENE-4012] #5085

Open asfimport opened 12 years ago

asfimport commented 12 years ago

I started off on #5077 wanting to use DisjunctionMaxQuery via a parser. However, this wasn't really because I thought that human beans should be improvisationally composing such thing. My real goal was to concoct a query tree over here, and then serialize it to send to Solr over there.

It occurs to me that if the Xml parser is pretty good for this, JSON would be better. It further occurs to me that the query classes may already all work with Jackson, and, if they don't, the required tweaks will be quite small. By allowing Jackson to write out class names as needed, you get the ability to serialize any query, so long as the other side has the classes in class path. A trifle verbose, but not as verbose as XML, and furthermore squishable (though not in a URL) via SMILE or BSON.

So, the goal of this JIRA is to accumulate tweaks to the query classes to make them more 'bean pattern'. An alternative would be Jackson annotations. However, I suspect that folks would be happier to minimize the level of coupling here; in the extreme, the trivial parser could live in contrib if no one wants a dependency, even optional, on Jackson itself.


Migrated from LUCENE-4012 by Benson Margulies (@bimargulies-google), 1 vote, updated May 20 2019 Attachments: bq.patch

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

My first question is, what about backward compatibility requirements? The problem is if people start using such a structure ('remote api') that depends upon the structure of our java source code, then they will be upset if we break it.

If we totally change a Query's API, does that push all the responsibility of the API designer to deal with serialization backwards compat? APIs are difficult enough as-is just for java consumers.

This seems similar to the java serialization issue (which we removed for this reason). Can't the serialization be totally independent?

asfimport commented 12 years ago

Benson Margulies (@bimargulies-google) (migrated from JIRA)

Now you've got me thinking. Several points:

First, Jackson is far, far, less implementation-sensitive than the built-in Java serialization. It looks at the public API of constructors, getters, and setters. How likely are you to change the API in a way that invalidates the idea that a TermQuery consists of a term name and a value?

Second, this is not about long-term persistence. I'd be happy to document it as 'do not expect to move one of these across versions.' It's most obvious application is to move arbitrary queries around in SolrCloud.

However, thirdly, you want it independent? I can make it independent. Here's how. Let's assume that you are willing to state the same sort of invariants for a query object that are embodied in parser support. Once you make a parser for something, you've promised that it has (at least) a particular set of inputs until the next major version, yes?

So, for each Query object class, make a Jackson mixin class, that maps the 'official inputs' to the current collection of setters/constructor arguments. Want to re-wrangle the query classes? If you actually change the constructors/setters that you would use in a query parser for a particular query, you then have to change the mixin class.

If it were up to me, I'd start by annotating the actually classes themselves with Jackson @nnotations for this purpose, and only create mixins when and if an incompatible change happened, but I'm not really opinionated.

Would it be sensible to start to concoct a contrib module with all the jackson in it, accompanied by benign 'add a few getters and setters' classes to the queries?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

First, Jackson is far, far, less implementation-sensitive than the built-in Java serialization. It looks at the public API of constructors, getters, and setters. How likely are you to change the API in a way that invalidates the idea that a TermQuery consists of a term name and a value?

TermQuery currently has a bogus constructor today:

/** Expert: constructs a TermQuery that will use the
  *  provided docFreq instead of looking up the docFreq
  *  against the searcher. */
public TermQuery(Term t, int docFreq) {

Its bogus because some scoring models may not use docFreq at all (e.g. language modelling): I know why it exists (for one contrib query), I think we have a plan to fix it, but its just not yet done... like many things :)

But thats an example of one I think is fair to remove in a minor release: besides being bogus, its documented as Expert, so its fair game.

I'm just bringing this up as a counterexample... I thought serialization was harmless before myself until I tried to make useful changes (like yanking vector-space model out of the scoring system) and was totally blocked by serialization. So I'm gonna be suspicious of the word no matter what :)

If it were up to me, I'd start by annotating the actually classes themselves with Jackson @nnotations for this purpose, and only create mixins when and if an incompatible change happened, but I'm not really opinionated.

Are these annotations in java 6 API? Annotations are no different than other pieces of code, they are an additional responsibility. As far as @Deprecated, we fail the build if we add @deprecated javadocs but forget it, as far as @Override, well nothing automated yet, but we should be somehow enforcing that as well. Besides the fact we don't want to add any dependencies to the lucene core, how would we know such annotations were correct?

Would it be sensible to start to concoct a contrib module with all the jackson in it, accompanied by benign 'add a few getters and setters' classes to the queries?

Don't let me get in your way: I'm just the devil's advocate when i hear serialization. As far as adding getters and setters to queries I'm not sure if they are benign or not without us looking at them. For example, AutomatonQuery and all of its subclasses: Wildcard, Regexp, etc are totally immutable once created. This is for good reasons so I don't think we should add any setters to these classes, and the stuff it stores internally shouldn't be accessible via a getter.

asfimport commented 12 years ago

Benson Margulies (@bimargulies-google) (migrated from JIRA)

Rob,

These are not Java annotations, they are specific to Jackson. (Though Jackson will pay attention to the JAX-B annotations in Java 6, I fear that this is a cure worse than the disease.) So hanging them on the real classes makes for a hard dependency on Jackson to compile, and I'm not sure myself that this is a good idea in the middle of Lucene. If I stick to the mixin approach, no annotation are needed at all.

As for the rest, I think that we agree. The point of the annotations is to only serialize the stuff that makes up the plausible, stable, sort of attributes that you would support in a parser syntax. (though I suppose someone might want to SolrCloudify some really expert complexity), and the selective annotation avoids the rest.

So I'm going to sketch something out and we'll all see what it looks like.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yeah a sketch (maybe just Term and Boolean or something ?) would be cool, maybe my paranoia is unjustified... then we could see what it actually would need to look like and think about the backwards-compatibility/API costs etc would be.

asfimport commented 12 years ago

Benson Margulies (@bimargulies-google) (migrated from JIRA)

Add accessor to BooleanQuery.java to allow serializing 'disable-coord'.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

My only comment on that: it would help perpetuate this inverted 'disable/omit' stuff that i just cant stand. I really wish BQ had 'enableCoord' in the ctor with true as the default: of course a horrible thing to try to fix but we were able to fix this for omitTF etc so I'm not yet losing hope. :)

asfimport commented 12 years ago

Benson Margulies (@bimargulies-google) (migrated from JIRA)

git://github.com/bimargulies/lucene-json-qp.git

Rob, here you will find the promised look at how this works. No QP yet, that's the easy part.

Apologies for the Maven project, but that's what I do quickly.

No patches to core required at all yet. if you run the unit test, you will see what the json looks like. Concise, no surprise, it's not.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The query integration at a glance looks very non-invasive from my perspective!

I think we should pursue this? We might run into some tricky parts but we could have this component as a separate pluggable module with probably minimal changes to the core Query apis right?

asfimport commented 12 years ago

Benson Margulies (@bimargulies-google) (migrated from JIRA)

Right. The tiny patch is an example of the sort of change, and I think it's always good to have 'getters'.

In fact, I don't even need that patch for the code to work, but i thought it was a good idea.

I'll continue to set this up. Let me know when/how/if you want it shaped to go into the tree somewhere.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think if it works for you, just iterate in your github and ping the issue when you make progress? Otherwise we worry too much about where/how the code sits when that isn't really so important at this stage.

As far as final integration, I think there are a number of ways to do it but its really unrelated to making progress here. One suggestion might be to split queryparser/ module like analyzers/ so we have:

This could probably make things less confusing, as currently queryparser/ is a mix of different frameworks with different dependencies (e.g. xml depends on queries/ and sandbox/, but the others dont, and json will depend on jackson and maybe other stuff, etc, etc).

And then finally, probably a followup issue to do solr integration (i'm more fuzzy on that).

asfimport commented 5 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

I want to hijack this issue to be about maing Query serializable by any means necessary. The idea of using Jackson seemed like it could be problematic since it tends to expose implementation details (constructor signatures, eg), but the idea of query serialization is powerful, and we should have it in our bag of tricks. A whole class of optimizations stems from analysis of query logs, and in order to treat queries as data we need a persistent form for them (not just in-memory java Query objects).

It seems like we have a good angle of attack since #4114 landed, adding a QueryVisitor. My thought is that each query parser could potentially come with a serializer that serializes queries into its language, since not every parser can represent every query type. Or maybe XML query parser is truly general and handles everything thus there is no need for any other flavor? I'm not sure though I seem to recall it has some gaps as well.

I worked up a POC that serializes combinations of Boolean and TermQuery into a form that is parseable by classic query parser, and I think it can be extended pretty easily to cover most query types. I have a question here: to get it to work it seemed as if I needed to make BooleanQuery.visit call getSubVisitor for every clause (rather than once for each occur-value). This broke a single test though in TestQueryVisitor that asserts something about the sequence of these calls, and I'm not sure if that assertion is an invariant of the QueryVisitor contract, or whether it is simply a byproduct of the implementation. @romseygeek can you shed some light? I can post a WIP PR if that would help clarify.

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

My thought is that each query parser could potentially come with a serializer that serializes queries into its language, since not every parser can represent every query type

FYI it's not only about the query types that a parser supports. For instead the classic query parsers creates synonym queries for terms that occur at the same position, but it doesn't have a way to create a synonym query explicitly from 2 or more terms.

I agree that serialization of queries is important, but in my opinion it makes things simpler to add serialization support to another, simpler, query layer that would sit on top of Lucene and that only translates into Lucene queries for the purpose of query execution. This layer doesn't have to care of the details regarding how scores are merged across fields (dismax vs. bool vs. BM25FQuery), whether any query terms included synonyms, how the field was indexed (range query on points or terms don't translate to the same Lucene query), whether some queries were introduced that only affect execution paths but not matched results (IndexOrDocValueQuery), etc.