apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.47k stars 980 forks source link

Make boolean query clause limit configurable per-query [LUCENE-7880] #8931

Open asfimport opened 7 years ago

asfimport commented 7 years ago

As we know, the magic BooleanQuery.maxClauseCount has bitten many people over time. It's also a static, which really hurts multi-tenancy (i.e. we can't have different settings for different users, clients, or use-cases).

If we want to keep this static as a default, then at least we should allow it to be overridden on a per-query basis when we know it is the desired behavior and not a bug.

Perhaps the simplest way to achieve this would be a setter on BooleanQuery.Builder that configures the limit for that instance only?


Migrated from LUCENE-7880 by Yonik Seeley (@yonik), updated Jun 20 2017 Linked issues:

asfimport commented 7 years ago

David Smiley (@dsmiley) (migrated from JIRA)

+1 fine idea. With this new proposed option, BooleanQuery.maxClauseCount could still exist but only as a default for when you don't go out of your way to set the max on the Builder.

asfimport commented 7 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Anyone object to the idea of a getter/setter on BooleanQuery.Builder? If not, I can work up a patch.

asfimport commented 7 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I think of the max clause count as something that should never be changed. What is the use-case for having more than 1024 clauses? It sounds like the user is using a BooleanQuery when he should actually use a TermInSetQuery? I am not a fan of adding a per-BooleanQuery maxClauseCount. Moreover it feels like an unnecessary API to me: users can already check the number of clauses by themselves, which is not harder than calling a setter?

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

+1

My first reaction was that burdening the query writer with changing this parameter would be pretty trappy. Then I realized that we could make it a default or invariant either for all request handlers or individual ones. Individual request handlers could even have different values, right?

If that's true, then I'd favor deprecating (and eventually removing support in 8.0) the naked <maxBooleanClauses> from solrconfig. Move it to an initParams or just leave it out of the config completely and have it default in the code to whatever we want if not specified on the query.

Really though any solution that got us around the fact that "last solrconfig maxBooleanClauses loaded wins" trap works for me. +1 to this solution b/c it gives us quite a bit of flexibility without making things trappy.

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Adrien:

We simply cannot create arbitrary limits and say "you should learn to do a better search and not need to ever exceed X" because we cannot anticipate the "meta issues" Solr or Lucene users have to deal with.

Here are some use-cases I've seen in the wild:

I have seen very complex queries in legal situations that run to 20K + raw input before rewrite. Ditto pharma applications. If you're investing 750M in drug development you want to be very, very, very sure that there is no prior art you'd violate. These organizations pay people who create amazingly complex queries and mix-n-match them to insure that they're not infringing on a patent.

I've seen these organizations create 4K reusable sub-queries that they then combine in various ways to accomplish their goals. A dozen at a time. Or more.

It gets worse in the legal arena. I can and have said from an IR standpoint, why are you searching for (runn OR runni OR runnin OR running) Why not just stem? Ans: Because I can defend this search. I cannot explain in court that "we did (or didn't) find document X because the search engine we're using uses a stemming algorithm that did (or didn't) match". And there may be hundreds and hundreds of such clauses.

Genome matching. Believe it or not I've worked with people using Solr/Lucene for matching genome sequences. The complexity here is beyond belief. Machine generated to be sure.

These kinds of domains often do not care how long a search takes. Well, not quite true but for some applications "10 minutes? Great!". I once worked with an organization that had, prior to Solr, a 48 hour turn-around and were happy with anything less than an hour and could live with longer ones upon occasion. Not very many users, true. So my reflexive reaction of "well, these will be expensive queries, it may take some time to return" is invalid.

So you see why I'll -1 imposing limits that aren't absolutely necessary every time.

asfimport commented 7 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

This is not the only arbitrary limit we have. For instance terms can't be greater than 32KB and you can only use edit distances of 1 or 2 with FuzzyQuery.

In some of the cases you mentioned, it seems to me that users should be using TermInSetQuery instead. Maybe a few of them actually need scoring, which is only provided by BooleanQuery. However, to me those are corner cases, and I don't like the idea of extending the API of an important class like BooleanQuery just to help deal with corner-cases.

To me the best option is to keep things as they are today: let users know when they are using BooleanQuery in ways that won't perform well, yet still allow a minority of corner-cases that need greater limits to increase the max clause count through the static setter.

asfimport commented 7 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

users can already check the number of clauses by themselves

That only works to lower the limit, not raise it. But it is more straightforward to raise the static limit and do checking at a higher level, which allows for choices to be made on context. I've proposed this in SOLR-4586

asfimport commented 7 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

This is not the only arbitrary limit we have. For instance terms can't be greater than 32KB and you can only use edit distances of 1 or 2 with FuzzyQuery.

At least some of these upper limits are certainly not arbitrary (and hence not configurable). AFAIK, simply increasing these numbers would not result in working code, hence the limits represent real technical limitations. IIRC, the history behind the 32KB is that the length was packed in an java char (16 bits).

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

bq: This is not the only arbitrary limit we have

This is not a justification to impose limits where there's no technical reason to. And imposing a limit on maxbooleanclauses that cannot be changed would take away capabilities that at least some users currently rely on.

bq: In some of the cases you mentioned, it seems to me that users should be using TermInSetQuery instead

Sure. I never claimed that there weren't ways around some/all of the individual cases. My point is there's case N+1 that we haven't thought of and arbitrary, unnecessary limits do not serve the end users well. And taking away current capabilities requires a very good technical reason in my view.

Oh, and I didn't even mention that the queries can be distributed across N fields by edismax.

Yes, yes, yes. That reflects sub-optimal design. User's don't care. Once the performance/functionality is "good enough" in their eyes, whether it could be made better is mostly irrelevant. And forcing them deep into the code in order to do what they're doing now when there's no technical reason to simply creates another barrier they have to overcome. Premature optimization and all that.

asfimport commented 7 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I'm all for trappy limits that pop up out of the blue because devs know better than me about my resources and use cases and where stuff should just start failing for my own good with runtime exceptions.. I've been meaning to make sure you can only have 5 segments before exceptions are thrown as well. A lot of performance seems to drop off that and users might be get confused without their app breaking at 5. Once I sneak that in, I intend to veto lock it till 2035 when hardware might be up to the task of tearing me from my death veto grip, as which time 6 segments can be the new dev knows when to exception screw you limit.

No wait. I take it back. That was the paranoid inducing vodka talking.

asfimport commented 7 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Mark, what is your proposal?

This issue looks like a tentative to workaround the fact that there was disagreement about removing the limit entirely in #5900. However adding a per-query API feels even worse to me. It adds an unnecessary method to one of the main classes of our API, and will contaminate many other classes like query parsers, maybe MoreLikeThis, etc. if we want to be able to set different limits on boolean queries that these classes generate too.

asfimport commented 7 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

First, lets assume that having this global limit is a valuable way to catch bugs (I disagree, but let's assume it for the sake of argument). Now what if someone finds a place in their code where they legitimately need to build a bigger scoring boolean query? What should they do? If they raise the global static limit, they lose this valuable way to catch bugs. If one accepts the first assumption, then it seems like one needs a way to override this limit on a per query basis. If one does not accept the first assumption, then the limit should simply be removed.

Even at a higher granularity, what if one collection wants one limit and a different collection wants another limit (they could even be different customers collocated in the same JVM). How should that be solved? See SOLR-4586 for a more background / arguments.

If you are against the proposal in this issue, then what is your proposal to address these issues Adrien?

asfimport commented 7 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I think the source of the disagreement is that to me there is almost no good reason to build large boolean queries. So users who really need to do this can increase the limit. They would indeed lose the safety of that limit but it still sounds like a better trade-off to me than removing the max clause count entirely or making it configurable per-query.

As far as multi-tenancy is concerned, it should still be extremely rare that large boolean queries are needed on a per-cluster basis. So I'd make it a per-cluster option and raise the limit if any tenant needs to build large boolean queries.

I understand my assumptions do not seem to match the reality some other committers are seeing based on the comments on SOLR-4586. Also I don't want to prevent Solr from setting a limit equal to MAX_VALUE or making it configurable per-core given my position. So I won't object if you want to follow either of those paths but I still think we should keep things as they are in Lucene since it is our best option from a Lucene perspective.

asfimport commented 7 years ago

David Smiley (@dsmiley) (migrated from JIRA)

By the way, this maxBooleanClauseCount is, I think, extremely similar to maxDeterminizeStates for AutomatonQuery (thus WildCardQuery, RegexpQuery). Both ideas are self-imposed limits that usually don't need to be increased beyond the defaults but cases happen. In both cases, a RuntimeException derivative is thrown if the limit is exceeded. maxDeterminizeStates is configurable for the query instance, and the setting is configurable in the classic query parser's QueryParserBase. maxBooleanClauseCount has neither. Why the difference @jpountz?

asfimport commented 7 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

The difference is probably that people that also work on Solr were not pushing for that. It's easy to get cooperation or an attitude of consensus building in that case. It's also simple to add technically simple enhancements without religious ideas on clearly subjective changes.

asfimport commented 7 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Agreed with you David that it is similar to the maximum number of determinized states on AutomatonQuery. The difference to me is that AutomatonQuery is an expert API while BooleanQuery is one of the main classes of our API and I'd like to keep it lean. I think it is a good example actually: imagine that we want to remove this setter in order to simplify the API, it is going to be a backward break not only on AutomatonQuery but also on other classes that propagate this option like query parsers. These features are much harder to remove than to add. The addition of a single method feels harmless, but eventually they add up and make Lucene harder to understand and use than it should be.

It's easy to get cooperation or an attitude of consensus building in that case

You're telling me about cooperation and consensus while all you have done on that issue so far is describing my arguments as religious views and implying that I'm reasoning like a drunk paranoid. You're just trying to shut me down, not to convince me in order to reach consensus.

asfimport commented 7 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

I've run into the limit when building classifier queries, which are often a combination of many-leaved disjunction queries and a fairly complex exclusion. It seems to be a bit of a blunt instrument. If it's there to avoid users building massive slow queries when they ought to be using a TermsInSetQuery instead, could we not deal with that case in rewrite()? If the query is a pure disjunction, then collect all the TermQuery leaves and build them into a TermsInSetQuery. You could even try and do this only for unique terms, so that scoring isn't effected?

asfimport commented 7 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Rewriting to a TermInSetQuery would not work today since we do not know about whether scores are needed in rewrite. We might be able to change that though. Also we'd need to be careful about infinite loops since TermInSetQuery already rewrites to a BooleanQuery when there are less than 16 terms. I'm a bit torn about such a rewrite rule since TermInSetQuery can't skip in the general case (postings are always entirely consumed), so rewriting to it solely based on the number of clauses feels a bit trappy to me. I think either users should opt in explicitly for it or we should find a heuristic that can detect when a TermInSetQuery will perform better than a BooleanQuery with good confidence, similarly to what we have when it comes to using doc values to execute ranges/geo queries?

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Adrien:

So how would you fix the design? Let's forget the foregoing discussion and start with a clean-slate. You obviously know lots more about Lucene than I do so what's solutions can you think of?

There are solid arguments why this limit is too low. Complexifying Lucene isn't ideal either. Whether it's making the limit configurable per-query is the right solution is obviously debatable.

So what would work? Having a static variable where setting it in one core affects others may have made sense N years ago, but now it's generating problems or we wouldn't be having this discussion. If you were writing this from scratch would you advocate the current design? If not, can you think of a different design that wouldn't have this limitation?

asfimport commented 7 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

we should find a heuristic that can detect when a TermInSetQuery will perform better than a BooleanQuery with good confidence

Exactly this! Which is what we don't have at the moment, we have an API that throws a RuntimeException instead...

What the heuristic should be is an interesting question in itself though. When is it cheaper to build a DocIdSet up-front? I guess if we know that all of the leaves of the disjunction have low costs? Could we add a scorerSupplier to BooleanWeight? Or does the bulk-scoring BooleanScorer actually cope with this anyway?

asfimport commented 7 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

If you were writing this from scratch would you advocate the current design?

I suspect I would not have added the limit, but probably more by lack of care than because I think it is not useful. Now it's here, and if it is common to deal with that large boolean queries then there is a question of whether Lucene is the right tool for the job since all the efforts we make at index-time in order to make reads sequential at search time and keep the search-time memory footprint low are pointless, especially if you're using NIOFSDirectory. So I'm on the fence about removing this setting or changing its default value. I agree the fact it is static is not nice, but it has the good side-effect that we do not have to set options eg. on query parsers in order to make them apply it.

Having a static variable where setting it in one core affects others may have made sense N years ago, but now it's generating problems or we wouldn't be having this discussion

It wouldn't be an issue if it was a cluster setting? What is the value of letting one user generate huge boolean queries and prevent others?

asfimport commented 7 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

bq: It wouldn't be an issue if it was a cluster setting? What is the value of letting one user generate huge boolean queries and prevent others?

This is where the focus on making it configurable per-query is confusing things a bit I think. Making it per-query just provides a way to get around the static nature of the variable and allow limits on a per-core basis is how I've been interpreting it. And why I suggested thinking of it from a clean-slate perspective.

Practically what I want to see is a way that each core can have a different setting that's respected. I don't care if it's per-query or per-core. In the Solr world at least it's becoming quite common for there to be replicas (cores) from multiple collections hosted in the same JVM which may have very different usage patterns. It may make sense that one collection has a much different limit than the others in that case, which we can't do now.

And it's quite easy to blow past the 1024 limit. It doesn't even take a horribly complex query when you start spreading it across 100 fields.