apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Raise maxClauseCount in BooleanQuery to Integer.MAX_VALUE [LUCENE-4835] #5900

Open asfimport opened 11 years ago

asfimport commented 11 years ago

Discussion on SOLR-4586 raised the idea of raising the limit on boolean clauses from 1024 to Integer.MAX_VALUE. This should be a safe change. It will change the nature of help requests from "Why can't I do 2000 clauses?" to "Why is my 5000-clause query slow?"


Migrated from LUCENE-4835 by Shawn Heisey (@elyograg), 1 vote, updated Jun 17 2017

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'm not sure it should be Integer.MAX_VALUE. You can't even create arrays this big with current JVMs. This wouldn't be a safe change. It would change the nature of help requests from "why did I get a TooManyClauses exception" to "why did I get a super-strange exception: is this a bug?"
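The array-size point can be checked directly. The following is a minimal sketch and is not taken from Lucene; the class and method names are hypothetical. On HotSpot-based JVMs, a request for an array of length Integer.MAX_VALUE typically fails with an OutOfMemoryError even when the heap is large, because the VM's maximum array length sits slightly below that value. Other JVMs may behave differently, so the sketch reports the outcome instead of assuming it.

```java
class MaxArraySketch {
    /** Tries to allocate a byte array of the given length and reports the outcome as text. */
    public static String tryAllocate(int length) {
        try {
            byte[] data = new byte[length];
            return "allocated " + data.length + " bytes";
        } catch (OutOfMemoryError e) {
            // HotSpot usually reports "Requested array size exceeds VM limit" here,
            // but the message (and whether the allocation fails at all) is JVM-specific.
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryAllocate(1024));
        System.out.println(tryAllocate(Integer.MAX_VALUE));
    }
}
```

A limit of Integer.MAX_VALUE would therefore never be the thing that stops a runaway query; some allocation failure elsewhere would stop it first.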

asfimport commented 11 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Yeah, that comes from me - every time I use that as an example of what I'm meaning, I get in trouble :)

What I meant by it is that there should be no limit, not necessarily that that should be the limit in code.

asfimport commented 11 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> it would change the nature of help requests from "why did i get TooManyClauses exception" to "why did i get super-strange exception: is this a bug?"

It would remove even more of those silly "why did I get a TooManyClauses exception" questions that tons of people can still get at 1024 or whatever it is. How many people will be bitten by what you're talking about? That many explicit BQs? Well, that one-in-a-million guy will bring his exception to the list and mention, oh, I'm doing 5 billion boolean clauses or whatever.

This silly artificial limit hasn't even kept pace with hardware improvements over the years :) Not that it matters - it's arbitrary to begin with.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I was thinking about someone who has a bug in their code and accidentally keeps adding to the same BQ. I feel like I've done this writing tests before, probably multiple times. If instead my code seemed hung, only to finally get an OOM or some ArrayStoreException or something wacky in some strange place, it would take me longer to realize my mistake. :)
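The scenario described above can be sketched in a few lines. This is a hypothetical stand-in, not Lucene's BooleanQuery: the class `LimitedClauses`, its exception type, and the default of 1024 are assumptions chosen to mirror the idea behind Lucene 4.x's static `BooleanQuery.setMaxClauseCount` and its `TooManyClauses` exception. It shows how a limit turns a runaway loop into an immediate, clearly named failure at the offending `add` call.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a clause container; this is not Lucene's BooleanQuery. */
class LimitedClauses {
    /** Thrown when adding a clause would exceed the configured limit. */
    static class TooManyClausesException extends RuntimeException {
        TooManyClausesException(int max) {
            super("maxClauseCount is set to " + max);
        }
    }

    // Process-wide limit that tests or applications can change (assumed default: 1024).
    private static int maxClauseCount = 1024;

    private final List<String> clauses = new ArrayList<>();

    static void setMaxClauseCount(int max) {
        if (max < 1) {
            throw new IllegalArgumentException("maxClauseCount must be >= 1");
        }
        maxClauseCount = max;
    }

    void add(String clause) {
        // Fail fast, at the call site of the offending add.
        if (clauses.size() >= maxClauseCount) {
            throw new TooManyClausesException(maxClauseCount);
        }
        clauses.add(clause);
    }

    int size() {
        return clauses.size();
    }

    public static void main(String[] args) {
        LimitedClauses query = new LimitedClauses(); // bug: created once, never reset
        try {
            for (int i = 0; ; i++) {
                query.add("field:term" + i); // runaway loop keeps adding to the same query
            }
        } catch (TooManyClausesException e) {
            System.out.println("stopped at " + query.size() + " clauses: " + e.getMessage());
        }
    }
}
```

Without the check in `add`, the same loop would keep growing the list until the JVM ran out of memory, which is the slower and less obvious failure mode the comment is worried about.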

asfimport commented 11 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Even if that is really a concern (which for me it wouldn't be until I started seeing the reports), a huge limit would still be much better. There is not really anything special about the fairly low current value.

I'm not so worried about this type of thing... you might think that you want every hit back no matter what, ask for the top 1 million hits, require a huge priority queue, blow out your RAM, and oh how confusing... but people seem to get by without us throwing arbitrary exceptions.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

It's not really a huge concern. I'd like for there still to be a limit, e.g. one that I could set in LuceneTestCase or even in Solr to prevent tripping myself up, even if it has a different default value.

asfimport commented 11 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Alright, well that's fine with me - as long as the limit is not so silly low, I'll at least be happier. I'd be happiest to have the whole concept go away, but I'll compromise to just suffocate it a bit.

asfimport commented 11 years ago

Shawn Heisey (@elyograg) (migrated from JIRA)

Robert's concern didn't occur to me at all, because I reside on the Solr side of the fence. One approach that might satisfy both sides: 1) Be conservative in Lucene by leaving the value at 1024 or increasing it to something that would stress modern hardware. 2) In Solr, explicitly set it to Integer.MAX_VALUE, or REALLY_BIG_NUMBER so it's less likely to cause overflow problems. If REALLY_BIG_NUMBER is chosen, we'd probably have to leave maxBooleanClauses parsing in, but it could be reduced to a commented section in the example config and could be left out of most of the test configs.

A Lucene user, because they are already in the 'custom code' realm, can increase the value if they need to, and Solr (which deals in query strings rather than complex objects) would effectively have the limit removed.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Move issue to Lucene 4.9.

asfimport commented 9 years ago

Shawn Heisey (@elyograg) (migrated from JIRA)

I think the discussion here does leave room for an increase in the default maxBooleanClauses value, just not to Integer.MAX_VALUE. Rob's objections to that setting do have technical merit. My initial WAG as to a new value is 16384 ... that would satisfy the requirements of every situation I've actually seen when Solr users must increase the value, but it would still be low enough to catch seriously abnormal code/config.

I'm still pursuing SOLR-4586 to remove the limit entirely in Solr, though if we increase the default in Lucene, the default in Solr should also get a bump.

asfimport commented 9 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

If there is to be an arbitrary limit, I think it should be much lower, not higher. That way poor people may be more likely to hit it in testing rather than in production as their system grows.

But really, I disagree with having any arbitrary limit. The performance curve as one adds terms is nice and smooth. Adding in an arbitrary limit creates a bug in working code (your system suddenly stops working when you cross a threshold), to try and prevent a hypothetical code bug ("someone who has a bug in their code and accidentally keeps adding to the same BQ").

But this hypothetical code bug of continuously adding to the same BQ would lead to either an OOM error, or an array store error, etc., basically something that would be caught at test time. And really, there are hundreds of places in code where you can accidentally continuously add to the same data structure... ArrayList, StringBuilder, etc. It would be horrible to have arbitrary limits for all of these things.

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I whole-heartedly agree with Yonik's opinion. I simultaneously had the idea of making the limit much lower. How about 64?

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

-1 to lowering the limit in Lucene, just because you guys have sour grapes about a Solr issue.