apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.69k stars 1.04k forks source link

Integer overflow in total count in grouping results [LUCENE-9302] #10342

Open asfimport opened 5 years ago

asfimport commented 5 years ago

When doing a Grouping search in solr cloud you can get a negative number for the total found.

This is caused by the accumulated total being held in an integer and not a long.

 

example result:

"responseHeader": { "status": 0, "QTime": 9231, "params": { "q": "decade:200", "indent": "true", "fl": "decade", "wt": "json", "group.field": "decade", "group": "true", "_": "1542773674247" } }, "grouped": { "decade": { "matches": -629516788, "groups": [ { "groupValue": "200", "doclist": { "numFound": -629516788, "start": 0, "maxScore": 1.9315376, "docs": [ { "decade": "200" } ] } } ] } }}

 

result without grouping: "responseHeader": { "status": 0, "QTime": 1063, "params": { "q": "decade:200", "indent": "true", "fl": "decade", "wt": "json", "_": "1542773791855" } }, "response": { "numFound": 3665450508, "start": 0, "maxScore": 1.9315376, "docs": [ { "decade": "200" }, { "decade": "200" }, { "decade": "200" }, { "decade": "200" }, { "decade": "200" }, { "decade": "200" }, { "decade": "200" }, { "decade": "200" }, { "decade": "200" }, { "decade": "200" } ] }}


Migrated from LUCENE-9302 by Ian, 1 vote, updated Apr 06 2020 Attachments: SOLR-13004.patch (versions: 2) Linked issues:

asfimport commented 5 years ago

ruchir choudhry (migrated from JIRA)

Can you pls assign it to me, I can take this up.

Thanks -Ruchir

asfimport commented 5 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Ruchir:

Just go ahead and work on it, JIRAs can't be assigned to non-committers....

asfimport commented 4 years ago

Ishan Chattopadhyaya (@chatman) (migrated from JIRA)

This is something I want to fix. When matches (as collected from various shards) exceeds integer limit, it overflows.

At the heart of the implementation is a Lucene class for TopGroups object, which has totalHitCount as an integer. The merge method in this class adds up the counts from various shards into a total which overflows.

Also, totalGroupsCount can also be a number more than integer range and should be a long.

asfimport commented 4 years ago

Ishan Chattopadhyaya (@chatman) (migrated from JIRA)

Here's a WIP patch.

Given the change to TopGroups, should this be a LUCENE issue now? @jpountz, @dsmiley any thoughts, please?

asfimport commented 4 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I suggest moving to a Lucene issue since the core of the change is in Lucene. Its debatable if the Lucene side it s a bug or improvement; I tend to think the latter. But I can see from a Solr user perspective, this shows as a bug. Not a big deal any way.

asfimport commented 4 years ago

Ishan Chattopadhyaya (@chatman) (migrated from JIRA)

Thanks David, moved to a Lucene issue now. Would like some early feedback on the patch here. Mainly curious whether there's some reason why the totalHitCounts is an integer, not a long? Given that the merge() method's javadocs say the following, I feel the total hits across all shards can legitimately overflow the integer range (this is happening in our production cluster).

 

/** Merges an array of TopGroups, for example obtained
 *  from the second-pass collector across multiple
 *  shards.  Each TopGroups must have been sorted by the
 *  same groupSort and docSort, and the top groups passed
 *  to all second-pass collectors must be the same.
 *
 * <b>NOTE</b>: We can't always compute an exact totalGroupCount.
 * Documents belonging to a group may occur on more than
 * one shard and thus the merged totalGroupCount can be
 * higher than the actual totalGroupCount. In this case the
 * totalGroupCount represents a upper bound. If the documents
 * of one group do only reside in one shard then the
 * totalGroupCount is exact.
 *
 * <b>NOTE</b>: the topDocs in each GroupDocs is actually
 * an instance of TopDocsAndShards
 */
public static <T> TopGroups<T> merge(TopGroups<T>[] shardGroups, Sort groupSort, Sort docSort, int docOffset, int docTopN, ScoreMergeMode scoreMergeMode) {
asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

The in-progress patch looks good. I'm not familiar with how things work in Solr when the client and server are on different versions or when not all nodes of the cluster are on the same version, does the code need to be able to handle the case when the hit count is still returned as an integer by another node?

asfimport commented 4 years ago

Ishan Chattopadhyaya (@chatman) (migrated from JIRA)

The in-progress patch looks good. I'm not familiar with how things work in Solr when the client and server are on different versions or when not all nodes of the cluster are on the same version, does the code need to be able to handle the case when the hit count is still returned as an integer by another node?

Thanks for your review, Adrien. Right, need to add backcompat handling (wondering how ugly that could be; I hope not). Also, handle the totalGroupCount to be a long.

Shall update the patch soon.

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Hopefully replacing "(Long) foo" with "((Number) foo).longValue()" would work?

asfimport commented 4 years ago

Ishan Chattopadhyaya (@chatman) (migrated from JIRA)

Opened SOLR-14381 to tackle the Solr changes. Will keep this issue focused on the Lucene change.

chatman commented 2 years ago

I plan to update this PR (https://github.com/apache/lucene-solr/pull/1410) to merge against main branch shortly.