NationalSecurityAgency / datawave

DataWave is an ingest/query framework that leverages Apache Accumulo to provide fast, secure data access.
https://code.nsa.gov/datawave
Apache License 2.0

Duplicate results possible #564

Open apmoriarty opened 5 years ago

apmoriarty commented 5 years ago

It is possible to return duplicate results to the user when the shardsPerDayThreshold property is set.

This was discovered when using the collapseUids option, which rolls up all document ranges into shard ranges. Shard ranges are returned from the index to the query iterator until the shardsPerDayThreshold is triggered. At that point a day range is sent to the query iterator, so the shards already covered by the previous shard ranges are searched again.

For example, if a query hits on shards 20190314_0 through 20190314_9 and the shardsPerDayThreshold is set to 5, the shard ranges 20190314_0 through 20190314_4 are searched first, followed by the day range 20190314, which covers all ten shards. A user would see duplicate results for shards 0-4.
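To make the overlap concrete, here is a minimal toy simulation of the behavior described above. It is only a sketch under stated assumptions, not the actual RangeStreamScanner code: the class `ShardRollupSketch` and the methods `emitRanges` and `scanCounts` are hypothetical names, and the exact point at which the threshold triggers is assumed from the example (five shard ranges are sent before the day range).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of the rollup; not DataWave code.
class ShardRollupSketch {

    // Emit shard ranges until `threshold` of them have been sent,
    // then emit a single day range and stop (assumed trigger point).
    static List<String> emitRanges(String day, int shardsWithHits, int threshold) {
        List<String> ranges = new ArrayList<>();
        for (int shard = 0; shard < shardsWithHits; shard++) {
            if (shard >= threshold) {
                ranges.add(day); // day range covers every shard of that day
                break;
            }
            ranges.add(day + "_" + shard);
        }
        return ranges;
    }

    // Count how many times each shard is scanned by the emitted ranges.
    static int[] scanCounts(List<String> ranges, String day, int shardsWithHits) {
        int[] counts = new int[shardsWithHits];
        for (String range : ranges) {
            for (int shard = 0; shard < shardsWithHits; shard++) {
                boolean isDayRange = range.equals(day);
                boolean isThisShard = range.equals(day + "_" + shard);
                if (isDayRange || isThisShard) {
                    counts[shard]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String day = "20190314";
        List<String> ranges = emitRanges(day, 10, 5);
        System.out.println("ranges: " + ranges);
        int[] counts = scanCounts(ranges, day, 10);
        for (int shard = 0; shard < counts.length; shard++) {
            System.out.println(day + "_" + shard + " scanned " + counts[shard] + " time(s)");
        }
    }
}
```

In this model, shards 0-4 are scanned twice and shards 5-9 once. Calling `emitRanges(day, 10, 11)` never emits the day range, so every shard is scanned exactly once.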

There are several ways to mitigate this issue.

  1. Set the shardsPerDayThreshold to be greater than the number of shards per day
  2. Set the eventPerDayThreshold to the maximum integer value

ivakegg commented 4 years ago

It is suggested that the code in the RangeStreamScanner that rolls up shard ranges to day ranges should be removed altogether.