TIBCOSoftware / snappydata

Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster
http://www.snappydata.io

hashAggregates: group by sum decimal issue #1554

Open ssathi opened 3 years ago

ssathi commented 3 years ago

Aggregating negative decimal values produces incorrect results.

table:

| charge_item_name (string) | charge_type (string) | charge_amount (decimal(16,2)) |
| --- | --- | --- |
| WHT | tax | -300.00 |
| WHT | tax | -300.00 |
| WHT | tax | -300.00 |
| WHT | tax | -240.00 |
| WHT | tax | -300.00 |
| WHT | tax | -300.00 |
| WHT | tax | -300.00 |
| WHT | tax | -300.00 |
| WHT | tax | -300.00 |
| WHT | tax | -300.00 |
| WHT | tax | -300.00 |
| WHT | tax | -200.00 |
| WHT | tax | -300.00 |
| WHT | tax | -300.00 |

Spark SQL query:

SELECT itemchargingtype, itemdescription, count(*) AS `count`, sum(itemamount) AS `sum`
FROM `transaction_items`
WHERE itemdescription = 'WHT'
GROUP BY itemdescription, itemchargingtype
result:

| charge_type | charge_item_name | count | sum |
| --- | --- | --- | --- |
| tax | WHT | 14 | 1,202.88 |
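
For reference, a self-contained reproduction sketch. Column names follow the query (the sample data above uses the display names charge_item_name/charge_type/charge_amount); the DDL, including the USING column clause, is an assumption rather than the reporter's actual schema:

```sql
-- Hypothetical repro sketch: the DDL details are assumptions
CREATE TABLE transaction_items (
  itemdescription  STRING,
  itemchargingtype STRING,
  itemamount       DECIMAL(16,2)
) USING column;

-- 12 rows of -300.00 plus one -240.00 and one -200.00, matching the data above
INSERT INTO transaction_items VALUES
  ('WHT', 'tax', -300.00), ('WHT', 'tax', -300.00), ('WHT', 'tax', -300.00),
  ('WHT', 'tax', -240.00), ('WHT', 'tax', -300.00), ('WHT', 'tax', -300.00),
  ('WHT', 'tax', -300.00), ('WHT', 'tax', -300.00), ('WHT', 'tax', -300.00),
  ('WHT', 'tax', -300.00), ('WHT', 'tax', -300.00), ('WHT', 'tax', -200.00),
  ('WHT', 'tax', -300.00), ('WHT', 'tax', -300.00);

-- Expected: count = 14, sum = -4040.00
SELECT itemchargingtype, itemdescription,
       count(*) AS `count`, sum(itemamount) AS `sum`
FROM transaction_items
WHERE itemdescription = 'WHT'
GROUP BY itemdescription, itemchargingtype;
```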

After setting the session properties snappydata.sql.hashAggregateSize=-1 and snappydata.sql.useOptimizedHashAggregateForSingleKey=false, it produces the correct values (the corresponding SET statements are sketched after the result below).

result:

| charge_type | charge_item_name | count | sum |
| --- | --- | --- | --- |
| tax | WHT | 14 | -4040.00 |
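
As a sketch, the two properties applied at the session level (this form works in the snappy-sql shell or anywhere Spark SQL SET statements are accepted):

```sql
-- Session-level settings from the workaround above
set snappydata.sql.hashAggregateSize=-1;
set snappydata.sql.useOptimizedHashAggregateForSingleKey=false;
```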
sumwale commented 3 years ago

Thanks for reporting.

Tested this, and it looks to be a bug in the new BufferHashMap-based implementation that reduces memory overhead for large aggregates and DISTINCT. For now you can switch to the older implementation with "set snappydata.sql.optimizedHashAggregate=false", which is as fast as (and in many cases faster than) the newer one, though it may fail for very large aggregation/DISTINCT results. If this works for your use cases, it is much better than turning hash aggregation off completely.
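
A sketch of the suggested switch applied at the session level (again via the snappy-sql shell or an equivalent Spark SQL session):

```sql
-- Fall back to the older hash aggregate implementation, as suggested above;
-- unlike the hashAggregateSize=-1 workaround, hash aggregation stays enabled
set snappydata.sql.optimizedHashAggregate=false;
```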