When a GROUP BY is done over a very large set of distinct values, we'll end up throwing an InsufficientMemoryException if the in-memory map gets too large. Instead, we should spool the map to disk in a way that still keeps lookups against it efficient.
The relevant code is in com.salesforce.phoenix.coprocessor.GroupedAggregateRegionObserver, line 237:
aggregateMap.put(key, rowAggregators);
Note that this map is kept per region's worth of data, not across the entire data set, but a single region can still hold a lot of data too (10-100M rows). A rough sketch of the spooling idea follows below.
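
The sketch below is only an illustration of the spool-to-disk approach, not Phoenix's actual implementation. The class name SpillableGroupByMap, the entry-count threshold, and the use of byte[] for the serialized aggregator state (standing in for rowAggregators) are all assumptions; it uses only standard JDK I/O. Each time the in-memory map hits the threshold it is written out as a key-sorted run, so lookups can stop early when they pass the sought key.

import java.io.*;
import java.nio.file.*;
import java.util.*;

// Hypothetical sketch: an aggregation map that spills sorted runs to disk
// once the in-memory portion exceeds a configurable number of entries.
public class SpillableGroupByMap implements Closeable {

    private final int maxInMemoryEntries;                      // assumed spill threshold
    private final TreeMap<String, byte[]> memory = new TreeMap<>();
    private final List<Path> spillFiles = new ArrayList<>();   // sorted runs, oldest first

    public SpillableGroupByMap(int maxInMemoryEntries) {
        this.maxInMemoryEntries = maxInMemoryEntries;
    }

    // Insert or replace the serialized aggregate state for a group key.
    public void put(String key, byte[] aggregateState) throws IOException {
        memory.put(key, aggregateState);
        if (memory.size() >= maxInMemoryEntries) {
            spill();
        }
    }

    // Look up a key: check memory first, then scan spill files newest-first.
    public byte[] get(String key) throws IOException {
        byte[] inMemory = memory.get(key);
        if (inMemory != null) {
            return inMemory;
        }
        for (int i = spillFiles.size() - 1; i >= 0; i--) {
            byte[] found = searchSpillFile(spillFiles.get(i), key);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Write the in-memory map to a sorted run on disk and clear it.
    private void spill() throws IOException {
        Path file = Files.createTempFile("groupby-spill", ".bin");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(memory.size());
            // TreeMap iterates in key order, so each run is already sorted.
            for (Map.Entry<String, byte[]> e : memory.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeInt(e.getValue().length);
                out.write(e.getValue());
            }
        }
        spillFiles.add(file);
        memory.clear();
    }

    // Sequentially scan one sorted run; a real version would index or binary-search it.
    private byte[] searchSpillFile(Path file, String key) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String k = in.readUTF();
                byte[] v = new byte[in.readInt()];
                in.readFully(v);
                if (k.equals(key)) {
                    return v;
                }
                if (k.compareTo(key) > 0) {
                    return null; // run is sorted, so the key cannot appear later
                }
            }
        }
        return null;
    }

    @Override
    public void close() throws IOException {
        for (Path file : spillFiles) {
            Files.deleteIfExists(file);
        }
    }
}

A production version would likely merge runs or keep a small per-run index rather than scanning each run linearly, but the key point is that sorted runs keep disk-backed lookups from degrading to a full scan of all spilled data.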