forcedotcom / phoenix

BSD 3-Clause "New" or "Revised" License
558 stars 227 forks source link

Spool distinct GROUP BY map to disk if too big #500

Closed jtaylor-sfdc closed 10 years ago

jtaylor-sfdc commented 11 years ago

When a GROUP BY is done over a very large set of distinct values, we'll end up throwing a InsufficientMemoryException if the in-memory map gets too large. Instead, we should spool the map to disk in a way that still makes searching it efficient.

The relevant code is in com.salesforce.phoenix.coprocessor.GroupedAggregateRegionObserver, line 237:

                        aggregateMap.put(key, rowAggregators);

Note that this map is kept per regions worth of data, not across the entire data set, but a single region can have a lot of data too (10-100M rows).

jtaylor-sfdc commented 10 years ago

Implemented by Marcel. Awesome job!