forcedotcom / phoenix

BSD 3-Clause "New" or "Revised" License
558 stars 227 forks source link

Spillable GroupBy implementation, initial commit #627

Closed kutschm closed 10 years ago

kutschm commented 10 years ago

Spillable GroupBy:

The main entry point is in GroupedAggregateRegionObserver. It instantiates a GroupByCache and invokes a get() method on it. There is no: "if key not exists -> put into map" case, since the cache is a Loading cache and therefore handles the put under the covers. I tried to implement the final cache element accesses (RegionScanner below) streaming, i.e. there is just an iterator on it and removed the existing result materialization.

GroupByCache implements a Guava LoadingCache, which an upper and lower configurable size limit. Optimally it is sized as estMapSize / valueSize, since the upper limit is number and not memory budget based. As long as no eviction happens no spillable data structures are allocated, this only happens as soon as the first element is evicted from the cache. We cannot really make any assumptions on which keys arrive at the map, but I thought the LRU would at least cover the cases where some keys have a slight skew and they should stay memory resident.

Once a key gets evicted, the spillManager is instantiated. It basically takes care of spilling an element to disk and does all the SERDE work. It creates a bunch of SpillFiles (spill partition) which are MemoryMappedFiles. Each MMFile only works with up to 2GB of spilled data, therefore the SpillManager keeps a list of these and hash distributes the keys within this list. Once an element gets spilled, it is serialized and will only get deserialized again, when it is requested from the client, i.e. loaded back into the LRU cache. The SpillManager holds a SpillMap for every spill partition (SpillFile). Each SpillMap has access to all the pages of its SpillFile, it also contains a list of bloomFilters, one for every page of the spillFile. Only a single page, the currently active page stays in memory. The in memory data structure of the page is a HashMap for easy key access purposes. An element evicted form the LRU cache is hashed into the correct SpillMap. the spillMap then determines via the list of BloomFilters on which page in the SpillFile the key resides and loads this page into a HashMap in memory. If the key was never spilled before, the SpillMap tries to fill up the current, in memory residing page with this new key. In case it doesn't fit, the current memory page is flushed to disk and a new page is requested. The key is then written to the new in-memory page. Lastly, the bloomFilter is updated so that this key can be discovered again. Loading a key works similarly, if not present in the LRU cache, the CacheLoader sets in. The key gets hashed into the correct SpillMap, the list of bloomFilters is walked to determined the SpillFile page, the key resides on and this page is loaded first into memory and eventually, into the LRU cache. Only for the last step the deserialization is triggered. The aggregators are returned from the LRU cache and the next value is computed. In case the key is not found on any page, the Loader create new aggregators for it.

TODOs: --> Init a newly deserialized Aggregator with its previous value. I added an init function to the interface and receives the client aggs and init itself form this. This could go into a constructor. Also I just did this for two aggregators so far, not sure if this is the way you want it to be.

--> Error handling, right now I mostly re-throw local errors as unchecked RuntimeExceptions. From my experience, even these are not always surfaced, this needs some rework I think.

--> Page defragmentation might happen as spilling occurs. The code tries to fill up the current in memory page. If space is exhausted it simply asks for a new page. Maybe a model that searches for a page that could store this additional element instead of requesting a complete new one, would be better. This would just increase the memory footprint a bit.

-> Tuning knobs that impact performance and memory footprint:

marcel

jtaylor-sfdc commented 10 years ago

Wow, this is great, @kutschm ! How about adding that big comment above to the javadoc on the top level class?