eclipse / omr

Eclipse OMR™ Cross platform components for building reliable, high performance language runtimes
http://www.eclipse.org/omr

A nursery evacuation algorithm with improved throughput and localization of references #2361

Closed ktbriggs closed 5 years ago

ktbriggs commented 6 years ago

[TL;DR] [See also #2664]

[Note (March 27 2019): Evacuator development has been capped. I am leaving this comment intact for archaeological reasons and have provided an updated overview of evacuator design, with discussion of the benchmarking results reported in #2664, in the subsequent comment below.]


I'm putting a new nursery evacuation algorithm for generational garbage collection forward for review and comments from the OMR community. Below is a sketch of evacuator design, intended to promote discussion within the OMR community. Existing source code, written offline to verify the evacuator design and to demonstrate evacuator simplicity and performance, will be made available soon, pending resolution of concerns relating to introducing a nearly completed work into OMR and OpenJ9. I welcome constructive and pragmatic critique and recommendations.

Background

Evacuator was developed off-line as a proof of concept to verify an idea relating to the direction of copy during generational collections, i.e., the movement of objects from evacuation space to survivor and tenure regions. At bottom, generational collection is a straightforward process that consumes most of its processing time in a single method, memcpy(). Specifically, evacuator selectively filters out large (>256 byte) objects for breadth-first (long distance between referring pointer and referent object) copying and limits hierarchical (copying the referent object close to the referring pointer) copying to small objects, whereby the likelihood of hardware cache-line containment of referring pointer and referent head is increased for small objects. The likelihood of cache-line containment for reference arcs decreases with increasing distance between the object being scanned for referent objects in evacuation space and the next location available for receiving copied referents in survivor or tenure space. Scavenger's aliasing algorithm attempts to minimize this distance by opportunistically switching alias between survivor and tenure copy/scan caches, but this provides a limited degree of freedom for control and, with large copy/scan caches, frequently diverges to scan-copy distances that do nothing to improve locality of reference while also inhibiting production of shareable work between gc threads.

With current scavenger, these effects can be seen by limiting the minimum and maximum sizes for copy/scan caches (i.e., setting -XXgc:scanCacheMinimumSize = X = -XXgc:scanCacheMaximumSize) and measuring the proportions of objects copied within-cache and within page (C<P), in cache but not in page (C>P), and out of cache (C>C) using 8/8K and 32/32K min/max cache sizes. Also measured were CPU utilization, the number of cache allocations, and the number of bytes scanned between alias switching. Benchmarking on a 128-core Power8 machine (limited to 8 CPUs for the runs) produced these results:

[8/8K] (C<P; C>P; C>C) = (0.507; 0.110; 0.382) (92% thread utilization, 348 cache allocations/ms, average 309 bytes scanned between alias switching)

[32/32K] (C<P; C>P; C>C) = (0.379; 0.149; 0.472) (88% thread utilization, 118 cache allocations/ms, average 547 bytes scanned between alias switching)

The 8/8K runs had better thread utilization but much lower throughput than the 32/32K runs, probably because more time was spent waiting on the allocation locks and managing the increased volume of active scan caches. The 8/8K runs were also limited by more frequent context-switching overhead associated with switching alias between caches. They achieved a higher proportion of in-page copy operations because there were more copy caches available to copy into and the small cache size limited useless aliasing that was in cache but out of page.

Another limiting factor in current scavenger design is the tight coupling of scan and copy concerns within a single data structure, the copy/scan cache (MM_CopyScanCache). On one hand, the main concern with regard to copying is the allocation of whitespace -- that is, allocation of survivor or tenure memory to copy evacuated material into. On the other hand, the main scanning concern is to obtain copied material to scan. Tight coupling of these concerns within the MM_CopyScanCache structure prevents whitespace allocation sizes from being determined independently of work packet size. Copying favors the allocation of large address ranges to copy into, except at the end of each heap scan where excessive allocations leave substantial fragments that are then inaccessible to the application (macro fragmentation). Scanning also favors large address ranges, but needs to be adjusted dynamically as required to ensure adequate distribution of workload between gc threads throughout the gc cycle. Current scavenger design conflates these concerns within the copy/scan cache structure, with the result that a substantial amount of whitespace is bound up and inaccessible within copy/scan caches that are held on the global work queues/lists that hold the current backlog of scan work. The copy/scan cache structure is also overloaded with a passive object scanner instance that consumes additional space within queued copy/scan caches, which increases the amount of system memory that must be consumed to accommodate the backlog of scan work.

Evacuator design aims to ameliorate some of these concerns by replacing the aliasing algorithm with a stack-based algorithm that limits close copying to small objects and constrains the maximum distance between scan and copy head for close copying to <~4K. The threshold for small object size and the constraint for maximum close copy distance are controllable parameters that can be adjusted dynamically in response to related, instrumentable runtime factors. Allocation sizes for copy concerns and per-thread thresholds for releasing scan work to be available for consumption by other gc threads are also controllable and can be adjusted dynamically and independently per gc thread. Instrumentation, primarily relating to the rate of copy production relative to the rate of scanning progress and the presence of stalling gc threads, produces runtime feedback that can be used to alter the controllable parameters to obtain even and continuous distribution of work and reduce fragmentation of survivor and tenure space.

Evacuator Design

A sketch of evacuator design is provided here as an assist to reviewers and to developers who might be engaged in further development, if warranted. In this presentation, the terms 'inside copy' and 'outside copy' refer to copy that is directed within a scan stack frame (analogous to scavenger's aliased copy) and copy that is directed to outside (survivor or tenure) copyspaces (analogous to breadth-first copy in scavenger), respectively.

Main classes and structures

An instance of an evacuator controller class MM_EvacuatorController instantiates and facilitates the inter-working of a number of instances of an evacuator class MM_Evacuator. The controller class subclasses MM_Collector and is itself subclassed by a modified version of MM_Scavenger that has been relieved of most of the work descending from MM_Scavenger::workThreadGarbageCollect(), which is replaced by MM_Evacuator::workThreadGarbageCollect(). Other classes and methods that have been modified to accommodate this basic change include:

MM_EnvironmentStandard: includes a pointer to the MM_Evacuator that is bound to the gc thread

MM_ParallelScavengeTask: bind/unbinds MM_Evacuator instance to/from a gc thread in its run() method

MM_ScavengerStats: added 2 arrays _small_object_counts[] and _large_object_counts[] to record histograms of small and large object sizes

Net new participating classes are described here from the bottom up.

MM_EvacuatorBase

This class provides cross-cutting debugging and logging/tracing support and defines static constant values for fundamental controllable parameters that have fixed values in the proof-of-concept evacuator implementation. If evacuator development moves forward, many of these values will become dynamic and controllable at runtime. The tracing/debugging features enabled by this class are presently disabled for production builds but may be enabled if desired for developer builds.

low_work_volume: minimum volume of work available in gc thread worklist. Lower volumes trigger clipping of whitespace (TLH) allocation sizes.

copy_scan_top_rating: another throttling factor in determining whitespace allocation sizes. This represents a threshold value for the ratio of aggregate copied and scanned byte counts, which is the aggregate rate of copy production since the start of the gc cycle. Lower rates will further reduce whitespace allocation sizes and work release thresholds.

min_tlh_allocation_size, max_tlh_allocation_size: upper and lower bounds for whitespace allocation sizes are analogous to scavenger's min/max cache sizes and are immutable over the lifetime of the OMR VM.

max_scanspace_remainder, max_copyspace_remainder: maximum amount of trailing whitespace that can be released from an otherwise full scanspace or copyspace. The former is currently fixed at 32 bytes, the latter at 768 bytes. Remainders released from scanspaces are always discarded, while copyspace fragments >32 bytes in size are retained on a whitelist for reuse as stack whitespace. These values limit the amount of fragmented whitespace released from the scan stack or outside (survivor/tenure) copyspaces.

max_scan_stack_depth: number of stack frames (scanspaces) available on the scan stack, currently fixed at 32 frames. The scan stack is bounded, limited to at most this many frames, but can be further limited to force more copy to be directed to outside copyspaces. In particular, limiting the number of stack frames to 1 forces all copy to be flushed to outside (breadth-first) copyspaces always. In the presence of stalled gc threads, running gc threads can limit stack depth to the current popped frame to force flushing of the next frame only to outside copyspaces -- a response similar to, but more controllable and immediate than, scavenger's strategy of inhibiting aliasing in the presence of stalled gc threads.

max_inside_object_size: the maximum size of an object that will be admitted for inside copying, that is, for copying within the scanspace of the current stack frame as opposed to copying to an outside (survivor or tenure) copyspace. In the current implementation of evacuator this is a fixed value, but it can easily be made dynamic. One application of dynamic inside copy sizing is to reduce the inside copy size for applications that involve high proportions of small objects. For example, the benchmark referenced in Background above involves ~97% small (<=256 byte) objects and produces relatively little shareable work in its outside copyspaces. Reducing the small object size threshold to 128 bytes for that benchmark could produce a more even distribution of work while increasing locality of reference for objects copied inside stack frames.

inside_copy_size: width of a stack frame, currently fixed at 4K. Inside copying proceeds within a stack frame until the copy head reaches or exceeds this distance from the base of the scanspace contained within the frame, after which remaining object slots between the scan head and copy head must be pushed up to the next frame. This value can also be changed dynamically, which will permit, for example, quick resolution of remainder work in outside copyspaces when a gc thread's worklist runs dry. In that case, the remainder work can be more quickly dispatched by increasing this value in conjunction with increasing the maximum inside copy size to force all remaining copy to be directed inside the stack. Without this feature, clearing incomplete outside copyspaces tends to produce thrashing between stack and outside copyspaces. This is especially problematic in applications that involve long linked lists with substantially sized nodes.

max_whitelist: the maximum number of whitespace fragments that can be retained on a whitelist. There is a whitelist for each of survivor and tenure spaces. Each whitelist is a priority queue of whitespace fragments that have been released from the tails of outside copyspaces, always presenting the largest fragment on top. Fragments <32 bytes in size are immediately discarded (filled with holes to keep the heap walkable). At the end of each gc cycle the survivor whitelist discards all of its remaining whitespace fragments; the tenure whitelist is retained between back-to-back generational gc cycles and is flushed, discarding all of its whitespace, before each global gc cycle.
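For orientation, the fixed proof-of-concept values above might be grouped roughly as in the following sketch. This is illustrative only, not the actual MM_EvacuatorBase declaration; the min/max TLH allocation sizes are omitted because their default values are not quoted here.

```cpp
#include <cstddef>

// Illustrative grouping of the fixed proof-of-concept parameters described
// above (not the actual MM_EvacuatorBase source).
struct EvacuatorConstants {
    static const std::size_t max_scanspace_remainder = 32;   // bytes; smaller scanspace tails are discarded
    static const std::size_t max_copyspace_remainder = 768;  // bytes; copyspace tails up to this size go to a whitelist
    static const std::size_t max_scan_stack_depth = 32;      // frames; a depth of 1 forces breadth-first copying
    static const std::size_t max_inside_object_size = 256;   // bytes; larger objects are always copied outside
    static const std::size_t inside_copy_size = 4096;        // bytes; stack frame width for inside copying
    static const std::size_t max_whitelist = 15;             // whitespace fragments retained per whitelist
};
```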

MM_EvacuatorHistory

This class is a bit of a hangover from an artifact introduced to MM_Scavenger when it was refactored into OMR. At the start of each generational gc cycle the controller makes a guess at how much live material will be evacuated and determines a reporting threshold Tr by dividing that volume by (64 * #gc threads). Each gc thread maintains a count of the number of bytes it has copied since start of gc or last report, and if either count exceeds Tr it reports its counters to the controller and resets them for the next reporting cycle. The controller maintains aggregate sums for bytes copied and bytes scanned and uses these to track the aggregate rate of copy production, which is a factor in determining whitespace allocation sizes. Evacuator threads also consider this factor when determining when to release work from their outside copyspaces.

While only aggregate copied/scanned byte counts are used in the present proof-of-concept version of evacuator, a number of useful statistics are available from this reporting model. The aggregate ratio of copied bytes to scanned bytes Cg/Sg is infinite during breadth-first root and remembered set scanning and converges to unity at the end of each heap scan, so values close to unity provide an indication that the gc cycle may be coming to an end and whitespace allocations and work release thresholds should be trimmed accordingly. Future development of within-cycle instrumentation will focus on delta values for aggregate and per-threads copied/scanned byte counts, which will enable calculation of temporally localized aggregate and per-thread copy production rates. By comparing current delta thread rate C't/S't versus delta aggregate rate C'g/S'g it will be possible to identify threads that are lagging behind or surging ahead, and this will allow work release thresholds and whitespace allocation sizes to be computed individually for each thread.
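A minimal sketch of this reporting model follows, assuming the threshold formula described above (estimated evacuation volume divided by 64 times the gc thread count). Class and member names are illustrative, not from the OMR source.

```cpp
#include <atomic>
#include <cstdint>
#include <limits>

// Hedged sketch of the copy/scan progress reporting described above.
struct ProgressTracker {
    std::atomic<uint64_t> aggregateCopied{0};
    std::atomic<uint64_t> aggregateScanned{0};
    uint64_t reportingThreshold; // Tr = estimated evacuation volume / (64 * #gc threads)

    ProgressTracker(uint64_t estimatedVolume, unsigned gcThreads)
        : reportingThreshold(estimatedVolume / (64 * gcThreads)) {}

    // A gc thread reports and resets its local counters when either exceeds Tr.
    void report(uint64_t copiedDelta, uint64_t scannedDelta) {
        aggregateCopied += copiedDelta;
        aggregateScanned += scannedDelta;
    }

    // Cg/Sg: infinite during breadth-first root and remembered set scanning,
    // converging to unity at the end of each heap scan.
    double copyProductionRate() const {
        uint64_t scanned = aggregateScanned.load();
        if (0 == scanned) { return std::numeric_limits<double>::infinity(); }
        return (double)aggregateCopied.load() / (double)scanned;
    }
};
```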

MM_EvacuatorWorklist

This class consists of a linked list of MM_EvacuatorWorkPacket structures, each comprised of a base pointer pointing to the head of a contiguous range of copied bytes that have yet to be scanned, a value representing the length in bytes of the contained scan work, and a pointer to the next work packet on the list. Each evacuator thread maintains its own worklist, although it may pull work packets from the worklists of other threads if its own list runs dry. The controller serves as a cyclic bus to allow evacuator threads to sample the volume of work on other evacuator worklists; threads that are in danger of stalling will locate the thread with the greatest volume of work and pull work from that thread's worklist until the stalling thread has at least as much work as the contributing thread.

Memory for worklist items is obtained from chunks of system memory allocated by evacuator threads as required. These chunks are partitioned into a collection of empty MM_EvacuatorWorkPacket items that are then maintained on a per-thread MM_EvacuatorFreelist. Each evacuator thread pulls items from its freelist as required to populate its worklist. Presently these chunks of system memory are retained per thread between gc cycles; future work will enable the controller to collect these at the end of each gc cycle and redistribute them as required during the next cycle. Also deferred for future work, if warranted, is limiting the amount of system memory available for this purpose and failing over to allocation from the heap when this limit is exceeded, analogously to the manner in which MM_Scavenger fails over to heap allocation when its pool of MM_ScanCopyCacheChunk items underflows.
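The work packet and freelist shapes described above reduce to something like the following sketch; field names follow the prose and the actual OMR declarations differ.

```cpp
#include <cstddef>
#include <cstdint>

// Hedged sketch of MM_EvacuatorWorkPacket and its per-thread free list.
struct WorkPacket {
    uint8_t *base;       // head of a contiguous range of copied, unscanned bytes
    std::size_t length;  // volume of scan work in the packet, in bytes
    WorkPacket *next;    // linkage for worklist and freelist
};

struct Freelist {
    WorkPacket *head = nullptr;

    // Partition a chunk of system memory into empty packets and chain them.
    void refill(void *chunk, std::size_t chunkBytes) {
        WorkPacket *packets = static_cast<WorkPacket *>(chunk);
        for (std::size_t i = 0; i < chunkBytes / sizeof(WorkPacket); i += 1) {
            packets[i].next = head;
            head = &packets[i];
        }
    }

    // Pull an empty packet; a nullptr result prompts allocation of a new chunk.
    WorkPacket *pull() {
        WorkPacket *packet = head;
        if (nullptr != packet) { head = packet->next; }
        return packet;
    }
};
```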

MM_EvacuatorWhitelist

Whitespace that is allocated to the scan stack for inside copying is constrained to leave at most 32 bytes of trailing whitespace before it is renewed. This is accomplished by directing object copy outside, regardless of object size, whenever a small object is too large to fit in the remaining stack whitespace. When <32 bytes remain, the remainder is discarded (filled with holes). Outside copyspaces are similarly constrained, but the constraint requires only that <768 bytes of whitespace remain before a new whitespace allocation can be attempted to replenish the copyspace. Objects that are too large to fit in a copyspace with >768 bytes of remaining whitespace are accommodated by making a size-specific allocation to hold the single object, until smaller objects arrive to fill the copyspace past the 768-byte remainder threshold. At that point the remaining whitespace of up to 768 bytes is added to a whitelist for consumption by the scan stack.

Each thread maintains two whitelists (survivor and tenure) to contain fragments of whitespace trimmed from completed copyspaces. Each whitelist is structured as a priority queue, presenting the largest whitespace fragment at the head of the list. Whenever the scan stack runs out of whitespace for inside copy it will pull the largest item from its whitelist before attempting to allocate more whitespace from the heap. In this way the scan stack burns down these fragments over time. Results from all application drivers tested to date indicate that this is very effective in reducing fragmentation within gc cycles. Whitelists have a bounded capacity (15 items); at capacity, the smallest items are discarded to accommodate larger items as they arrive. Items <32 bytes in size are always discarded.
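The whitelist behavior described above amounts to a small bounded max-heap. A sketch under those assumptions follows (the fill-with-holes discard step is elided, and names are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hedged sketch of a whitelist: a bounded max-heap of whitespace fragments,
// largest on top, smallest evicted at capacity, tiny fragments discarded.
struct Whitespace { uint8_t *base; std::size_t length; };

struct Whitelist {
    static const std::size_t capacity = 15;
    static const std::size_t minFragment = 32; // bytes; smaller fragments are discarded
    Whitespace heap[capacity];
    std::size_t size = 0;

    static bool less(const Whitespace &a, const Whitespace &b) { return a.length < b.length; }

    void add(Whitespace fragment) {
        if (fragment.length < minFragment) { return; } // discard (filled with holes in the real thing)
        if (size == capacity) {
            // evict the smallest retained fragment if the newcomer is larger
            std::size_t smallest = std::min_element(heap, heap + size, less) - heap;
            if (heap[smallest].length >= fragment.length) { return; }
            heap[smallest] = fragment;
            std::make_heap(heap, heap + size, less);
            return;
        }
        heap[size++] = fragment;
        std::push_heap(heap, heap + size, less);
    }

    // Pull the largest fragment, e.g., to refresh stack whitespace.
    bool pull(Whitespace &out) {
        if (0 == size) { return false; }
        std::pop_heap(heap, heap + size, less);
        out = heap[--size];
        return true;
    }
};
```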

The table below shows fragmentation resulting from the first generational gc cycle in a trial run of evacuator with a different benchmark (all values hexadecimal) with 48 gc threads. Micro-fragmentation represents the volume of whitespace discarded from whitelists (survivor + tenure) during the course of the cycle, macro-fragmentation represents the volume retained in stack whitespace and survivor copyspaces at the end of the cycle. In this instance, micro-fragmentation amounts to about 0.015% of the total volume copied/scanned, macro-fragmentation amounts to about 0.180%.

| bytes copied | bytes scanned | micro | macro |
| --- | --- | --- | --- |
| 41b46820 | 41b46820 | 29780 | 1e65d0 |

Macro-fragmentation in this case is higher than it would be if delta copy production rates were used to limit whitespace allocation sizes, as mentioned above, because the aggregate copy production rate includes information from the beginning of the gc cycle and is relatively insensitive to temporally localized changes such as occur at the end of each gc cycle as evacuator threads encounter progressively more instances of objects that have already been copied by other threads. This will be more carefully tuned in future work.

Remainder whitespace in the survivor whitelists is discarded at the end of each cycle; this memory is then unavailable for application object allocation until the next cycle. Remainder whitespace in the tenure whitelists is retained between generational gc cycles and is available for use at the beginning of the next generational cycle; tenure whitespace is discarded only at the commencement of the next global gc cycle.

MM_EvacuatorCopyspace

Each evacuator thread maintains three instances of this class, one to receive large objects directed to survivor space, one for tenure space, and one for large (>768 bytes) objects that cannot be accommodated in the remaining whitespace of the intended copyspace. Whitespace for the outside copyspaces is allocated as required, subject to the remainder size constraint noted above; allocation sizes are determined as a function of aggregate copy production rate and the number of stalled evacuator threads at the time of allocation. Whitespace for the large object copyspace is allocated to fit a single object entirely. The large object copyspace does not retain any remainder whitespace and is converted to a work packet containing a single object as soon as copying is complete.

Each copyspace is a simple structure/class comprised of a base pointer, copy head pointer, and an end pointer, together with a boolean value to indicate whether the contained whitespace was allocated from the heap's large object area (LOA) or not.

MM_EvacuatorScanspace

Each evacuator maintains a stack of scanspace elements, each of which constitutes a single stack frame. This class subclasses MM_EvacuatorCopyspace and extends it with a scan head pointer and a limit pointer and space to instantiate in place an object scanner. The limit pointer is set at the next frame boundary (next address >base address and congruent to 0 mod 4096) and determines the width of each stack frame. When an object is pushed onto a new stack frame, the scan head is set to the base pointer and the copy head is set to the first byte after the pushed object. The limit is set to point to the next frame boundary. As the initial pushed object is scanned, small dependent referents that are to be copied to the same (survivor or tenure) space as the pushed object are copied to the copy head; large objects, and all objects that are to be copied to the other (tenure or survivor) space, are directed to outside copyspaces. This process proceeds until the scan head reaches the copy head and the scanspace is popped off the stack, or the copy head reaches or passes the limit, at which point copy from all remaining object slots between scan and copy heads must be pushed up the stack.

The scan stack is bounded, containing 32 frames, and is allocated on the runtime stack of the evacuator thread in MM_Evacuator::workThreadGarbageCollect(). In practice the stack limit is seldom reached except in the context of long linked list-like structures. When the stack limit is reached, and further pushing is no longer possible, scanned slot objects are flushed to outside copyspaces until the stack pops again to free up a frame for pushing new objects.
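Taken together, the copyspace and scanspace shapes described in the two sections above reduce approximately to the following sketch. The in-place object scanner and LOA bookkeeping are elided, and names are illustrative rather than the actual OMR declarations.

```cpp
#include <cstdint>

// Hedged sketch of the MM_EvacuatorCopyspace and MM_EvacuatorScanspace layouts.
struct EvacuatorCopyspace {
    uint8_t *base; // start of contained material
    uint8_t *copy; // copy head: next byte to receive copied material
    uint8_t *end;  // end of contained whitespace
    bool isLOA;    // whitespace was allocated from the heap's large object area
};

struct EvacuatorScanspace : EvacuatorCopyspace {
    uint8_t *scan;  // scan head: next copied object to be scanned
    uint8_t *limit; // frame boundary: next address > base and 0 mod 4096

    bool scanComplete() const { return scan >= copy; }       // frame can be popped
    bool mustPushRemainder() const { return copy >= limit; } // remaining slots push up the stack
};
```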

Scan stack geometry (frame width, stack limit) is flexible and can be adjusted at any time. This enables evacuator instances to, for example, force a greater proportion of copy to outside copyspaces to increase the volume of shareable work produced by the thread by decreasing the stack limit (stack height) in response to indications of stalled threads. This effect can also be achieved by reducing the threshold for small objects from 256 bytes to, say, 128 bytes -- a technique that could enhance work sharing in applications that involve very high proportions of small objects. In another case, when an evacuator's worklist runs dry and the only available work is remainder copy in outside copyspaces, it may be necessary to repeatedly return to the outside copyspaces as the only sources of work before the outside copyspaces are completely cleared, which must be done before the evacuator thread can obtain work from other threads. This case becomes especially problematic when one of the remainder copyspaces contains a node from a long linked list of objects. This situation, which arises frequently in some benchmarks, can be ameliorated by maximizing stack width and small object size threshold, thereby forcing most objects to be copied inside the stack and reducing thrashing between scan stack and outside copyspaces.

MM_Evacuator

This class is the workhorse of evacuator design and implementation. Each gc thread is bound at the commencement of a generational gc cycle to an instance of this class. These instances are passivated and maintained by the controller between cycles. The main method embodied in this class is workThreadGarbageCollect(), which directs the activity of each evacuator thread through the root and remembered set scanning, heap scanning, and clearing stages of each generational gc cycle. At a very high level of description, the workflow of each evacuator thread can be represented by a simple regular expression in terms of the main methods of MM_Evacuator:

```
scanRemembered scanRoots scanHeap scanComplete (scanClearable scanHeap scanComplete)*
```

This workflow is presented in more detail below.
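Assuming the phase methods named above, the regular expression unfolds roughly as this stubbed sketch; the loop over clearing phases is an assumption about how scanClearable() signals remaining clearable work.

```cpp
// Stubs standing in for the MM_Evacuator phase methods named above.
void scanRemembered() {} // evacuate nursery referents of remembered tenure objects
void scanRoots() {}      // evacuate referents of root objects
void scanHeap() {}       // recursively scan copied material until worklists drain
void scanComplete() {}   // reconcile copied/scanned byte counts, trim whitespace
bool scanClearable() { return false; } // delegate evacuates clearable roots, if any

// scanRemembered scanRoots scanHeap scanComplete (scanClearable scanHeap scanComplete)*
void workThreadGarbageCollect() {
    scanRemembered();
    scanRoots();
    scanHeap();
    scanComplete();
    while (scanClearable()) {
        scanHeap();
        scanComplete();
    }
}
```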

MM_EvacuatorController

The controller is responsible for instantiating and passivating MM_Evacuator instances as required, and provides a framework for cooperative inter-working of and load balancing between evacuator instances during generational gc cycles. In the current proof-of-concept implementation, MM_EvacuatorController subclasses MM_Collector and is itself subclassed by a reduced version of MM_Scavenger. This arrangement allows the evacuator design to fit easily within the existing collector configuration framework, at least as far as gencon gc policy is concerned, and is flexible enough to allow evacuator to be deployed with minimal refactoring in other generational configurations such as the balanced gc policy.

An array of MM_Evacuator instances is maintained by the controller. Evacuators are instantiated once and maintained over the lifetime of the controller. Two bitmaps are maintained to track evacuator threads that are stalled (waiting for work) and threads that are resuming from the stalled state to continue with found work or complete the heap scan. (Note: while the bitmaps are limited to 64 bits/threads, OpenJ9 Java currently permits a maximum of 64 gc threads. If this restriction is removed, the bitmap model can easily be extended to employ an array of bitmaps). Access to these bitmaps is regulated by a controller mutex, which is taken by any evacuator that is looking for work after exhausting its own worklist and outside copyspaces. If no work is found, the evacuator waits on the controller mutex to be notified when any other evacuator places work on its own worklist or all evacuators exhaust their worklists, completing the heap scan.
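A minimal sketch of the stalled/resuming bitmaps, assuming 64-bit maps indexed by evacuator number; the controller mutex and wait/notify choreography are elided.

```cpp
#include <cstdint>

// Hedged sketch of the controller's stall-tracking bitmaps. All access
// occurs while holding the controller mutex.
struct StallBitmaps {
    uint64_t stalled = 0;  // bit n set: evacuator n has no work and is waiting
    uint64_t resuming = 0; // bit n set: evacuator n found work and is leaving the stall loop

    void setStalled(unsigned n) { stalled |= ((uint64_t)1 << n); }
    void setResuming(unsigned n) { resuming |= ((uint64_t)1 << n); }
    void clear(unsigned n) { stalled &= ~((uint64_t)1 << n); resuming &= ~((uint64_t)1 << n); }

    bool anyStalled() const { return 0 != stalled; }

    // The heap scan completes when every bound evacuator is stalled and none are resuming.
    bool scanComplete(uint64_t boundEvacuators) const {
        return (stalled == boundEvacuators) && (0 == resuming);
    }
};
```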

Two allocation limits are maintained for each of survivor and tenure space. One of these limits the maximal size of TLH whitespace allocations for outside (to copyspace) and inside (stack scanspace) object copy, the other limits the size of whitespace allocations for individual large objects. These limits decrease in response to allocation failures and are intended to prevent repeated allocation requests that are sure to fail.

The controller sets the reporting threshold (in terms of bytes copied or scanned) for the gc cycle and maintains two counters to track the aggregate number of bytes copied and bytes scanned over the course of the gc cycle. It also maintains counters to record the number of bytes allocated but discarded during the gc cycle (micro-fragmentation) and at the end of the cycle (macro-fragmentation).

Language-specific concerns

As with other OMR GC components, evacuator design calls for a language-specific delegate class to handle generic concerns that require implementation within the client language runtime. The most important of these concerns is the identification of root objects held by the language runtime at the commencement of each generational cycle. Other language-specific functions include (not an exhaustive list) construction of object scanners, which identify object slots that contain references to other heap objects, identification of clearable objects during the clearing phase, and generic informational methods that signal the start and end of a gc cycle.

In the proof-of-concept implementation of evacuator for OpenJ9, a straw-man MM_EvacuatorDelegate class provides a facade for existing Java artifacts that implement root scanning and clearing: MM_ScavengerRootScanner and MM_ScavengerRootClearer. These classes have been lightly refactored to redirect calls to MM_Scavenger object copying methods to equivalent methods expressed by MM_EvacuatorController.

Evacuation Workflow

At the start of a generational gc cycle the collector receives a call to MM_Scavenger::masterThreadGarbageCollect(), which in turn calls MM_Scavenger::scavenge() to marshal the gc threads to evacuate the live objects in evacuation space to survivor and tenure spaces. To accommodate the evacuator model, MM_Scavenger has been modified to call, before the scavenge() call, MM_EvacuatorController::prepareForEvacuation(), which determines the reporting threshold on the basis of an estimate of the volume of material to be evacuated and the number of gc threads, and sets the allocation limits to their initial values. In most cases, the initial TLH allocation sizes are determined by default or user-defined (-XXgc:scanCacheMaximumSize) values. For very small heaps, <2MB or <32MB, the initial TLH allocation limits are adjusted down to 25% or 50% of the default size, respectively.

The evacuator threads are started and bound to evacuator instances after the controller has completed preparing itself for the gc cycle. Each evacuator instance then enters MM_Evacuator::workThreadGarbageCollect() to complete its part in the gc cycle. This begins with root and remembered set scanning, which proceed in almost the same manner as in current scavenger, with small adjustments as required to fit into the evacuator model. During this phase, evacuated material is copied breadth-first into the evacuators' outside copyspaces, TLH whitespace allocation sizes are limited to twice the minimum TLH size, and the work release threshold is limited to the minimum TLH size in order to ensure that a larger number of work packets are available in evacuator worklists at the end of this phase. Whenever a copyspace accumulates copied material past the work release threshold, the completed span of copied material is stripped out of the copyspace into a work packet and placed on the evacuator's worklist and the copyspace is rebased to include only the remaining whitespace. Root and remembered set scanning is implemented in MM_Evacuator::scanRoots() and MM_Evacuator::scanRemembered().

After it completes root and remembered set scanning the evacuator enters its main heap scanning method MM_Evacuator::scanHeap(). Heap scanning proceeds by repeatedly pulling an item from the worklist (MM_Evacuator::loadWork()) and pushing it into the scanspace in the bottom frame of the scan stack (MM_Evacuator::push(MM_EvacuatorWorkPacket *)). As long as there is any work on the stack, MM_Evacuator::copyInside() is called to scan the objects in the scanspace contained in the current stack frame. For each sequential object slot that references evacuation space, the evacuator determines its destination space (survivor or tenure) on the basis of the object's age. Objects that are destined for the same space (survivor or tenure) as the material being scanned are copied hierarchically inside the current scanspace if they are not larger than the small object size threshold. Larger objects, and all objects destined for the other (tenure or survivor) space, are copied breadth-first to outside copyspaces.

Each scanspace on the stack is scanned in this manner until one of two conditions is realized. If the scan head reaches the copy head the scanspace has been completely scanned and can be popped from the scan stack. If the copy head reaches or passes the limit then further inside copying is prohibited and each object reference contained between the scan and copy heads must be pushed up the stack. In most contexts the maximal depth of the stack will not exceed a small number of frames, but in the presence of long linked structures it may grow up to the stack limit. In that event, the objects referenced from the top stack frame are flushed to outside copyspaces, regardless of size, until the stack pops and pushing can resume.
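Using the scanspace sketch above, the two terminating conditions and the inside/outside decision can be expressed as a small decision function; the size and destination tests here stand in for the evacuator's actual region and age checks.

```cpp
#include <cstddef>

// Hedged sketch of the per-slot decision made while scanning a stack frame.
enum class FrameAction { Pop, PushRemainder, CopyInside, CopyOutside };

FrameAction nextAction(const EvacuatorScanspace &frame, std::size_t objectSize,
        bool sameDestination /* referent evacuates to same space as frame contents */) {
    const std::size_t maxInsideObjectSize = 256; // small object threshold
    if (frame.scanComplete()) { return FrameAction::Pop; }                // scan head reached copy head
    if (frame.mustPushRemainder()) { return FrameAction::PushRemainder; } // copy head at/past frame limit
    if (sameDestination && (objectSize <= maxInsideObjectSize)) { return FrameAction::CopyInside; }
    return FrameAction::CopyOutside; // large objects and cross-space copy go to outside copyspaces
}
```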

The presence of stalled evacuator threads indicates that all evacuator worklists are empty. Each time the scan stack is popped, the evacuator samples the controller's stalled evacuator bitmap. If there are any stalled threads, the evacuator will set its stack limit to point to the frame being popped, to force the remaining content of the next frame below to be flushed to outside copyspaces, thereby generating shareable work. The stalled bitmap is checked as each frame is flushed and popped until no evacuators are stalled or the flushing evacuator empties its stack and stalls. When the stalling condition is cleared, all running evacuators resume normal stack operation.

When the bottom frame is popped from the stack the evacuator strips any and all work in its outside copyspaces into work packets and puts them on its worklist to be loaded into the stack and processed as above. Both outside copyspaces must be cleared before the evacuator can look for work to pull from other evacuators. When both outside copyspaces have been cleared and its worklist is empty, the evacuator may take the controller mutex and scan the array of evacuators to locate the evacuator holding the greatest volume of work, pulling work from that evacuator's worklist until the two have approximately equal volumes of work. It will then release the controller mutex and either continue with found work or, if no work was found, stall and wait to be notified of available work or completion of the main heap scanning phase. If it finds work, its bit is set in the resuming evacuators bitmap to enable it to escape from the stall loop and continue scanning after loading the found work into its stack. The heap scan completes when the last running evacuator thread stalls, releasing all evacuator threads to continue to the clearing phase. At that point the aggregate copied and scanned byte counts are compared and the controller asserts if they are not equal (unless the scan was aborted and backout is required).

To begin the clearing phase, the evacuator delegate is queried to determine whether there are any clearable objects to be evacuated. Depending on the client language runtime, one or more clearing phases may be required. Each clearing phase proceeds similarly to the main phase. A set of clearable objects, identified by the evacuator delegate in its implementation of MM_EvacuatorDelegate::scanClearable(), are copied breadth first into evacuator copyspaces and worklists using the same object copying methods used during root and remembered set scanning. Objects in evacuation space reachable from these objects are then scanned in MM_Evacuator::scanHeap(). There is a slight wrinkle in the OpenJ9 Java implementation of the evacuator delegate, in that the delegate is required to call scanHeap() directly. This mode of interaction is deprecated in the evacuator model. The recommended approach is to have the evacuator delegate simply identify and copy the initial set of clearable objects in each clearing phase, in its scanClearable() implementation. In MM_Evacuator::workThreadGarbageCollect() the requirement for dependent scanning in scanHeap() is determined by comparing the aggregate copied and scanned byte counts after each call to scanClearable().

Evacuators attempt to limit whitespace allocation sizes when they sense that the heap scan is close to completion. They are able to determine this somewhat reliably on the basis of the aggregate rate of copy production, which is sharply reduced toward the end of the heap scan as evacuator threads encounter object slots referencing objects that have already been copied, and other factors. More work is planned in this context, as applications that typically present very small live sets to each generational gc cycle are relatively insensitive to changes in the aggregate rate of copy production.

ktbriggs commented 6 years ago

This was getting a bit stale. Rebased on eclipse/master and pushed.

@charliegracie can we get an update on this?

ktbriggs commented 5 years ago

[The notes in this comment supersede the original design documentation in the previous comment]

[See also #2664].


A new nursery evacuation algorithm for generational garbage collection has been developed and put forward for review and comments from the OMR community. Evacuator was developed iteratively and tested and benchmarked after significant milestones to verify design and to demonstrate its simplicity and superior performance relative to scavenger, the production generational garbage collector for OpenJ9 Java. Below is an updated outline of evacuator design and a review of benchmarking results obtained over the course of development, intended to promote discussion within the OMR community.

Evacuator and scavenger are implemented in OMR and consumed by OpenJ9 Java. Stable source code for evacuator is being maintained in the evacuator branch here: https://github.com/ktbriggs/omr/tree/evacuator. Source for my work-in-progress branch evacuator-wip, which is used for benchmarking, is here: https://github.com/ktbriggs/omr/tree/evacuator-wip. Artifacts required to integrate evacuator with OpenJ9 Java are to be found here (stable): https://github.com/ktbriggs/openj9/tree/evacuator and here (work in progress): https://github.com/ktbriggs/openj9/tree/evacuator-wip.

At time of writing the only difference between OMR evacuator and evacuator-wip branches is in EvacuatorBase.hpp:

```diff
-#undef EVACUATOR_DEBUG_ALWAYS
+#define EVACUATOR_DEBUG_ALWAYS
```

The EVACUATOR_DEBUG_ALWAYS macro enables instrumentation of scavenger and evacuator and summary reporting of instrumentation metrics after each nursery collection. These metrics are output to stderr and are the source of the benchmarking results reported in #2664 and below (see Benchmarking).

To build a particular Java version with evacuator using the stable branch, go to https://github.com/eclipse/openj9/tree/master/doc/build-instructions and select the build instructions for the version that you wish to build. In the build instructions, substitute

```
bash ./get_source.sh -openj9-repo=git@github.com:ktbriggs/openj9.git -openj9-branch=evacuator -omr-repo=git@github.com:ktbriggs/omr.git -omr-branch=evacuator
```

for

```
bash ./get_source.sh
```

and you should be good to go; the rest of the build instructions should apply verbatim. It may be necessary to rebase the OMR and OpenJ9 evacuator branches if they haven't been rebased recently. In that case, you can rebase the OMR evacuator branch onto the openj9 branch of the eclipse/openj9-omr repo and the OpenJ9 evacuator branch onto the master branch of the eclipse/openj9 repo. If you encounter any difficulty feel free to contact me in the OMR workspace on slack.

To obtain source for an instrumented build use get_source.sh as above, replacing evacuator with evacuator-wip.

To run OpenJ9 Java with evacuator replacing scavenger as the nursery collection algorithm, specify -Xgc:recursiveScanordering and evacuator will run with default operational parameter values. The operational parameters that can be selected for evacuator are described in the design notes below.

I welcome constructive and pragmatic critique and recommendations and would be pleased to guide source code reviews for interested parties.

Background

The standard OMR generational collector, scavenger, is the default and generally the most performant of all OMR/OpenJ9 collectors. Evacuator was developed off-line as proof of concept to verify an idea relating to the direction of copy during generational collections, that is, the movement of objects from evacuation space to survivor and tenure regions.

At bottom the generational collection task is a straightforward process whereby all live objects in the evacuation semispace of the nursery are evacuated to survivor semispace of the nursery (young objects) or to tenure space (old objects), as illustrated below. This is achieved by first pausing all mutator threads and evacuating a root set of objects obtained from mutator thread stacks and on-heap class and system metadata and then recursively scanning each evacuated object to locate references to objects remaining in evacuation space and evacuate the referent objects.

[figure: live objects evacuated from the evacuation semispace to the survivor semispace (young objects) and tenure space (old objects)]

Scavenger uses a common data structure called a scancache that identifies a contiguous block of RAM ranging from (default values) 8K to 128K in size. Each scancache maintains base and end pointers as well as a scan head and a copy head. Evacuated objects are copied sequentially into the scancache until it becomes too full to accommodate additional objects. When a scancache is completed it is then enqueued for scanning, which entails iterating over reference-valued fields within each copied object to locate references to other objects requiring evacuation and evacuating these. Thus the scancache represents for scavenger both the unit of survivor/tenure memory allocation for copying and the unit of parallelization for scanning.

Each scavenger thread maintains three scancaches -- one of these is the active scancache that is currently being scanned. The others are survivor and tenure scancaches that receive objects that are being evacuated as the scan proceeds. In this breadth-first modality, scavenger achieves excellent object localization since the referents associated with any evacuated object are almost always copied sequentially into a contiguous address range, which tends to reduce hardware L3 cache misses and provides a bit of a boost to mutator performance.

Around 2005, a new hierarchical scanning modality was introduced in scavenger. With hierarchical scanning, scanning and copying occur within the same scancache. This technique is referred to as aliasing, and is intended to increase the likelihood that referent objects are copied close to the referring pointer. This is particularly advantageous for small objects containing a single referent pointer, such as Java String, which spans only 16 bytes and contains a pointer to a char[] array. Aliasing benefits the collector and the application when the referring pointer and copied referent are collocated within the same cache line, but is of doubtful value otherwise.

The likelihood of cache-line containment for reference arcs decreases with increasing distance between the scan and copy heads of an aliased scancache. To address this, scavenger's aliasing algorithm attempts to minimize this distance by opportunistically switching alias between survivor and tenure scancaches so that it is always aliased to the scancache with the least distance between scan and copy heads. This is effective as long as both survivor and tenure scancaches have available scan work and the scan-copy distances in both caches are not much greater than the size of a cache line, but most nursery collections run to completion tenuring a relatively small volume of material and, in any case, the scan-copy distances in active aliased scancaches grow quickly, especially in large scancaches.

Shortly after hierarchical scanning was implemented in scavenger it was noticed that it resulted in suboptimal parallelization. This occurs because scavenger threads that are scanning an aliased scancache consume (scan) their own work product (copied objects) and produce very little work that can be distributed to other threads. To ameliorate this effect, a throttle was introduced to inhibit aliasing whenever one or more scavenger threads were stalled and waiting for work. In that case, active threads would complete the current scancache if aliased but then be prevented from further aliasing and forced to operate in breadth-first mode as long as the stall condition remained. While this effectively improved parallelization, it reduced cache-line containment to such an extent that the percentage of objects evacuated with cache-line containment is typically <10%.

With current scavenger, these effects can be seen by limiting the minimum and maximum sizes for scancaches (setting -XXgc:scanCacheMinimumSize = X = -XXgc:scanCacheMaximumSize) and measuring the proportions of objects copied within 4kb (C<P) of referring object, more than 4kb from referring object (C>P), and to a different scancache (C>C) using 8/8K and 32/32K min/max scancache sizes. Also measured were CPU utilization, the number of survivor/tenure RAM allocations and the number of bytes scanned between alias switching. Benchmarking on a 128-core Power8 machine (limited to 8 cores for these runs) produced these results:

[8/8K] (C<P; C>P; C>C) = (0.507; 0.110; 0.382) (92% CPU utilization, 348 cache allocations/ms, average 309 bytes scanned between alias switching)

[32/32K] (C<P; C>P; C>C) = (0.379; 0.149; 0.472) (88% CPU utilization, 118 cache allocations/ms, average 547 bytes scanned between alias switching)

The 8/8K runs had better CPU utilization but much lower throughput than the 32/32K runs, probably because more time was spent waiting on the allocation locks and managing the increased volume of active scan caches. The 8/8K runs were also limited by more frequent context-switching overhead associated with switching alias between caches. They achieved a higher proportion of copies within 4kb of the referring object because there were more copy caches available to copy into and the small cache size limited useless copying with >4kb scan-copy distance within aliased scancaches.

Another limiting factor in current scavenger design is the tight coupling of scan and copy concerns within the scancache structure. On one hand, the main concern with regard to copying is the allocation of whitespace -- that is, allocation of survivor or tenure memory to copy evacuated material into. On the other hand, the main scanning concern is to obtain copied material to scan. Tight coupling of these concerns within the scancache structure prevents whitespace allocation sizes from being determined independently of work packet size. Copying favors the allocation of large address ranges to copy into, except at the end of each heap scan where large allocations leave substantial unused fragments that are then inaccessible to the application (macro fragmentation). Scanning also favors large address ranges, but needs to be adjusted dynamically as required to ensure adequate distribution of work between scavenger threads throughout the gc cycle. Because these concerns are conflated within the scancache structure, scavenger must reduce allocation size in order to reduce work packet size and produce distributable work units as quickly as possible.

The scancache structure is also overloaded with a passive object scanner instance that consumes additional space (~160 bytes) within queued scancaches, which increases the amount of system memory that must be consumed to accommodate the backlog of scan work.

Evacuator Design

Evacuator design aims to address some of these concerns by replacing the aliasing algorithm with a stack-based algorithm that limits close copying to objects of size not greater than the maximum inside copy size (default 256 bytes) and constrains the maximum distance between frame base and copy head for inside copying (default 4096 bytes). The thresholds for small object size and maximum close copy distance are controllable parameters that can be preset through command-line options and/or adjusted dynamically in response to related runtime factors. Allocation sizes for copy concerns and per-thread thresholds for releasing scan work to be available for consumption by other gc threads are also controllable and can be adjusted dynamically and independently per evacuator thread. Instrumentation, primarily relating to the rate of copy production relative to the rate of scanning progress and the presence of stalling gc threads, produces runtime feedback that can be used to alter the controllable parameters to obtain even and continuous distribution of work and reduce fragmentation of survivor and tenure space.

A sketch of evacuator design is provided here as guidance for reviewers and developers who might be engaged in further development, if warranted. The core elements include a bounded scan stack of scanspaces wherein hierarchical copying of small objects is enabled and a pair of copyspaces that receive larger objects and objects flushed from the scan stack to produce distributable work. In this presentation, the term 'inside copy' refers to copy that is directed within a scan stack frame (analogous to scavenger's aliased scanning modality) and 'outside copy' refers to copy that is directed to outside (survivor or tenure) copyspaces (analogous to breadth-first copy in scavenger).

A third copyspace, called the large copyspace, receives objects that overflow either of the outside copyspaces. These objects are always >256 bytes in size, and are typically arrays. The contents of the large copyspace are always released to the worklist as soon as copy is complete, so the large copyspace is always effectively empty.

All scanning takes place within stack scanspaces. Normally, small objects are copied inside the scanspace being scanned and larger objects are directed to the outside copyspaces. Under certain conditions, evacuator threads may inhibit inside copying and direct all objects to outside copyspaces. These conditions include the presence of stalled evacuator threads that have run out of scan work, overflow of large objects into outside copyspaces that have reduced capacity to receive objects, and stack overflow. Scan work that accumulates in the outside copyspaces is released as workspaces and enqueued on evacuator worklists. A dynamic threshold for releasing work from outside copyspaces grows with increasing volume of worklisted workspaces.

Evacuator workflow

Evacuator uses a collection of classes and structures that encapsulate segments of contiguous survivor and tenure memory for specialized purposes. These are illustrated in the figure below. From the bottom up, whitespace objects have a base pointer and a length and represent memory that has been reserved but is not presently incorporated into a copyspace or scanspace. A copyspace has base and end pointers and a copy head and represents a span of whitespace that is actively receiving copied material. A scanspace extends a copyspace with a scan pointer and space for an object scanner (object scanner not shown in figure). A workspace has a base pointer, length, offset, and a next pointer for linkage into worklists and freelists. Scan work (light grey) and whitespace can be transferred easily between same or different kinds of spaces using simple pointer transforms.

[figure: whitespace, copyspace, scanspace, and workspace structures]

All scanning is performed within a bounded stack of frames, each comprised of a single scanspace. The number of frames is determined by a command-line option -XXgc:recursiveMaximumStackDepth. The default value, 32, is sufficient in most cases to contain non-recursing maximal chains in the object reference graph. Breadth-first scanning is obtained by setting the maximal stack depth to one. The width of each stack frame is specified by another command-line option -XXgc:recursiveMaximumInsideCopyDistance, with default value 4096 bytes, which determines the frame limit. Depth-first scanning is obtained when the frame width is set to zero. When scanning within a frame, referents within the object at the scanspace scan head are copied to the location pointed to by the scanspace copy head, much as for copying within scavenger's aliased scancache. However, inside copying is prohibited after the copy head meets or passes the frame limit. After the limit is reached, each referent object between the scan and copy head must be pushed up to the next stack frame or copied to an outside copyspace if the stack has reached maximal depth. Scanning and copying on the stack is illustrated in the figure below.

[figure: scanning and copying within scan stack frames]
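For example, breadth-first operation throughout a collection could be selected as shown below; this is an illustrative invocation only, and the application jar and the exact way these suboptions combine are assumptions, not taken from the build documentation.

```
java -Xgc:recursiveScanordering -XXgc:recursiveMaximumStackDepth=1 -jar app.jar
```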

Each evacuator thread maintains three outside copyspaces, one to receive objects being evacuated to survivor space, one to receive tenured objects, and one to receive large objects that overflow the intended (survivor or tenure) copyspace. Objects larger than the minimum copyspace size (default 8192 bytes) are always copied into the large object copyspace and immediately released as workspaces. The survivor and tenure copyspaces receive objects that cannot be copied inside stack frames, whether because they are too large or the stack has overflowed or all copy is being directed outside owing to a flush condition that has been raised. Flush conditions, which force all objects, small or large, to be copied outside, include stack overflow, copyspace overflow (i.e., remaining whitespace too small to receive an object but too large to discard), and the presence of stalled evacuators. Inside copying is permitted only when all flush conditions have been cleared.

The material that accumulates within outside copyspaces is scan work, and can be released as a workspace to be enqueued on the evacuator's worklist when the accumulated volume of work exceeds a work release threshold. This threshold is dynamically reduced to a minimal value in the presence of stalled evacuators. Otherwise its value is set to the current volume of work pending on the evacuator worklist, clipped if required to be constrained within minimum and maximum values. The default minimum work release threshold is 4096 bytes but this can be overridden using the -XXgc:recursiveWorkVolumeQuantum command-line option. The effective maximum work release threshold is 16 times the minimum quantum size. Any object copied into the large object copyspace is released immediately as one or more workspaces, and the large object copyspace is then cleared. Scalar objects and pointer arrays with <512 elements are released from the large object copyspace as a single workspace; large pointer arrays are segmented and released as multiple workspaces for concurrent scanning. Unlike scavenger, which splits pointer arrays serially as they are scanned, evacuator splits arrays into workspaces spanning up to 512 array elements immediately after copy, and these are all available for distribution and concurrent scanning after the large object copyspace is cleared.
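The release threshold logic described above might be sketched as follows, with the minimum quantum standing in for the -XXgc:recursiveWorkVolumeQuantum value (default 4096 bytes); the function shape is an assumption, not the OMR source.

```cpp
#include <algorithm>
#include <cstddef>

// Hedged sketch of the dynamic work release threshold for outside copyspaces.
std::size_t workReleaseThreshold(std::size_t worklistVolume, bool stalledEvacuators,
        std::size_t minQuantum = 4096) {
    // release work eagerly when other evacuators are starving
    if (stalledEvacuators) { return minQuantum; }
    // otherwise track the pending worklist volume, clipped to [quantum, 16 * quantum]
    return std::min(std::max(worklistVolume, minQuantum), 16 * minQuantum);
}
```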

The workspace is the unit of work distribution and parallelization. Each evacuator thread maintains a local FIFO worklist of workspaces as a backlog of scan work. Each worklist is protected by a mutex and an evacuator controller maintains a distribution bus that enables evacuators that have run out of scan work to probe and pull workspaces from other evacuator threads after taking exclusive control of the bus. The work distribution protocol selects the evacuator with the greatest volume of scan work on its worklist as the prime evacuator to pull work from and tries to take that evacuator's worklist mutex. If successful, the stalling evacuator thread will pull up to, but not more than, half of the total volume of work from the donor worklist -- this feature tends to level the workload across evacuator threads. If unable to take the worklist mutex for the prime evacuator, the stalling evacuator will walk around the bus at most once, attempting to take the worklist mutex from each active evacuator in turn until it receives some work or completes the circuit without work and stalls, waiting until an active evacuator adds work to its worklist and notifies the controller.
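The half-volume rule in the pull protocol amounts to a simple leveling computation, sketched here under the assumption that the receiving evacuator's worklist is empty or nearly so:

```cpp
#include <cstddef>

// Hedged sketch of how much work a stalling evacuator pulls from a donor:
// at most half of the combined volume, so both end with similar backlogs.
std::size_t volumeToPull(std::size_t donorVolume, std::size_t receiverVolume) {
    std::size_t level = (donorVolume + receiverVolume) / 2;
    return (donorVolume > level) ? (donorVolume - level) : 0;
}
```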

Evacuator threads are very conservative of whitespace and each evacuator maintains two whitelists to retain whitespace fragments released from the tail ends of survivor and tenure outside copyspaces that have been almost completely filled. Each whitelist is arranged as a priority queue with a capacity of 15 fragments that always presents the largest fragment at the head of the queue. These fragments are consumed by objects that overflow outside copyspaces and may also be used to refresh outside copyspaces and stack whitespace if larger than the minimal TLH size. Whitespace fragments <256 bytes in size are always discarded (filled with holes to keep the heap walkable).

For outside copyspaces, the threshold for trimming whitespace is 256 bytes -- if an object of size N > 256 bytes is presented to a copyspace that has <N bytes of whitespace remaining, the object is redirected into appropriate stack whitespace or, failing that, into the large object copyspace, and a flush condition is set to redirect all objects, small or large, to the overflowing outside copyspace. Additionally, an overflow threshold T is set at the minimum workspace size and subsequently reduced by the size of each object that overflows the copyspace. If T is reduced to 0, or if the copyspace fills and its whitespace remainder shrinks to <256 bytes before T is reduced to 0, the accumulated work in the copyspace is released as a workspace, the remaining whitespace is released to the appropriate whitelist, and the copyspace is refreshed with a new whitespace allocation.

Similarly, whitespace remainders from scanspaces are constrained to be <32 bytes in size. Each evacuator maintains two stack whitespace allocations, one for survivor and one for tenure. Stack whitespace is always pulled into the receiving scanspace before an object is pushed into the scanspace stack frame and remains attached to the spent frame when the stack is popped. Two scanspace pointers track the locations of the survivor and tenure stack whitespace. If the whitespace remainder in a scanspace is not <32 bytes, small objects that would normally be copied inside the scanspace are directed to the corresponding outside copyspace until one or more small objects received into the scanspace shrink the whitespace remainder to <32 bytes. The remainder stack whitespace in the exhausted scanspace is then discarded, the frame limit is set to the last position of the copy head, forcing a push for all referent objects remaining between the scan and end pointers, and the stack whitespace is refreshed on the next push.

Any whitespace remaining in evacuator scanspaces or copyspaces is trimmed to the respective whitelist when an evacuator thread completes at the end of a nursery collection. The contents of the survivor whitelist are then recycled back into the survivor semispace. The tenure whitelist contents are retained for use in the next nursery collection unless a global collection intervenes, in which case they are recycled back into tenure space before the global collection begins.

The figure below illustrates the evacuator workflows described here.

[figure: evacuator workflow diagram]

At the start of a nursery collection, all remembered objects in the tenure region that hold references to objects in the nursery are pushed individually onto the scan stack and recursively scanned. After remembered set scanning is complete, objects in evacuation space that are referenced from root objects are similarly pushed onto the stack and scanned. The outside copyspaces accumulate work during remembered and root set scanning, and workspaces released during this phase accumulate on evacuator worklists, building a backlog of distributable scan work before the main scanning phase. After remembered and root set scanning has been completed, evacuator threads enter the main scanning phase. During this phase the workflow protocol requires that each evacuator repeatedly pull onto the scan stack and scan all work accumulated in its outside copyspaces before attempting to pull work from its own worklist, and it may attempt to take control of the work distribution bus and pull work from other evacuators' worklists only after it has cleared both outside copyspaces and its own worklist is empty.

Active evacuators periodically report to the controller the number of bytes copied and the number scanned since the last report, and the controller aggregates this information to track the overall progress of the collection cycle. The nursery collection is complete when all evacuator instances have stalled. At that point, unless an abort condition has been raised, the controller will assert and fail if the aggregate number of bytes copied is not equal to the aggregate number of bytes scanned (this would signal a defect in the evacuation algorithm). An abort condition is raised when an evacuator thread has tried all possible means to allocate whitespace to satisfy a copy operation and failed, indicating that survivor and tenure space are exhausted and a global collection is required. An abort condition forces a backout (reversal) of the nursery collection after all evacuators have recognized the abort condition and stalled, and a global collection is triggered when the backout completes.

Following a successful nursery collection, the survivor semispace of the nursery is expanded into the now empty evacuation semispace. The two semispaces are then flipped so that the truncated evacuation space is reserved as the next survivor semispace and the expanded survivor space becomes the allocation space for the mutator, which resumes operation when the nursery collection completes.

Evacuator control

An evacuator controller MM_EvacuatorController maintains an array of pointers to evacuator MM_Evacuator instances. The size of the array is determined by the total number N of GC threads available in the GC thread pool when the controller is instantiated during runtime initialization, but no evacuator instances are created until the first nursery collection. At that time, the parallel dispatcher will dispatch some number T ≤ N of GC threads. Each GC thread runs the nursery collection task MM_EvacuatorParallelTask, which calls on the evacuator controller to bind an evacuator instance to the thread and encapsulate the thread environment MM_EnvironmentBase within the evacuator instance. The bound evacuator then proceeds to call its workThreadGarbageCollect() method to begin participation in the nursery collection cycle. The first evacuator instance that is bound will acquire the controller mutex and inform the controller of the number T of evacuator threads that will be activated for the collection, as this quantity is stable but unknown to the controller prior to this point. While it retains the controller mutex this evacuator may also force the controller to reduce the configured maximal thread local heap (TLH) whitespace size if the size of survivor space is very small. In that case the maximal TLH allocation size M is reduced to satisfy M < S/(4T), where S is the size of the survivor semispace.
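For example, a minimal sketch of this cap (hypothetical function name; taking the minimum approximates the strict bound):

```cpp
#include <cstddef>
#include <cstdio>

// Reduce the configured maximal TLH size so that M does not exceed S/(4*T);
// this approximates the strict bound M < S/(4T) described above.
size_t capTlhSize(size_t configuredMax, size_t survivorBytes, size_t threads)
{
    size_t bound = survivorBytes / (4 * threads);
    return (configuredMax < bound) ? configuredMax : bound;
}

int main()
{
    // a 4mb survivor semispace shared by 8 threads leaves a 128kb cap intact
    std::printf("%zu\n", capTlhSize(128 * 1024, 4 * 1024 * 1024, 8)); // 131072
    // a 1mb survivor semispace reduces the cap to 32kb
    std::printf("%zu\n", capTlhSize(128 * 1024, 1 * 1024 * 1024, 8)); // 32768
    return 0;
}
```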

During the nursery collection the controller mediates all whitespace allocations and maintains operational upper bounds on whitespace allocation sizes. The default configured maximum size for a TLH whitespace allocation is 128kb but this may be overridden on the command line using the -Xgc:tlhMaximumSize option. The minimum TLH whitespace allocation size is 1/16 of the maximum size. The operational upper bound is initially set to the configured maximum TLH allocation size but is decreased in stepwise increments if any TLH whitespace allocation request fails, in which case the allocation request is retried at the reduced upper bound. The operational upper bound is then applied to all TLH whitespace allocations for all evacuator threads for the remainder of the collection or until it is again lowered due to allocation failure.
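A minimal sketch of this stepwise backoff, with malloc standing in for the heap allocator and a hypothetical step size of one minimum TLH quantum:

```cpp
#include <cstddef>
#include <cstdlib>

static const size_t TLH_MAX = 128 * 1024;   // default -Xgc:tlhMaximumSize
static const size_t TLH_MIN = TLH_MAX / 16; // minimum TLH whitespace size

// Stand-in for the heap allocator, which may fail under memory pressure.
static void *tryAllocate(size_t size) { return std::malloc(size); }

// On failure, lower the operational upper bound one step and retry; the
// reduced bound then applies to all subsequent TLH whitespace allocations.
void *allocateTlhWhitespace(size_t &operationalMax)
{
    for (size_t size = operationalMax; size >= TLH_MIN; size -= TLH_MIN) {
        void *whitespace = tryAllocate(size);
        if (NULL != whitespace) {
            operationalMax = size; // sticky across all evacuator threads
            return whitespace;
        }
    }
    return NULL; // exhausted: allocation failed even at the minimum size
}
```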

Allocations of TLH whitespace from the tenure region for outside copyspaces or stack scanspaces are always attempted at the operational maximum size, which is reduced only when an allocation fails at the current operational maximum, because tenure whitespace remainders at the end of each nursery collection can be retained on evacuator whitelists and reused in the next nursery collection. Survivor whitespace cannot be retained between cycles, so survivor whitespace allocation sizes are reduced in nonlinear proportion to the number of stalled evacuator threads when the volume of material copied into survivor space is greater than 80% of the total size of the survivor semispace.

In addition to TLH whitespace allocations, the controller mediates evacuator requests for object whitespace allocations as required for objects that overflow outside copyspaces that have nondiscardable whitespace remainders too small to receive them. Operational limits are maintained for object allocations from tenure and survivor space and are set to the size of the most recent object allocation failure, if any, to prevent repeated attempts to allocate sizes that are sure to fail.

The controller maintains the work distribution bus, which is essentially a ring of active or stalled evacuator instances. Each evacuator instance exposes a volatile value representing the current volume of its worklist, which is the sum of sizes of workspaces enqueued for scanning. Any evacuator that has cleared its outside copyspaces and emptied its local worklist can acquire exclusive access to the bus by acquiring the controller mutex. The evacuator will then make a circuit over the bus to identify the active evacuator with the greatest volume of work, which is the prime donor. It will then try to enter the worklist mutex of the prime donor and, if successful, will pull up to half of the donor's volume of work onto its own worklist. If not able to obtain the prime donor's worklist mutex the evacuator will simply continue in order around the bus until it is able to pull work from another active evacuator, or complete the circuit without obtaining work and stall.
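A single-process sketch of the pull protocol (std::mutex and std::deque standing in for OMR monitors and worklists; names hypothetical, and the taker's own mutex elided for brevity):

```cpp
#include <cstddef>
#include <deque>
#include <mutex>

struct Workspace { size_t volume; };

struct EvacuatorWorklist {
    std::mutex mutex;
    std::deque<Workspace> queue;
    size_t volume; // sum of enqueued workspace volumes (exposed to the bus)
};

// Pull up to, but not more than, half of the donor's volume onto the
// taker's worklist; returns the volume pulled, or 0 if the donor's mutex
// could not be taken immediately (the taker then tries the next evacuator).
// Note: a single workspace larger than half the donor's volume would not be
// pulled here; a real implementation must handle that edge case.
size_t pullWork(EvacuatorWorklist &taker, EvacuatorWorklist &donor)
{
    if (!donor.mutex.try_lock()) {
        return 0;
    }
    size_t half = donor.volume / 2;
    size_t pulled = 0;
    while (!donor.queue.empty() && ((pulled + donor.queue.front().volume) <= half)) {
        Workspace w = donor.queue.front();
        donor.queue.pop_front();
        donor.volume -= w.volume;
        taker.queue.push_back(w);
        taker.volume += w.volume;
        pulled += w.volume;
    }
    donor.mutex.unlock();
    return pulled;
}
```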

Each evacuator maintains its own threshold for releasing workspaces from its outside copyspaces into its local worklist, where they become available for distribution to other evacuators. The threshold value is updated whenever the volume of work on the evacuator's worklist changes, and the value is determined by the controller. At present, the controller sets this threshold to the minimum workspace size (1/16th of -XXgc:recursiveMaximumWorkQuantumSize) if there are any stalled evacuators. Otherwise, the threshold is set to the current volume of work on the evacuator's worklist, clipped if required to lie within the acceptable range. In this way the threshold increases exponentially up to the maximum workspace size as long as the evacuator is accumulating work in its worklist.
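In sketch form (hypothetical names; 4096 again standing in for the minimum workspace size):

```cpp
#include <cstddef>

static const size_t MIN_RELEASE = 4096;              // minimum workspace size
static const size_t MAX_RELEASE = 16 * MIN_RELEASE;  // maximum workspace size

// Recompute the work release threshold whenever the worklist volume changes:
// minimal when any evacuator is stalled, otherwise tracking the worklist
// volume clipped into [MIN_RELEASE, MAX_RELEASE].
size_t workReleaseThreshold(size_t worklistVolume, bool anyEvacuatorStalled)
{
    if (anyEvacuatorStalled) {
        return MIN_RELEASE; // release small workspaces for distribution
    }
    if (worklistVolume < MIN_RELEASE) {
        return MIN_RELEASE;
    }
    if (worklistVolume > MAX_RELEASE) {
        return MAX_RELEASE;
    }
    return worklistVolume; // grows with the backlog, up to the maximum
}
```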

Active evacuators periodically report the number of bytes copied and the number of bytes scanned since the start of the nursery collection or the last report. Separate copied byte counts are maintained for survivor and tenure, while scanned bytes are summed into a single count. The granularity of the reporting interval is computed at the start of each nursery collection (when the first evacuator instance is bound to the collection) as S/(64T), where S is the size of the survivor semispace and T is the number of evacuator threads bound to the nursery collection. Evacuator instances report and reset their copied and scanned byte counts whenever the sum of copied bytes (survivor+tenure) or the number of scanned bytes exceeds the reporting threshold.
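In sketch form, under the same naming assumptions as the earlier sketches (not evacuator source):

```cpp
#include <cstddef>

// Interval between progress reports: S/(64*T), computed once per cycle.
size_t reportingInterval(size_t survivorBytes, size_t threads)
{
    return survivorBytes / (64 * threads);
}

// Report (and reset local counters) whenever copied (survivor + tenure) or
// scanned byte counts reach the reporting interval.
bool shouldReport(size_t copiedSurvivor, size_t copiedTenure, size_t scanned,
                  size_t interval)
{
    return ((copiedSurvivor + copiedTenure) >= interval) || (scanned >= interval);
}
```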

The reporting protocol was originally implemented to allow the controller to maintain a current view of the aggregate volume of unscanned work available in the system and use this information to anticipate the end of the nursery collection and reduce TLH whitespace sizes in advance of completion. This proved to be impractical because this metric does not distinguish unscanned work on evacuator stacks from unscanned work on worklists, and because most evacuator instances run with very low worklist volumes during the final stages of the nursery collection. These counters are retained because they provide a valuable check on the correctness of the evacuation algorithm -- the invariant Cs + Ct == S, where Cs and Ct are the aggregate numbers of bytes copied into survivor and tenure space and S is the aggregate number of bytes scanned, is asserted by the controller at the end of each successful nursery collection.

Benchmarking

The main metrics that evacuator is designed to improve are nursery collection throughput (kilobytes evacuated per millisecond, kb/ms) and percentages of objects evacuated with cache-line containment of referring pointer and referent head for 64 (x86_64), 128 (PowerPC), and 256 (Z-series) byte cache-line sizes. Other metrics of interest that were tracked for scavenger and/or evacuator during development included total and average duration of nursery collections, CPU utilization by GC threads, number and distribution of sizes of whitespace allocations, number and distribution of sizes (evacuator only) of workspaces pulled from worklists, and percentage of material copied and scanned inside stack scanspaces (evacuator only). Workspace size distribution is difficult to measure for scavenger as its scancaches can be deferred and pulled from scavenger worklists multiple times. This also precludes accurate measurement of the proportion of work that scavenger performs locally, that is, without drawing from worklists, which would be comparable to evacuator's measure of material copied and scanned inside stack scanspaces.

Instrumentation output

Instrumented evacuator builds produce summary metric data after each nursery collection. The format and content of this output are described here. Typical output looks like this:

   79      :    gc start; survivor{e4a00000 ea220000} tenure{5450000 b240000} evacuate{ea220000 0}; threads:8; projection:5820000; allocation:20000
   79      : contained; 197814 207737 270607 1679282
   79      : cachesize; 8 6 4 3 2 3 1 4 0 5 3 3 4 5 1 514 42e32a8
   79      :  worksize; 102 141 41 24 37 9 10 5 13 4 11 8 6 3 5 101 f317b0
   79      :     small; 0 641800 303234 55878 372088 52044 2730 210343 1203 1332 2151 2702 4224 7366 443 300 472 106 122 53 218 306 182 149 270 172 333 184 157 245 202 350 53025576
   79      :     large; 6340 448 10608 308 184 27 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15730808
   79      :     stack; 494829 247156 56497 13185 9896 8254 7051 6525 5543 4618 3944 3730 3040 2322 1855 1563 1396 1252 1083 1032 843 798 662 609 461 308 304 255 267 195 110 61 879644
   79      :    idle time; 318 8 8 1956 805 5601 36559
   79  0   : end cycle; 41923a0 41923a0 3260bf0 72d6b0 22a8 1d70 13cb00

gc start shows the heap region boundaries (truncated at 32-bits), number of participating GC threads (decimal), projected evacuation volume in bytes (hex), and TLH whitespace allocation upper bound.

contained shows the numbers of objects evacuated with 64-, 128-, and 256-byte cache-line containment of referring pointer and referent, followed by the total number of objects evacuated (all values decimal).

cachesize shows the number (decimal) of object or TLH whitespace allocations per size bin. Each bin size is 1/16th of MM_GCExtensionsBase::tlhMaximumSize. The last value (hex) is the sum of all whitespace allocation sizes.

worksize shows the number (decimal) of workspaces pulled from evacuator worklists. Each bin size is equal to MM_GCExtensionsBase::evacuatorWorkQuantumSize. The last value (hex) is the sum of all workspace volumes. These values are exact for evacuator but overestimate scavenger work sizes because work items (scan/copy caches) may appear more than once on scavenger worklists.

small shows the number of objects of size N for 0≤N≤256 and the sum of their sizes. All values are decimal and are exact.

large shows the number of objects of size between 2^N and 2^(N+1) for 8 ≤ N ≤ 32 (the first bin excludes objects of size exactly 256, which are counted in the small histogram). All values are decimal. A sketch of this binning follows the field descriptions below.

stack shows the number of activations per stack frame for evacuator runs, followed by the sum of activation counts over all frames. All values are decimal. This does not apply to scavenger runs.

idle time shows the total times GC threads spent stalled, waiting for other threads to complete scanning, or waiting to synchronize, followed by the duration of the nursery collection. All values are decimal, in microseconds.

end cycle shows the total numbers of bytes copied (survivor or tenure), scanned, copied inside stack frames, and tenured. The last three values show the number of bytes of whitespace discarded (as holes) during the collection, discarded at the end of the collection, and recycled back into the memory pool at the end of the collection. All values are hexadecimal.
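As noted above for the large histogram, the binning can be sketched as a floor(log2) computation (this is an assumed implementation, not the instrumented build's code):

```cpp
#include <cstddef>

// Compute floor(log2(size)) by shifting; e.g. sizes 257..511 fall in bin 8,
// 512..1023 in bin 9, and so on (size 256 is counted in the small histogram).
unsigned largeBinIndex(size_t sizeInBytes)
{
    unsigned bin = 0;
    while (sizeInBytes >>= 1) {
        bin += 1;
    }
    return bin;
}
```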

Benchmarking workloads

Evacuator development has been iterative, and a standard series of benchmarks has been run after each major milestone in order to track these performance metrics. Three standard benchmarks were run at each stage:

These benchmarks were executed in 5x2 runs, interleaving 5 runs with scavenger as the nursery collection algorithm with 5 runs with evacuator on the same machine. This set of benchmarking trials was executed at various stages of evacuator development, and run configuration details and results have been reported in #2664. The most comprehensive set of benchmarking trials was conducted in late autumn 2018 and included a full set of 5x2 runs as described above for each of these three benchmarks on three different computing platforms (x86_64, PowerPC 8, Z-series, each limited to 8 active CPUs) running different ports of Linux. In addition to scavenger vs evacuator with default operational parameters (256/4096 bytes maximum inside copy size/distance), a full set of 5 runs was executed for each benchmark with evacuator configured for 256/0 and 131072/0 (depth-first scanning) operation. Results from these trials are included in #2664 (see comment from Nov 14, 2018). The results reported there incorporate earlier scavenger results for the same benchmarks; these data could be aggregated with the more recent scavenger results because the scavenger algorithm had not received any updates impacting performance over the time that the data were collected.

Each of these benchmarks produces a single score at the end of each run, but the within-group variance of these scores for any given set of 5 runs is too high to support any definite statement regarding differences in average benchmark scores for evacuator vs scavenger. The differences between mean benchmark scores for scavenger vs evacuator runs are typically 2-3% and within the variance range, so no definite conclusions can be drawn from any one set of 5x2 trials; however, evacuator tended to produce superior benchmark scores on each set of trials over the course of evacuator development.

Subsequent to the November trials, further development of the evacuator algorithm was focused on reducing stack overhead for depth-first scanning, developing more conservative mechanisms for handling outside copyspace overflow and reducing concomitant fragmentation, and reducing the size of compiled evacuator code through more judicious inlining of methods. For benchmarking runs with the final (capped) build of evacuator, an additional benchmark was introduced:

The enterprise benchmark was designed to produce scores with low variance, so these scores may be given more weight, but the runtime duration limits the number of independent benchmark runs that can be undertaken. It produces two scores after each run, labeled in #2664 as score-A, which is sensitive to transactional throughput, and score-B, which is sensitive to GC throughput. This benchmark was run twice for scavenger and twice for each of 4 different evacuator configurations with varying values for maximum inside copy size and distance. All runs were executed on a single NUMA node with 30 active CPUs on an x86_64 platform running Linux. The results are also included in #2664 (see comment from January 31, 2019).

Benchmarking results

The figure below presents benchmarking results for scavenger (hierarchical scanning modality) and evacuator in default and depth-first operating modalities using a recent build (January 30 2019). The transactional, data compilation and database processing benchmarks were run 5 times for each of scavenger and evacuator (interleaved) on a single NUMA node with 8 x86_64 CPUs. The enterprise benchmark was run twice for each of scavenger and evacuator (interleaved) with a 56g heap (15.5g nursery) on a single NUMA node with 30 x86_64 CPUs. No other user processes were running at any time during any of the benchmark runs.

These results, together with results from 2 more full sets of benchmarking runs as described above, are included in the final comment in #2664.

[figure: benchmarking results for scavenger and evacuator]

Discussion

Evacuator development was undertaken on the premise that scavenger's modality of hierarchical scanning and copying within aliased scancaches increases the likelihood of cache-line containment of referring pointers and referent objects and thereby reduces level 3 cache misses and improves mutator performance. Measurement of actual cache-line containment for objects evacuated in scavenger/evacuator benchmarking trials showed that scavenger typically obtained <10% cache-line containment, and it is not clear that this premise is supportable, notwithstanding recent changes to scavenger intended to improve this metric. One reason to doubt this premise is that reference arcs dissolve when either endpoint drops out of the live set, and most objects die in the nursery. Also, the odds are against cache-line containment for any given reference arc surviving successive nursery collections even if both endpoints survive, because there is no guarantee that the parent endpoint will be scanned in the same context in successive collection cycles. Finally, neither scavenger nor evacuator has any information relating to the frequency with which any given reference arc is traversed while the mutator is running. Benchmark scores from evacuator runs in depth-first scanning configurations suggest that increasing cache-line containment actually impairs mutator performance (this may not be the case with larger cache-lines, such as on Z-series machines with 256-byte cache-lines -- see benchmarking results for Linux/Z-series trials in #2664).

Shortly after hierarchical scanning was introduced to scavenger a pronounced regression in parallelism was observed. Scavenger threads that are operating on an aliased scancache consume (scan) much of their own product (evacuated objects) and produce relatively little distributable work. To solve this, a check was introduced to inhibit aliasing whenever one or more scavenger threads were stalled. A similar effect was observed at an early stage of evacuator development, before the work distribution bus was introduced -- evacuator threads had no means to distribute work but were able to achieve aggregate throughput comparable to scavenger. This suggests that scavenger hierarchical scanning and evacuator recursive scanning within limited stack frames are inherently more efficient than breadth-first scanning, because active threads are able to discover and perform much more scan work locally, without repeatedly pulling work from a shared work pool.

Some support for this can be obtained from an evacuator metric (%inside) reported for some of the benchmarking trials in #2664 (see values for this metric in the comment from March 26, 2019). For evacuator, this is an accurate measure of the percentage of scan work that is discovered by recursively scanning objects on the stack or pulled from local outside copyspaces, as opposed to scan work pulled from evacuator worklists. This metric is difficult to measure accurately for scavenger, because scancaches can be deferred, pushed to, and pulled from worklists multiple times -- the %inside values reported for scavenger underestimate the amount of scan work that scavenger discovers as it switches between aliased scancaches. The table below summarizes the most recent trial results for evacuator. Here, benchmark scores tend to improve with increasing percentage of objects copied inside stack frames (%inside) and regress with increasing 64b cache-line containment (%<64b). The %small statistic is the percentage of total evacuated volume copied in objects of size ≤256b, which is the default value for -XXgc:recursiveMaximumInsideCopySize and the effective value for these trials. This parameter limits the size of objects eligible for copying inside evacuator scan stack frames. The score statistic is the percentage difference between benchmark scores, computed as 100*(evacuator-scavenger)/scavenger.

| benchmark | %small | %inside | %<64b | score |
| --- | --- | --- | --- | --- |
| enterprise | 87.084 | 84.095 | 34.068 | 4.082% |
| transactional | 96.74 | 89.886 | 6.158 | 3.729% |
| data compilation | 56.66 | 58.712 | 21.417 | 3.729% |
| database processing | 58.87 | 43.391 | 13.452 | 1.432% |

Conclusions

It is likely that a number of factors interact to influence mutator performance. Improved evacuation throughput produces shorter collection pause times for equivalent volumes of evacuated material and yields greater processing bandwidth for the mutator, which tends to achieve better results overall. In view of the relatively low benchmark scores for evacuator runs with depth-first scanning (maximum inside copy distance 0), it is clear that the heap layout obtained by depth-first scanning, which tends to copy the first referent object adjacent to the referring object and separates siblings by their expressed recursive volume, is suboptimal for these benchmarks even though it markedly improves cache-line containment.

High percentages of objects evacuated with 64-byte cache-line containment, such as are obtained with evacuator depth-first scanning, tend to result in no apparent improvement or even regression of benchmark scores. Regression may occur as a result of suboptimal layout of objects obtained with hierarchical or recursive scanning since these modes tend to separate sibling objects by their expressed recursive volume whereas breadth-first copying places most siblings in immediate proximity to each other. Only on Z-series machines, with 256-byte cache lines, did this metric seem to positively influence benchmark scores.

Improved evacuation throughput for scavenger's hierarchical scanning mode vs breadth-first scanning and for evacuator recursive scanning mode vs scavenger hierarchical is probably due to increased locality of workload. Scavenger threads scanning hierarchically are able to discover new scan work as they meander through aliased scancaches, essentially performing a random walk that covers the intersection of the live evacuation set and the reference graph. In that mode, they produce less distributable scan work and rely less on pulling distributable work from the queue of backlogged work, which frequently runs dry. Similarly, evacuator threads discover new scan work as they recurse into objects copied within stack frames and maintain a low volume of distributable work. A correlation between evacuator volumes copied inside stack frames and improvement in benchmark score relative to hierarchical scavenger (%inside and score) is suggested in the table above. More evidence in support of this hypothesis is available in the discussion around the most recent evacuator benchmarking trials in #2664.

I also suspect that more effective use of the level 3 instruction and data caches contributes substantially to the superior evacuation throughput obtained with evacuator. Context switches required to switch from breadth-first to aliased scan/copy caches and to switch alias between caches are disruptive of cache-lines and relatively expensive in scavenger contexts, whereas evacuator stack push/pop operations have relatively little overhead and are less disruptive of level 3 instruction and data caches. Much of this advantage accrues from the fact that the same code is executed in each stack frame and pushing preserves contiguity of the whitespace that is receiving pushed copy.

Finally, from a source code maintenance perspective, evacuator design improves substantially on scavenger because it separates, at a source level, the multithreaded (SIMD) nursery collection concerns from the single-threaded (SISD) core collector concerns. These concerns are conflated within the monolithic MM_Scavenger class (which also contains a relic implementation of breadth-first scanning that could be removed by simply always inhibiting aliasing in hierarchical scanning contexts), together with a large volume of conditionally-included concurrent scavenger source code. Evacuator design extracts the SIMD factors out of MM_Scavenger into a new MM_Evacuator class and supporting SISD factors into a new MM_Controller class that is inserted between the base MM_Collector class and MM_Scavenger. The result is a much cleaner and more maintainable expression of the nursery collection algorithm.

All of the above enhancements -- improved locality of work, more effective leveraging of level 3 instruction and data caches, and more readily maintainable source code -- can be obtained with a similar stack-based framework for any algorithm, such as marking, that is based on rooted recursive traversal of the runtime heap. In marking contexts, for example, some benefit would be expected from reduced trafficking in work packets as well as from more effective use of level 3 caches.

ktbriggs commented 5 years ago

Closing this. See also #2664.