eclipse / omr

Eclipse OMR™ Cross platform components for building reliable, high performance language runtimes
http://www.eclipse.org/omr

Epic: Evacuator evolution #2664

Closed ktbriggs closed 5 years ago

ktbriggs commented 6 years ago

This is a container for reporting investigation of prospective avenues of further development of the stack-based traversal algorithm developed for evacuator (#2361). Stable source code for the evacuator algorithm can be found in the evacuator branch of my public omr and openj9 repos:

https://github.com/ktbriggs/omr/tree/evacuator (see gc/base/Evacuator*.*pp)
https://github.com/ktbriggs/openj9/tree/evacuator (see runtime/gc_glue_java/Evacuator*.*pp)

The items below will be developed and benchmarked and reviewed here as evacuator development continues.

Notice: (Jan 29 2019) I am wrapping up work on evacuator for now and have added a Status comment for each of the items below. I will soon squash all commits subsequent to the base OMR Integration commit into one OpenJ9 Integration commit. The former commit has been through the Eclipse CQ process and previously reviewed.

-- interleave survivor and tenure scanspaces on the scan stack

Evacuators currently maintain homogeneous scanspaces on the stack -- once a work packet with survivor or tenure scan work is loaded into the bottom frame, subsequent pushes involve only objects copied to the same region, and all objects evacuated to the other region are copied to outside copyspaces. This change will allow any small object to be pushed up the stack regardless of region. This is especially important for remembered tenured arrays of small objects (eg java String). With this change all array elements remaining in survivor space will be copied inside the stack and collocated with their surviving or tenured referents.

This change is necessary and moderately difficult (2-3 weeks work).

Status: Completed: see benchmarking results below

-- push root, remembered and clearable objects onto the scan stack

Evacuated roots, referents of remembered objects, and clearable objects are currently copied breadth-first to outside copyspaces. This change will push each such object onto the (empty) scan stack and spin its referents out on the stack until the stack pops to empty. This change leverages interleaved scanspaces and is expected to further improve object collocation in survivor and tenure regions.
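A toy sketch of the intended control flow, using stand-in types and names of my own (this is not evacuator source): the root is copied and pushed onto an empty stack, and its transitive referents are spun out depth-first until the stack pops back to empty.

    #include <cstddef>
    #include <vector>

    // Toy model only: each "object" is just a list of the indices of the
    // objects it references. Evacuating a root copies it, pushes it onto an
    // empty scan stack, and scans referents depth-first until the stack pops
    // back to empty.
    struct ToyObject { std::vector<size_t> refs; bool evacuated = false; };

    void evacuateRoot(size_t root, std::vector<ToyObject> &heap)
    {
        struct Frame { size_t object; size_t nextRef; };
        std::vector<Frame> stack;
        heap[root].evacuated = true;               // "copy" the root object
        stack.push_back({root, 0});
        while (!stack.empty()) {
            Frame &top = stack.back();
            ToyObject &parent = heap[top.object];
            if (top.nextRef < parent.refs.size()) {
                size_t child = parent.refs[top.nextRef++];
                if (!heap[child].evacuated) {      // skip already forwarded referents
                    heap[child].evacuated = true;  // "copy" the referent
                    stack.push_back({child, 0});   // continue depth-first from it
                }
            } else {
                stack.pop_back();                  // frame fully scanned -- pop
            }
        }
    }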

This change is necessary and not difficult (3-5 days).

Status: Completed: see benchmarking results below

-- prioritize scanning of work in tenure region

Likelihood of referent object collocation when scanning inside a stack frame decreases over time, because the frequency of encountering already forwarded referents increases as the nursery collection proceeds. This change will pull tenure work from evacuator work lists before pulling any survivor work, which should improve referent object collocation in the tenure region.

This change is to be tested and discarded if not found to improve performance in tenure-bound workloads like SPECjvm2008/derby. It is not difficult to implement and test (2-3 days).

Status: Abandoned: no advantage found in any salient metric

-- conditionally include evacuator source code

As required to cleanly separate evacuator development from release stream while producing/enabling builds of evacuator+scavenger-enabled SDKs.

This change is necessary and somewhat difficult (1-2 weeks).

Status: Partially completed: Scavenger.cpp has been lightly modified to dispatch gc threads to run MM_Evacuator::workThreadGarbageCollect() instead of MM_Scavenger::workThreadGarbageCollect() in evacuator builds, but only when -Xgc:recursiveScanOrdering is explicitly selected as a command-line option. Scavenger operation and performance in evacuator builds are essentially identical to production builds when this option is excluded from the command line.
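For illustration, the dispatch decision amounts to something like the sketch below; only the two workThreadGarbageCollect() entry points and the command-line option are from the actual code, the stand-in types and function are mine.

    // Sketch only: models the evacuator/scavenger dispatch with stand-in types.
    struct GCExtensions { bool evacuatorEnabled; bool scavengerEnabled; };

    enum class GenerationalTask { Scavenger, Evacuator };

    // Decides which workThreadGarbageCollect() a dispatched gc thread should run.
    GenerationalTask selectGenerationalTask(const GCExtensions &extensions)
    {
        // evacuatorEnabled is set only when -Xgc:recursiveScanOrdering is given
        if (extensions.evacuatorEnabled && extensions.scavengerEnabled) {
            return GenerationalTask::Evacuator;  // MM_Evacuator::workThreadGarbageCollect()
        }
        return GenerationalTask::Scavenger;      // MM_Scavenger::workThreadGarbageCollect()
    }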

-- incorporate stack-based scanning and copying into the balanced collector

The balanced collector model generalizes the standard nursery collector model by extending the number of evacuation spaces from 3 (evacuate/survivor/tenure) to 25 (heap region ages 0-24). The gencon evacuator model can be similarly generalized to N evacuation spaces that can be interleaved on the scan stack, and the balanced collector could then employ the evacuator scanning model. This change will involve generalizing the evacuator model as required to enable balanced integration while maintaining gencon integration.

Goal is to produce one evacuation engine that can be used with gencon or balanced policies. This is a big deal and difficult (3-4 months).

Status: Not attempted.

-- recognize and chain recursive structures

The evacuator stack seldom overflows except when a recursive structure (eg, linked list) is being scanned. For applications with deep recursing chains where elements have substantial payloads of mostly small objects, this can become problematic for the processing evacuator, as it may spend an inordinate amount of time processing chains in the structure without generating much outside copy. Evacuator can recognize this condition (stack overflow) and flush the stack, examining every scanned reference for a recursive relationship with its parent (class equivalence). If a recursive reference is found, it can then be scanned and copied in a manner that copies recursing referents inside the stack and copies other referents to outside copyspaces, thereby distributing the work of processing these chains.

This is an important change for applications that use recursive structures involving long reference chains (eg, SPECjvm2008/xml.validation, SPECjbb2015). Plots of nursery collection times for these applications show large spikes and these have been identified as long linked lists with substantial referent material in each linked node. This change should eliminate these spikes. Moderately difficult (2-3 weeks), as time and priorities permit.

Status: Partially completed: The original idea behind this was to use stack overflow as a signal that a recursive structure is being scanned. In that case, it was believed that a recurring link could be detected by 1) setting inside copy distance = 0 when the stack pointer returned to the bottom frame (to force depth-first scanning) and 2) checking the linear chain of objects being scanned for sub/superclass relationships if the stack overflowed again. This would be viable except for the common case (eg, XML DOM) where the recurring links are embedded in classes that have a common superclass but no direct sub/superclass relationship.

Instead, this is partially addressed by setting inside copy distance = 0 when the stack pointer returns to the bottom frame after stack overflow and then flushing all inferior frames to outside copyspaces as the stack unwinds. For branching recursive structures like XML DOM, this will push all but the first element scanned in each frame outside, and these elements may include other recursing chains that would otherwise be copied inside the stack. Some of this outside work may be distributable to other evacuator threads.

The impact of this change is reflected in improved cpu utilization stats for evacuator in "final" benchmarking results below.

-- copy objects depth-first on the scan stack

The current evacuator algorithm uses the same aliasing technique as scavenger, but more successfully, as it constrains the copy-scan distance to the width of a stack frame (~4kb). As a result only about ~20% of all copy operations result in 64 byte cache-line containment of the referring pointer and the head of the copied referent, because the copy-scan distance tends to increase as the scan head moves away from the base of the frame. If referent objects were always copied up to the next stack frame, then the first pushed referent from any object would be located contiguously with its parent object. It is expected that this change would improve cache-line containment from ~20% to 40-50% and, when combined with hot field selection for the first scanned referent (see #2663), this could result in a measurable boost to application throughput.

This is experimental, not planned, and moderately difficult (2-3 weeks).

Status: Completed: Depth-first scanning is obtained when -XXgc:recursiveMaximumInsideCopyDistance is set to 0 on the command line. The results are ambiguous -- this mode of operation typically increases cache-line containment into the 30-50% range (2-3X greater than with default operational parameters), but the impact on application throughput is inconsistent and difficult to separate from reduced GC throughput due to additional stack overhead. This feature could be greatly improved if JIT could provide information regarding "hotness" of fields per class, in which case object scanners could be tuned to return the hottest reference slot first and the referenced "hot" field would then be copied adjacent to the referring object. See benchmarking results below.

-- apply stack-based depth-first scanning algorithm to standard marking

The standard marking scheme uses a breadth-first algorithm to mark objects and enqueue them for scanning. If the nursery collector uses a stack-based depth-first algorithm for evacuating objects to survivor and tenure space, the frequent close proximity of parent and referent objects can be leveraged to greater effect if the marking algorithm also uses a stack-based depth-first algorithm. The performance boost for marking would then accrue from selecting a proximate referent object for scanning and from reduced trafficking in work packets (since most marking and scanning would occur on the stack).

This is just a thought, unplanned.

Status: Not attempted: I do believe that marking performance would benefit from stack-based scanning. In fact, I am convinced that improved Evacuator and Scavenger performance vs breadth-first copying is mainly due to reduced trafficking of work units between gc threads, and this advantage would accrue to marking as well if a similar stack-based traversal were to be employed.

-- replace single-threaded heap scan in MM_PhysicalSubArenaVirtualMemorySemiSpace::contract()

This method walks the nursery to locate and map addresses affected by nursery contraction and is single-threaded because the mark map may not be up to date at the time of contraction. This can be parallelized, as for evacuator, by recursively walking objects starting from the root and remembered sets.

This is just a thought, unplanned.

Status: Not attempted.

ktbriggs commented 6 years ago

Resurrected and continuing evacuator development here: https://github.com/ktbriggs/omr/tree/evacuator. This branch includes the previously reviewed OMR Integration commit, unchanged modulo rebasing, and antecedent commits relating to development of this epic. It is continuously rebased from eclipse/omr/master; the HEAD commit is generally volatile and in test.

Until a new PR is submitted for this work source can be reviewed at https://github.com/eclipse/omr/compare/master...ktbriggs:evacuator. Please post review comments and discussion relating to evacuator source or performance here. Benchmarking results will be posted here at appropriate milestones.

ktbriggs commented 6 years ago

Baseline results for WIP Evacuator: Baseline for 2664 development are below. Each benchmark was run 5 times with each gencon algorithm on an 8-core linux vm running rhel6 (rhel6x64cudavm1). Heap size was fixed at -Xms=-Xmx=2G for the transactional benchmark, 1280M for the data compilation and database processing benchmarks.

Performance metrics from gc logs.

algorithm benchmark gc-count gc-ms gc-%cpu gc-kb/ms duration-ms
scavenger transactional 82176 17.516 99.764 1487 1439441
evacuator transactional 83311 14.476 99.716 1797 1206052
scavenger data compilation 6549 32.662 99.627 1494 213904
evacuator data compilation 6629 27.200 99.535 1775 180312
scavenger database processing 10889 1.560 96.861 1026 16992
evacuator database processing 10768 1.427 96.082 1134 15366

Percent differences (100*(evacuator-scavenger)/scavenger) for nursery collection throughput (gc-kb/ms), duration (duration-ms) and benchmark throughput (scores) for these runs.

benchmark gc-kb/ms duration-ms score
transactional 20.847% -16.214% 1.422%
data compilation 18.809% -15.714% 2.200%
database processing 10.526% -9.569% 0.050%
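For example, for the transactional rows in the tables above, 100*(1797-1487)/1487 ≈ 20.8% (gc-kb/ms) and 100*(1206052-1439441)/1439441 ≈ -16.2% (duration-ms), which are the values shown in the first row of this table.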

Benchmark throughput improvement as measured by final benchmark score vs scavenger was not commensurate with the improved gc throughput and reduced time spent in nursery collections. This may be due in part to breadth-first copying of root and remembered object referents in evacuator. For small objects like java String, breadth-first copying results in discontinuities between an object and its evacuated referents. Further improvement in this regard may be expected with the next commit, which will spin root and remembered object referents out recursively on the evacuator scan stack.

ktbriggs commented 6 years ago

Benchmarking results for WIP Evacuator: Recursively scan root/remembered referent objects are below. Each benchmark was run 5 times with each gencon algorithm on an 8-core linux vm running rhel6 (rhel6x64cudavm1). Heap size was fixed at -Xms=-Xmx=2G for the transactional benchmark, 1280M for the data compilation and database processing benchmarks.

Performance metrics from gc logs.

algorithm benchmark gc-count gc-ms gc-%cpu gc-kb/ms duration-ms
scavenger transactional 83015 17.837 99.769 1460 1480777
evacuator transactional 83418 14.554 99.743 1791 1214085
scavenger data compilation 6623 32.893 99.675 1481 217853
evacuator data compilation 6773 26.994 99.531 1790 182834
scavenger database processing 11112 1.518 97.544 1034 16873
evacuator database processing 10912 1.499 96.217 1063 16364

Percent differences (100*(evacuator-scavenger)/scavenger) for nursery collection throughput (gc-kb/ms), duration (duration-ms) and benchmark throughput (scores) for these runs.

benchmark gc-kb/ms duration-ms score
transactional 22.671% -18.010% 0.750%
data compilation 20.864% -16.074% 3.787%
database processing 2.805% -3.113% 1.955%

This shows some improvement in benchmark throughput with evacuator, as expected, without sacrificing gc throughput. The first run of the transactional benchmark produced a benchmark score that was ~25% lower than the mean, which may be due to environmental factors. Omitting this and the lowest scavenger score, there was a 2.557% improvement with evacuator vs scavenger.

These results are corroborated by identical trials run on an 8-core linux vm running ubuntu 16.04 (fyre vm). From those trials, the percent differences in benchmark scores were 8.951%, 6.310% and 3.058% for the transactional, data compilation and database processing benchmarks, respectively.

Further improvement in benchmark throughput is expected with the next commit, which will permit interleaving of survivor and tenure frames on the scan stack.

ktbriggs commented 6 years ago

Note that the gc-%cpu stats in the tables above are misleading. These are calculated from values in the verbose gc logs:

<gc-end ... type="scavenge" ... durationms=R usertimems=U systemtimems=S ...>

as (U+S)/(8*R)% for 8 CPUs. But U includes time the CPU was stalled with no threads runnable, as when evacuators stall, end, or synchronize. Since java was the only user process running, these voluntary wait states almost always stalled the CPU, so the parallelism stats reported in the gc logs are misleading.
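For illustration (the values here are made up), a collection reporting durationms=16, usertimems=120 and systemtimems=6 would yield 100*(120+6)/(8*16) ≈ 98.4% on this 8-CPU host, even if much of that user time was accumulated while gc threads were in voluntary wait states.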

More precise parallelization stats are maintained in the runtime scavenger stats, which evacuator also maintains, including counts and elapsed microseconds per gc thread for stall, end and synchronization events, but these are not (yet) reported in gc logs. Evacuator builds with EVACUATOR_DEBUG_ALWAYS defined in EvacuatorBase.hpp enable reporting of these stats at the end of each generational collection for scavenger (default gencon algorithm) and evacuator (gencon + -Xgc:recursiveScanOrdering) operation. Additionally, histograms for object sizes, work packet sizes and cache sizes are reported, along with counts for reference arc cache line containment (64/128/256 byte cache lines), etc. There is minimal instrumentation overhead with these builds, which are otherwise optimized as for release builds.

Also, I reported '... no global collections after the first nursery collection ...' for the 2 commits above. In fact, there were 25-28 global collections for each of the data compilation runs (scavenger and evacuator). I'm going to produce instrumented builds for these 2 commits and redo the benchmarking runs on a more generic machine (not ...cuda...) with a larger heap for the data compilation runs. Will report the results with more precise global gc, CPU utilization and cache line containment stats here.

Note: More detailed reporting for evacuator runs can be obtained by defining EVACUATOR_DEBUG in EvacuatorBase.hpp (this also enables 100s of assertions in evacuator code that incur substantial overhead). Debug evacuator builds recognize an environment variable EVACUATOR_DEBUG_FLAGS containing a bitmap of diagnostic categories to print to stderr when java is running with gencon/evacuator. To obtain output at end of generational gc as for EVACUATOR_DEBUG_ALWAYS, export EVACUATOR_DEBUG_FLAGS=1; see EvacuatorBase.hpp for other diagnostic flags.

ktbriggs commented 6 years ago

Design note for WIP Evacuator: Interleave survivor/tenure scanspaces

Work on this feature has revealed a couple of evacuator design flaws. I will finish and benchmark this commit and publish the results here; expect significant regression due to poor parallelism.

The first flaw relates to flushing frames from the top of the stack down in the presence of stalled evacuator threads. The frames at the top of the stack are less dense with live evacuation references than frames below, and the products of flushing them are even less dense and make light work for the stalled evacuators that pick them up. Instead, running evacuators should service stalled evacuator threads by flushing work to their worklist from the bottom frame up, in pushed order, until they accumulate some minimum volume of distributable work and notify. The contributing evacuator will just pop over these empty frames on the way down after the stall condition clears.

It might also be helpful in stalling contexts to reduce evacuator work release thresholds to approximate a related scavenger statistic -- mean bytes scanned between alias-to-copy-cache switching events -- to better distribute loads away from threads that are generating and consuming inordinate volumes of work without generating significant outside work. For example, when flushing scalar and primitive array objects from frames to outside copyspaces, release 512-byte workspaces, and split all pointer arrays with >1 elements into two split array workspaces (including previously split segments -- splitting is better than flushing for, eg, java string arrays).

The second flaw relates to the choice of 256 bytes as the limit for which objects can be copied inside the stack. Ideally, to maximize cache line containment, all objects that are not large pointer arrays (which must be split into segments for parallel scanning) should be scanned and copied within the stack. The minimum size of a split array segment (~4k) is a pivotal constant in this regard and should determine both the maximum object size for inside scanning and copying and the frame copy limit.

I'll undertake these changes next after benchmarking interleaved frames with the warts in.

With that said, and two mea culpas in, I'm going to lay on a little self-kudos for evacuator design. As evacuator takes shape I am finding it very workable and easy to make important changes like the two cases raised here. There are good reasons for this.

The main win comes from separating evacuator (the generational collection task) from the collector (MM_Scavenger minus code reachable from MM_Scavenger::workThreadGarbageCollect(MM_EnvironmentStandard, ...), which implements the generational collection task for scavenger). The collector's main concerns relate to whole-heap dynamics, including initiating and percolating collection cycles. Its nexus with the generational collection task is a single point where control transfers to 1 or more GC slave threads that effect the generational collection task. Separating these components simplifies and clarifies their implementations and admits a 1-1 binding of evacuator with thread/environment that obviates passing MM_EnvironmentStandard between evacuator methods.

Another win comes from the articulation of a family of specialized spaces, each minimally defining a contiguous range of addresses in survivor or tenure spaces. So evacuator has whitelists of whitespaces (reserved survivor or tenure space), a worklist of workspaces (series of contiguous objects copied but not scanned), two outside copyspaces (space with a copy head, with work behind and whitespace ahead), a large object copyspace (for receiving large objects to dispatch to the worklist), and a stack of scanspaces (a copyspace with a scan head, with completed work behind and work and/or whitespace (pop!) ahead). Work and whitespace can be moved between different types of spaces with simple operators that effect the necessary pointer transforms.
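A rough sketch of the shape these spaces share, with the distinctions expressed as pointer relations (the names and layout below are illustrative, not evacuator source):

    #include <cstdint>

    // Illustrative only: every space is a contiguous address range; the kinds
    // differ in where the scan and copy heads sit within the range.
    struct ToySpace {
        uint8_t *base;   // start of the contiguous range
        uint8_t *scan;   // scan head: completed (scanned) work lies behind it
        uint8_t *copy;   // copy head: copied-but-unscanned work behind, whitespace ahead
        uint8_t *end;    // end of the range
    };

    // whitespace: base == scan == copy <  end   (all free space)
    // workspace:  base == scan <  copy == end   (all copied, none scanned)
    // copyspace:  base == scan <= copy <= end   (work behind the copy head, white ahead)
    // scanspace:  base <= scan <= copy <= end   (scan head chases the copy head)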

Moreover, evacuator spaces allow the optimal size for whitespace reservation and the threshold for releasing work from copyspace to worklist to be determined independently. This is in contrast to scavenger's MM_CopyScanCacheStandard, which makes work packet size dependent on the cache allocation size. This entanglement becomes a nuisance when, for example, scavenger attempts to increase responsiveness to stalling threads by reducing work packet size.

@RSalman, @amicic, @charliegracie some of the above might be helpful in your current work with scavenger (#2902, ...).

ktbriggs commented 6 years ago

Further to my comment above regarding the CPU usage stats reported in the GC logs, there is also a problem with the internal idle-time stats maintained in MM_ScavengerStats. These stats (stall, sync, completion counts and times) do not give a complete picture of total CPU time available and time utilized because they do not include the time between the point of thread dispatch and the time the thread actually kicks off and starts running. I think this value should be tracked in scavenger stats, and that the endpoints for the cycle should be the time the first thread enters its main method and the time the last thread leaves its main method. I also think these stats should be normalized per the %cpu allocated to the java process, as currently reported in the <gc-end ... type="scavenge" ... durationms=R usertimems=U systemtimems=S ...> clauses in GC logs, and reported in the GC log. I will create an issue for this.

ktbriggs commented 6 years ago

I'm pulling back the benchmarking results for WIP Evacuator: Interleave survivor/tenure scanspaces previously published here. There was a bug tripping up frame interleaving that accounted for much of the reported regression. Also, the workflow for consuming local work from outside copyspaces and the evacuator's own worklist needs to revert to pulling from the local worklist before pulling from outside copyspaces, as in the baseline commit. I'll report on this commit when I have done more testing and benchmarking.

ktbriggs commented 6 years ago

The most recent commit, WIP Evacuator: Interleave survivor/tenure scanspaces (instrumented), is on the main evacuator development branch that I report results on here. It has instrumentation enabled to write summary metric data per nursery collection and selects evacuator as the default generational task for gencon, to enable perf testing of an SDK built from this commit vs the scavenger default in the base SDK. The preceding 2 sets of 3 commits relate to enabling instrumentation of previous commits on a different evacuator branch and can be ignored.

ktbriggs commented 6 years ago

Benchmarking results for WIP Evacuator: Interleave survivor/tenure scanspaces (instrumented) are below. Each benchmark was run 5 times with each gencon algorithm on an 8-core linux vm running rhel6 (rhel6x64cudavm1). Heap size was fixed at -Xms=-Xmx=2G for the transactional benchmark, 1280M for the data compilation and database processing benchmarks.

Changes included in this commit were:

Performance metrics from gc logs.

algorithm benchmark gc-count gc-ms gc-%cpu gc-kb/ms duration-ms
scavenger transactional 40327 19.391 99.793 1340 782000
evacuator transactional 41627 16.262 99.792 1598 676976
scavenger data compilation 5345 45.274 99.873 1364 241990
evacuator data compilation 5573 37.022 99.796 1669 206324
scavenger database processing 8618 1.723 97.784 1068 14854
evacuator database processing 8999 1.622 96.677 1118 14601

Percent differences (100*(evacuator-scavenger)/scavenger) for nursery collection throughput (gc-kb/ms), duration (duration-ms) and benchmark throughput (scores) for these runs.

benchmark gc-kb/ms duration-ms score
transactional 19.253% -13.430% 4.513%
data compilation 22.361% -14.739% 3.787%
database processing 4.682% -1.703% 0.647%

Evacuator gc throughput (gc-kb/ms) and duration (duration-ms) have regressed slightly from the previous commit, but application throughput as measured by benchmark score has improved, and that is the primary outcome of interest. While none of the results reported here and with previous commits can be said to be significant in a statistical sense, there has been a definite trend toward improved application throughput. This commit was expected to produce a greater prevalence of close copying (parent reference and head of referent object collocated within a cache line), particularly with respect to small objects like Java String comprised of a single reference to an associated char array. While no increase in collocation was seen in these runs (see table below), this build does ensure that arrays of such objects are handled by pushing each element and copying the referent object next to it, so I suspect the application is getting some benefit in this respect.

This build was instrumented so that evacuator and scavenger output additional internal metrics derived from MM_ScavengerStats per nursery cycle. These provide a more detailed and accurate picture of the percentage of available cpu resources used by evacuator and scavenger threads during nursery collections. Also reported are statistics relating to the prevalence of small objects (%small, <=256b) and proportion (%inside) of scanned material that was generated and consumed inside aliased caches (scavenger) or inside scan stack frames (evacuator), as opposed to pulled from outside (ie, from work queues). For evacuator, the difference %small - %inside is a measure of the volume of copy that was forced to be flushed outside the stack in order to produce work for stalled evacuators. These stats are shown in the table below, together with the percentages (%<64b, %<128b, %<256b) of object evacuations that resulted in cache line containment for the reference arc (64-, 128- and 256-byte cache lines as for x86-64, powerpc, and z-series platforms, respectively).

Performance metrics from internal metrics (cache line containment percentages are cumulative).

algorithm benchmark gc-%cpu %small %inside %<64b %<128b %<256b
scavenger transactional 97.469 97.357 73.442 6.218 9.791 16.579
evacuator transactional 98.803 97.389 90.439 12.626 18.031 26.165
scavenger data compilation 96.150 77.341 93.199 2.878 5.971 11.853
evacuator data compilation 96.396 77.387 70.348 8.989 17.753 29.022
scavenger database processing 76.808 51.638 75.573 12.090 14.064 15.481
evacuator database processing 73.249 52.273 29.720 16.321 19.237 21.613
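For example, applying the %small - %inside measure to the evacuator rows above gives roughly 97.4 - 90.4 ≈ 7 points for the transactional benchmark versus 52.3 - 29.7 ≈ 23 points for database processing, suggesting (by this measure) that a much larger share of copy was flushed outside the stack in the latter workload.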

This is the first time I have reported stats for cache line containment; SDKs for previously reported commits were from production builds and not instrumented to report these metrics. I have run instrumented builds privately on an 8-core ubuntu vm for the previous commits, and containment metrics have remained stable since the original commit (Evacuator (OMR integration)).

For previous commits, with homogeneous stack frames, containment metrics generally looked about like these:

algorithm benchmark %<64b %<128b %<256b
scavenger transactional 6.230 9.410 15.005
evacuator transactional 12.091 16.975 23.956
scavenger data compilation 3.152 6.196 11.959
evacuator data compilation 8.480 16.734 27.333
scavenger database processing 10.932 12.649 13.874
evacuator database processing 15.635 18.545 20.900

Containment has improved slightly for larger cache lines with this commit, but this effect is limited by the proportion of live object material evacuated across the nursery/tenure boundary. The next few commits will focus on improving cache line containment in order to see what the differential effects of cache line containment are on application and gc throughput.

ktbriggs commented 6 years ago

Benchmarking results for WIP Evacuator: Options for evacuator tuning (instrumented) are below. Changes included in this commit are:

The default copy span recursiveMaximumInsideCopyDistance=0 forces depth-first scanning and copying inside the stack. The default copy size recursiveMaximumInsideCopySize=256 admits most scalar objects and small pointer arrays for copying inside the stack for optimal performance. These changes plug neatly into the existing scanning loop, but the loop is otherwise unchanged and is optimized for copying inside the frame. For this commit, as for previous commits, each benchmark was run 5 times with each gencon algorithm on an 8-core linux vm running rhel6 (rhel6x64cudavm1). Heap size was fixed at -Xms=-Xmx=2G for the transactional benchmark, 1280M for the data compilation and database processing benchmarks.

I ran the 5x{scavenger,evacuator} benchmark suite with the default recursiveMaximumInsideCopySize=256 for three recursiveMaximumInsideCopyDistance (maximum inside copy span) configurations:

max span stack action
0 bytes scan and copy depth-first always
64 bytes scan and copy inside frame until span >64 bytes then depth-first until end of frame
768 bytes scan and copy inside frame until span >768 bytes then depth-first until end of frame

I am reporting only the most salient outcomes for these runs, carrying forward outcomes from WIP Evacuator: Interleave survivor/tenure scanspaces (instrumented) for comparison (these correspond roughly to a (768,256) configuration for (span,size)).

Impact on benchmark throughput (score) is ambiguous owing to the small sample size (5 runs) and relatively high within-group variance. To improve this, I have consolidated aggregate results from the 20 scavenger trials from this commit and WIP Evacuator: Interleave survivor/tenure scanspaces (instrumented). Aggregation is possible since the scavenger algorithm has remained stable and all runs used the same heap configuration and JVM options on the identical platform.

I will continue to run trials on rhel6x64cudavm1 but will run 10x{evacuator} per benchmark trial instead of 5x{evacuator,scavenger} and use this reporting format going forward. The results from these trials will be comparable to the results reported here from the aggregate 20x{scavenger} trial with instrumented SDKs.

All values in the tables are averages:

scan-gb

Average volume (gigabytes) evacuated to survivor or tenure space per benchmark run (N runs per trial).

%tenured

Average percentage of volume evacuated to tenure space per benchmark run.

scan-kb/ms

Average evacuation throughput (kilobytes/millisecond) per benchmark run.

%cpu

Percentage of available cpu-seconds (number of gc threads * evacuation duration) that gc threads were not stalled waiting for work or end of scan cycle or waiting on a synchronization event. This is slightly inflated as it does not discount time waiting for gc thread to join the evacuation cycle (this is an important factor in the database processing benchmark, which presents very low volume nursery live sets).

%inside

For evacuator, the percentage of evacuation volume that was copied inside the evacuator scan stack. For scavenger, this approximates the percentage of evacuation volume that was scanned without pulling scan-copy caches from the distributed worklist. Worklisted caches include all root and remembered referents and scanned referents flushed breadth-first in the contexts of stalled gc threads. It is a slight underestimate because caches can infrequently be worklisted >1 times.

%<cache line size in bytes>b

The percentage of scanned-slot/copied-referent reference arcs with endpoints collocated within the same cache line, aka cache line containment.

Aggregate scavenger benchmarking results (rhel6x64cudavm1)

Scavenger only, N=20 for each benchmark (instrumented SDKs only).
benchmark scan-gb %tenured scan-kb/ms %cpu %inside %<64b %<128b %<256b
transactional 198.984 .178 1349.224 97.500 73.296 7.443 11.717 19.838
data compilation 63.661 13.987 1402.144 96.200 93.248 3.258 6.768 13.449
database processing 3.038 14.489 1086.117 76.300 75.973 14.346 16.685 18.359

WIP Evacuator: Interleave survivor/tenure scanspaces (instrumented)

Evacuator only, N=5
benchmark scan-gb %tenured scan-kb/ms %cpu %inside %<64b %<128b %<256b
transactional 206.276 .177 1597.489 98.900 90.439 17.100 24.421 35.438
data compilation 65.685 14.609 1669.114 96.400 70.348 12.665 25.013 40.889
database processing 3.113 14.424 1117.677 73.300 29.720 20.821 24.541 27.573

Percentage difference in total benchmark throughput (scores) per trial (100*(evacuator-scavenger)/scavenger)

span transactional data compilation database processing
frame 4.513 3.787 0.647

WIP Evacuator: Options for evacuator tuning (instrumented)

Evacuator only, recursiveMaximumInsideCopyDistance=0, N=5
benchmark scan-gb %tenured scan-kb/ms %cpu %inside %<64b %<128b %<256b
transactional 196.664 .185 1379.045 99.100 92.399 47.926 60.708 68.243
data compilation 63.364 14.191 1405.798 96.400 84.079 45.221 51.947 57.246
database processing 3.059 14.305 1070.861 73.700 31.983 25.606 29.016 31.298

Percentage difference in total benchmark throughput (scores) per trial (100*(evacuator-scavenger)/scavenger)

span transactional data compilation database processing
0 2.369 0.456 0.529
Evacuator only, recursiveMaximumInsideCopyDistance=64, N=5
benchmark scan-gb %tenured scan-kb/ms %cpu %inside %<64b %<128b %<256b
transactional 204.161 .172 1326.400 99.400 52.937 35.994 39.505 41.504
data compilation 63.593 14.621 1418.449 96.800 64.563 44.137 50.839 56.286
database processing 3.047 14.205 1145.872 76.100 28.377 24.921 28.072 29.905

Percentage difference in total benchmark throughput (scores) per trial (100*(evacuator-scavenger)/scavenger)

span transactional data compilation database processing
64 2.789 -.923 1.062
Evacuator only, recursiveMaximumInsideCopyDistance=768, N=5
benchmark scan-gb %tenured scan-kb/ms %cpu %inside %<64b %<128b %<256b
transactional 202.087 .178 1530.679 98.900 92.247 28.089 36.979 46.496
data compilation 66.418 14.969 1637.534 96.300 85.706 11.041 22.525 38.891
database processing 3.070 14.367 1123.113 73.300 31.751 20.978 24.739 28.019

Percentage difference in total benchmark throughput (scores) per trial (100*(evacuator-scavenger)/scavenger)

span transactional data compilation database processing
768 2.368 5.267 0.062

Depth-first scanning and copying produced a significant increase in cache-line containment, but gc throughput regressed. Evacuator in this commit is still optimized for scanning and copying within frames until the distance between the scan and copy heads exceeds the preset span limit. For the next commit I will optimize the scanning loop for depth-first operation, with a view to eliminating the span limit preset option entirely and always scanning and copying depth-first, if the refactored scanning loop produces throughput approaching evacuator's best in previous commits. I have two reasons for this:

  1. depth-first will almost always produce optimal cache-line containment
  2. depth-first admits a simple solution to the problem of identifying and parallelizing linked structures involving long chains of instances of self-referencing classes

The results for the data compilation benchmark trials suggest that linked structures are prevalent in this workload. For example, its highest benchmark throughput was realized with recursiveMaximumInsideCopyDistance=768. If the total volume of material expressed by a single linked element is <768 bytes, this configuration will allow scanning and copying to continue until the whitespace attached to the frame is exhausted, forcing a new whitespace allocation and a push up to the next frame. Since whitespace reservations for the stack are up to 128kb and the stack is bounded at 32 frames, up to 4mb of material can be copied inside the stack before it overflows into outside copyspaces.

The next commit will focus on optimizing the stack scanning loop for depth-first scanning and copying.

ktbriggs commented 6 years ago

I am attaching some images that may help to illustrate how evacuator works. These describe evacuator circa WIP Evacuator: Interleave survivor/tenure scanspaces (instrumented).

heap

stack

evacuator

transducer

ktbriggs commented 6 years ago

Depth-first scan stack operation is a WIP. I've taken a first run at optimizing the scanning loop for depth-first operation, and the initial results are disappointing but provide some insight into evacuator bottlenecks. I instrumented the stack to count frame activations (a frame is activated when an object is pushed into the frame or a workspace is pulled into the bottom frame).

All objects, big and small, scalar or array, are scanned on the stack. The scan stack is bounded at 32 frames. When it overflows, as it frequently does with the data compilation benchmark, objects are copied to outside copyspaces until the stack pops. Flushing in single-threaded collections occurs only when the stack overflows, when the evacuator is presented with an object too large for inside copy (>256 bytes), or when an object is too large to fit in the remaining stack whitespace and that whitespace is not small enough to discard (>32 bytes). In multithreaded collections, running evacuators flush all copy outside whenever one or more other evacuators are stalled, responding to changes in stall conditions after each object is evacuated.
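The outside-copy decision just described can be summarized in a sketch like the following (the names and constants are taken from the description above, not from evacuator source):

    #include <cstddef>

    const size_t maxInsideCopySize   = 256; // objects larger than this are never copied inside
    const size_t minDiscardableWhite = 32;  // whitespace remainders smaller than this are discarded

    // Illustrative predicate: should this object be copied to an outside copyspace?
    bool copyOutside(size_t objectSize, size_t remainingStackWhitespace,
                     bool stackOverflowed, bool otherEvacuatorsStalled)
    {
        if (otherEvacuatorsStalled) {
            return true;        // flush all copy outside to produce distributable work
        }
        if (stackOverflowed || (objectSize > maxInsideCopySize)) {
            return true;        // stack too deep, or object too large for inside copy
        }
        if ((objectSize > remainingStackWhitespace) && (remainingStackWhitespace > minDiscardableWhite)) {
            return true;        // doesn't fit, and the remainder is too big to just discard
        }
        return false;           // otherwise copy inside the stack
    }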

Below is a histogram showing the average proportion of activations per frame per nursery collection for the three benchmarks I've been tracking. These data were obtained from one run per benchmark with only one gc thread, to remove flushing response under stall conditions from the picture.

image

Those are relative to the overall volume of material that is copied inside the stack versus copied to outside copyspaces, shown below.

image

The data compilation benchmark presents a lot of linked structures, which force material to outside copy when they overflow the stack. The database processing benchmark, I think, presents a large number of array structures, which repeatedly force outside copying whenever the remainder stack whitespace is not discardable and too small to receive an array element. In those cases the optimized path for copying inside is used less frequently and evacuation slows down.

While all objects are scanned on the stack, objects larger than recursiveMaximumInsideCopySize are always copied outside the stack. For these runs, with compressed (4-byte) object references, the default setting for this parameter is 256 bytes, and the proportions of objects eligible for copying into the stack are shown below (these are properties of the benchmark and are largely invariant between runs). While this chart presents object counts and the chart above presents object volumes, some insights can be obtained by comparing them.

With the transactional benchmark almost all objects are copied inside the stack without overflowing, and from previous runs with inside copy size limited to 256 bytes we know that ~97% of evacuated objects are ≤256 bytes in size, so this workload obtains the best cache line containment: >50% for this machine with 64-byte cache lines (bottom chart). The data compilation benchmark presents almost 97% of objects as eligible for inside copying, with ~75% of objects ≤256 bytes in size, but arranges them in recursing linked chains with length >32 which force about 20% of evacuated volume to be flushed to outside copyspaces when the stack overflows. With the database processing benchmark less than half of the objects presented are eligible for inside copying, and almost all of these are ≤256 bytes; more than half are >256 bytes in size and are almost certainly arrays. If there are any recursing linked structures in the database processing workload they have maximal chains of length <32, since there are no instances of stack overflow with this workload.

image

Cache line containment will almost always be maximal with only one evacuator thread, since containment degrades as flushing to outside copyspaces increases under pressure from stalled threads. So this is about as good as it gets for these benchmarks:

image

And that is pretty good compared with the cache line containment reported for the multithreaded scavenger and evacuator runs above. However, benchmark throughput has not improved; it has even degraded slightly. I suspect that cache line containment has little impact on application performance because reference arcs in the nursery decay along with their endpoints, that is to say, very quickly. The most important determinant of the effect of gc on benchmark performance to date has been gc throughput, which has regressed with depth-first scanning (it is about equal to scavenger throughput).

The question of how cache line containment contributes to benchmark throughput may be settled by a scavenger/evacuator run-off with single-threaded collection and depth-first evacuation in the evacuator SDK. If there is no significant between-group difference in gc throughput, any differences in benchmark throughput may be attributable to the differences in cache-line containment.

ktbriggs commented 6 years ago

Results from a series of 10x{evacuator} trials with WIP Evacuator: Optimize for depth-first scanning (instrumented) are below. These were run on an 8-core linux vm running rhel6 (rhel6x64cudavm1) as before. Heap size was fixed at -Xms=-Xmx=2G for the transactional benchmark, 1280M for the data compilation and database processing benchmarks. These results are comparable with the aggregate scavenger trial results reported above (see Aggregate scavenger benchmarking results (rhel6x64cudavm1)).

Changes in this commit related to optimizing the evacuator scan stack for depth-first operation.

Evacuator only, depth-first scanning, N=10
benchmark scan-gb %tenured scan-kb/ms %cpu %inside %<64b %<128b %<256b
transactional 199.126 .182 1317.957 98.200 89.482 45.779 57.722 64.704
data compilation 63.841 14.936 1376.604 96.500 85.809 45.964 52.793 58.178
database processing 3.045 14.241 941.811 60.900 35.107 21.203 24.132 26.137

Percentage difference in total benchmark throughput (scores) per trial (100*(evacuator-scavenger)/scavenger)

transactional data compilation database processing
0.601 0.703 0.489

Optimization did little to improve evacuator gc throughput (scan-kb/ms) over the depth-first (recursiveMaximumInsideCopyDistance=0) evacuator trials with the previous commit WIP Evacuator: Options for evacuator tuning (instrumented). Pushing an object up to the next frame may involve parking foreign (other region) remainder whitespace from the next frame and/or pulling remainder whitespace from the frame that most recently held whitespace for the region that will receive the evacuated object from the slot being scanned. Transitions between regions are the uncommon case; the transactional benchmark tenures <1% of evacuated volume, but shows the same magnitude of regression as the other two benchmarks. The bottleneck in the current commit is that for most push operations whitespace for copy must be pulled into the pushed frame from the frame that most recently used it, and this will not be the pushed frame unless the previously scanned object on the pushing frame was a leaf object copied into the same region. This is irreducible, and the overhead is a serious drag on evacuator stack operation.

These results do lend some support for gc throughput as the dominant factor in the improvement in benchmark throughput seen with previous commits. The transactional benchmark is the most sensitive to gc performance -- each trial runs for ~40 minutes compared to 5 minutes for the other two benchmarks and evacuates a significant volume of its allocation space on each gc (the database processing benchmark evacuates only ~1% of allocation space at steady state). In these evacuator trials, gc throughput has regressed to about the same level as scavenger for the transactional benchmark and 64-byte cache line containment is >6X greater, yet there is only a slight improvement in benchmark throughput.

ktbriggs commented 6 years ago

Here are typical verbose gc plots showing nursery size and volumes of live material evacuated to survivor and tenure spaces for the three benchmarks included in these trials. The tilt ratio is the volume of the evacuation space as a percentage of the nursery size. The amounts flipped and tenured are the volumes evacuated to survivor and tenure space, respectively.

These are from evacuator trails with the most recent (depth-first) commit.

Transactional (tilt ratio ~89%)

image

Data compilation(tilt ratio ~77%)

image

Database processing (tilt ratio ~90%)

image

The database processing benchmark at steady state will never show any appreciable benefit from increasing cache line containment, as almost all live objects die early and are never evacuated, or die early in the next generation if they are flipped. In any case cache line containment is short-lived, as reference arcs die when either endpoint dies. The high volume of evacuation seen in the first two minutes relates to loading the database into RAM -- this time is not factored into the final benchmark score, which reflects performance during the last four minutes at steady state. In previous work with this benchmark, in the context of identifying cold objects in a generational heap, I observed that >90% of object references involve tenured objects. Cache line containment obtained when live objects are tenured is persistent in the heap image of the database, but benchmark throughput is indifferent to this as a determining factor.

The data compilation benchmark shows pressure on the nursery, with transient live objects frequently overflowing into tenure space and dying soon after. Each trial run of scavenger or evacuator with this benchmark incurred ~6 global collections of tenure space. I would expect this benchmark to be more sensitive to cache line containment as a determinant of benchmark throughput when run with a larger nursery, as below. This benchmark showed a marked increase in throughput relative to scavenger and other evacuator configurations in the trials reported above for Evacuator only, recursiveMaximumInsideCopyDistance=768, N=5 -- that configuration allowed recursive structures to be spun out within a stack frame indefinitely as long as the reference arc length for object copy remained <768 bytes. But it is unclear whether this was due to evacuator throughput or cache line containment, as both of these factors improved in this configuration.

This is the evacuation profile for one run of the data compilation benchmark with a larger nursery (-Xms=-Xmx=1g -Xmn640m) on an 8-core x86-64 vm running Ubuntu 16.04:

image

And this is the stack activation profile for this run:

image

Still showing significant stack overflow as before (object linkage is orthogonal to age).

The stack activation histogram below shows that most of the stack overflow for this benchmark has been eliminated with recursiveMaximumInsideCopyDistance=131072, which allows linked structures to spin out within frames until stack whitespace is depleted and renewed. In this configuration, most of the material encapsulated by linked structures is copied inside the stack, with up to 128kb copied inside each frame. From the stack activation histogram it can be seen that almost all of the recursive structures in this workload encapsulate <1.5mb and are completely copied within the first 10 stack frames. This relieves stack overflow but leaves the problem of parallelizing these structures unsolved for now, so a relatively large volume is now flushed to outside copyspaces to support stalling evacuators.

image

I ran a 5x{scavenger,evacuator} trial with this benchmark alone using the same 8-core platform (rhel6x64cudavm1) and SDK as for WIP Evacuator: Options for evacuator tuning (instrumented) with nursery size doubled to 640m for the scavenger and evacuator runs and recursiveMaximumInsideCopyDistance=768 for the evacuator runs. These are the results:

recursiveMaximumInsideCopyDistance=768, N=5
configuration scan-gb %tenured scan-kb/ms %cpu %inside %<64b %<128b %<256b
scavenger 27.642 .031 1373.114 99.960 93.087 2.804 6.236 12.824
evacuator -Xmn640m 28.353 .025 1601.669 99.964 86.239 10.800 22.587 39.363
evacuator -Xmn320m 66.418 14.969 1637.534 99.963 85.706 11.041 22.525 38.891

Percentage difference in total benchmark throughput (scores) vs scavenger with the 640m nursery (100*(evacuator-scavenger)/scavenger): 2.613%. Relative to the previous evacuator-only results from the earlier trial with the 320m nursery, benchmark throughput increased by 12.484% and gc throughput decreased by 1.019%, while cache line containment is about the same.

With the larger nursery size the volume of live material flushed outside the stack is about the same for the evacuator trial. As things stand, this can be reduced by selecting a larger value for recursiveMaximumInsideCopyDistance. A value of 128kb would permit up to 4mb of material to be spun out on the stack before overflowing. This will improve when evacuator learns to recognize and parallelize recursing structures. That's on the evacuator to-do list, and I have a plan that will be implemented later.

ktbriggs commented 6 years ago

I'm taking a hiatus from evacuator for the time being to focus on some writing and to prepare for some upcoming workshops. Stay tuned -- when I get back to it I will try a different configuration that will relieve some of the pressure on the stack to improve gc throughput while still keeping pretty good cache line containment. In outline, this configuration will send all array objects to outside copyspaces (deferring scanning until they are pulled into the bottom frame of an evacuator stack) and push only scalar objects that escape the region being scanned (ie parent in tenure and child in survivor, or vice versa), while allowing inside copying subject only to the recursiveMaximumInsideCopyDistance parameter value.

ktbriggs commented 6 years ago

I have reset the evacuator branch in my github and local OMR repos to the WIP Evacuator: Interleave survivor/tenure scanspaces (instrumented) commit and similarly reset the evacuator branch in my github and local OpenJ9 repos to the contemporaneous WIP Evacuator: Recursively scan root/remembered referent objects commit.

These commits correspond to the build that has produced the best results on the test benchmarks and I am reporting these results publicly in a poster expo at CASCON 2018 along with cloning coordinates for these stable evacuator branches and for OpenJDK in case anyone wants to build it themselves.

Follow-on commits reported here, and ongoing evacuator work, have been moved to the evacuator-wip branches in my github and local OMR and OpenJ9 repos.

ktbriggs commented 5 years ago

Apologies for the interminable ktbriggs added a commit ... notices above. I've squashed all of the commits subsequent to the commit submitted for the original (closed) OMR integration pull request, in anticipation of a new pull request for this work. I'm putting a wrap on evacuator development at this milestone, modulo any bug fixes that come out of a more complete round of testing of OpenJ9/Linux SDKs for x86-64, powerpc and z-series platforms, and will submit new PRs for the OMR and OpenJ9 components of this work when testing is complete.

OpenJ9 Integration

This commit includes all changes to evacuator since the OMR integration as detailed in the commit message. In summary, these changes enable root, remembered, and clearable objects to be scanned immediately on the stack and allow survivor and tenure stack frames to be interleaved on the stack.

Evacuator is selected for nursery collection in OMR runtimes if MM_GCExtensionsBase::evacuatorEnabled and MM_GCExtensionsBase::scavengerEnabled are both true when the OMR heap and collector are initialized. Stack geometry (number and width of frames) is configurable via parameterized settings as is the object size threshold for objects copied inside stack frames. Depth-first scanning of objects copied inside the stack is obtained by selecting 0-width frames. Breadth-first scanning is forced by selecting a single-frame stack. Selecting a large object size threshold with a smaller frame width admits objects larger than the frame width to be pushed on the stack and scanned, pushing each copied referent up the stack.

The evacuator algorithm has been revised extensively since OMR integration and source comparisons between these two commits will be unhelpful. Reviewers may find it more convenient to clone the evacuator branches in my omr and openj9 repos and review these changes in a git-enabled IDE. Comments can then be posted here or held for posting in the forthcoming evacuator PRs for omr and openj9.

As of this commit evacuator consistently passes all OpenJ9 tests on x86-64 platforms, and all platform/gold tests on internal J9 Java builds, with evacuator preselected as default algorithm for the gencon GC policy.

Benchmarking Evacuator vs Scavenger

Below are test results from a selection of benchmarks (transactional, data compilation, database processing) as of this commit. Benchmarks were selected on the basis of frequency of evacuation of small objects (not more than 256 bytes in size) and included a range of common workload types:

benchmark small minutes/run
transactional ~97% ~40
data compilation ~75% ~4
database processing ~50% ~6

These results include 10 {scavenger, evacuator 4k} runs for each benchmark with the default 4096 byte evacuator stack frame width and 10 {scavenger, evacuator 0k} runs with 0 stack frame width, which forces depth-first scan/copy stack operation. For these 2 evacuator configurations objects up to 256 bytes in size were allowed to be copied inside stack frames; larger objects were copied breadth-first to outside copyspaces. A third set of 10 {scavenger, evacuator 128k-0k} runs allowed objects up to 128 kilobytes in size to be copied inside 0-width stack frames, which also forces depth-first scan/copy stack operation. All evacuator runs used the default number of stack frames (32).

run configuration (N) stack depth stack width max inside copy size
scavenger (30) n/a n/a n/a
evacuator-4k (10) 32 4096 256 bytes
evacuator-0k (10) 32 0 256 bytes
evacuator-128k-0k (10) 32 0 128 kilobytes

Scavenger and evacuator ran from the identical OpenJ9 Java9 JVM image with identical command options, except to select the scavenger or evacuator algorithm for gencon and to set evacuator operational parameters as shown in the table above. Heap size was fixed at -Xms=-Xmx=2g for the transactional benchmark, 1280m for the data compilation and database processing benchmarks, with default nursery size except for -Xmn=640m for data compilation. All benchmarks were interleaved as (evacuator scavenger) x 10 without competing user processes on the same 8-core physical machine, for each of the 3 platforms listed below (Linux x86-64, Linux Power 8E, Linux System Z).

There were no differences in code or configuration between scavenger runs, and all scavenger results from 30 runs for each benchmark have been aggregated for comparison with the results from the respective 10 evacuator runs. For the GC-related metrics the values in the chart data labels represent the mean of >10,000 metric values sampled per nursery collection. The (x% / y% / z%) percentages in the label on the X-axis for each metric relate to the magnitude of the difference between evacuator and scavenger mean metric values from the 3 evacuator configurations. These percentages are calculated as 100*(evacuator-scavenger)/scavenger or evacuator-scavenger using the metric values shown in the chart data labels.

GC Throughput

This is the rate at which the evacuation space is emptied per nursery collection.

Linux x86-64 (8 cores, rhel6x64cudavm1.canlab.ibm.com)

image

Linux Power 8E (8 cores, rhel7levm5.ottawa.ibm.com)

image

Linux System Z (8 cores, j9s12z8.torolab.ibm.com)

image

GC Time

This is the average time spent in each nursery collection.

Linux x86-64 (8 cores, rhel6x64cudavm1.canlab.ibm.com)

image

Linux Power 8E (8 cores, rhel7levm5.ottawa.ibm.com)

image

Linux System Z (8 cores, j9s12z8.torolab.ibm.com)

image

Cache Line Containment

This is the percentage of objects evacuated with cache line containment. An evacuation reference arc from S (referring slot) to R (referent) has N-byte cache line containment if the exclusive OR of the addresses of S and R is < N (N = 64, 128, 256 for the x86-64, Power 8, and System Z platforms respectively).

The numerator for these percentages is the number of objects evacuated with cache line containment and the denominator is the total number of objects evacuated, per collection.
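For concreteness, a minimal sketch of the containment test (not taken from the evacuator source; names are illustrative):

```cpp
#include <cstdint>

/* An arc from slot S to referent R is N-byte contained when (S XOR R) < N:
 * for power-of-two N this means the two addresses agree in all bits above
 * the low-order lg(N) bits, i.e. both fall within the same N-byte-aligned block.
 */
static bool
isCacheLineContained(const void *slot, const void *referent, uintptr_t lineSize)
{
	uintptr_t s = reinterpret_cast<uintptr_t>(slot);
	uintptr_t r = reinterpret_cast<uintptr_t>(referent);
	return (s ^ r) < lineSize; /* lineSize = 64, 128 or 256 depending on platform */
}
```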

Linux x86-64 (8 cores, rhel6x64cudavm1.canlab.ibm.com)

image

Linux Power 8E (8 cores, rhel7levm5.ottawa.ibm.com)

image

Linux System Z (8 cores, j9s12z8.torolab.ibm.com)

image

CPU Utilization

This is the percentage of CPU time consumed by active scavenger/evacuator gc threads over the duration of each run. These were calculated as 100 × (1 − (ΣWx / (8 × ΣDy))), where Wx ranges over the wait times for gc threads stalled during nursery collections and Dy ranges over the durations of nursery collections.
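A minimal sketch of that calculation (illustrative, not the instrumentation code), assuming 8 gc threads:

```cpp
#include <numeric>
#include <vector>

/* cpu% = 100 * (1 - (sum of gc thread stall times) / (8 * sum of nursery collection durations)) */
static double
gcCpuUtilizationPercent(const std::vector<double> &stallTimes,      /* Wx: stall intervals for gc threads */
                        const std::vector<double> &collectionTimes, /* Dy: nursery collection durations */
                        double gcThreadCount = 8.0)
{
	double totalStall = std::accumulate(stallTimes.begin(), stallTimes.end(), 0.0);
	double totalDuration = std::accumulate(collectionTimes.begin(), collectionTimes.end(), 0.0);
	return 100.0 * (1.0 - (totalStall / (gcThreadCount * totalDuration)));
}
```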

Linux x86-64 (8 cores, rhel6x64cudavm1.canlab.ibm.com)

image

Linux Power 8E (8 cores, rhel7levm5.ottawa.ibm.com)

image

Linux System Z (8 cores, j9s12z8.torolab.ibm.com)

image

Benchmark Scores

This shows the percentage difference between benchmark scores for scavenger and evacuator runs. Each benchmark produces a single numeric score representing its throughput as of the end of each run.

These results should not be taken too literally, as all of these differences are roughly within one standard deviation of the respective set of scores.

Linux x86-64 (8 cores, rhel6x64cudavm1.canlab.ibm.com)

image

Linux Power 8E (8 cores, rhel7levm5.ottawa.ibm.com)

image

Linux System Z (8 cores, j9s12z8.torolab.ibm.com)

image

ktbriggs commented 5 years ago

I am capping work on evacuator at this point. Most of the viable experiments outlined in the first comment above have been concluded, with the exception of reversing the order of the mutator thread stack walk (currently top-down; my conjecture is that bottom-up traversal would improve parallelism...). Most of the work done since the last report has been focused on optimizing and streamlining evacuator scan stack operation to improve depth-first performance, conserve whitespace, and reduce compiled code size with more judicious inlining of methods. I will be updating the evacuator design (#2361) to reflect changes and new features introduced since it was written several months ago and will include discussion interpreting the results reported here.

I ran a comprehensive suite of benchmarks with the most recent build, including the transactional, data compilation, and database processing benchmarks that were run against earlier iterations. I have added another, more processing-intensive benchmark that I refer to as enterprise in the tables below. For this benchmark I have included a range of run configurations in order to demonstrate the effects of the key evacuator runtime options inside copy size and distance (see Legend in the figure below). Depth-first scanning is obtained with inside copy distance 0, and I have added run results for depth-first transactional, data compilation and database processing benchmarks as well. Evacuator defaults for inside copy size/distance are 256/4096 bytes.

Note: The Legend for cpu% -- 30 cores applies to the enterprise benchmark only. The other three benchmarks ran with 8 cores enabled.

image

The enterprise benchmark was run twice for each run configuration and the results reported above are averages over the two runs. This benchmark takes about 2 hours to run and produces two scores: score-A is sensitive to application throughput; score-B is sensitive to GC throughput. The enterprise runs were executed on a single NUMA node (node 0 of lnxem64tvm10-resv.ottawa.ibm.com, 30 cores, 64gb RAM) using a 56gb heap with the default nursery size (15.5gb).

The other benchmarks were run 5 times in each run configuration and these results are averaged over the 5 runs. These were executed on a single NUMA node (node 0 of lnxem64tvm6-resv.ottawa.ibm.com, 8 cores, 9gb RAM). Allocated heap/nursery sizes were as for previously reported runs (transactional 2048/512mb, data compilation 1280/640mb, database processing 1280/320mb).

Total compiled code size for net new code introduced by this change, as determined by nm -S --size-sort -t d Evacuator*.o, is 67kb.

ktbriggs commented 5 years ago

With some tweaking to improve whitelist utilization, and with the overflow threshold for trimming oversized remainder whitespace from copyspaces increased from 3 to 8, I obtained the results shown below. Whitespace remainders in scan stack frames are required to be <32 bytes before they can be trimmed and discarded. In previous builds, objects ≥32 bytes that would normally be copied inside the stack would be copied to an outside copyspace if they did not fit in the remaining whitespace. For outside copyspaces, whitespace remainders were never trimmed until they were reduced to <768 bytes or the overflow threshold was breached (8 × minimum workspace size; remainders ≥768 bytes are trimmed only if more than this amount of material has overflowed the copyspace).

In this build, if an object smaller than the maximal inside copy size does not fit in the stack whitespace remainder and the whitelist holds a larger whitespace fragment that can accommodate the object, the stack whitespace remainder is trimmed and swapped into the whitelist and the top of the whitelist is swapped into the stack to receive the object. This has the effect of reducing the whitelist volume, as desired, and allowing inside copy to continue. Similarly, if an outside copyspace in an overflow state with ≥768 whitespace bytes remaining is presented with an object that will not fit in the remaining whitespace, the remainder can be trimmed and swapped into the whitelist if the top (largest) fragment on the whitelist can receive the object.
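Roughly, the stack-side swap works like the sketch below (illustrative types and names, not the evacuator source; the real whitelist and whitespace bookkeeping are more involved):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Whitespace { uintptr_t base; uintptr_t length; };

/* toy whitelist kept sorted largest-first */
struct Whitelist {
	std::vector<Whitespace> fragments;

	void add(Whitespace w) {
		fragments.push_back(w);
		std::sort(fragments.begin(), fragments.end(),
				[](const Whitespace &a, const Whitespace &b) { return a.length > b.length; });
	}
	bool topFits(uintptr_t size) const { return !fragments.empty() && (fragments.front().length >= size); }
	Whitespace pop() { Whitespace w = fragments.front(); fragments.erase(fragments.begin()); return w; }
};

/* If the inside-copy candidate does not fit the current stack whitespace remainder
 * but the largest whitelist fragment can hold it, trim the remainder into the
 * whitelist and continue inside copying in the swapped-in fragment.
 */
static bool
trySwapStackWhitespace(Whitespace &stackRemainder, Whitelist &whitelist, uintptr_t objectSize)
{
	if ((objectSize > stackRemainder.length) && whitelist.topFits(objectSize)) {
		Whitespace incoming = whitelist.pop(); /* largest free fragment receives the object */
		whitelist.add(stackRemainder);         /* trimmed stack remainder returns to the whitelist */
		stackRemainder = incoming;
		return true;
	}
	return false;
}
```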

Another change introduced with this build was to use the volume of material evacuated to survivor space in the current collection as the best guess for the projected survivor volume for the next nursery collection, if that volume was less than the size of survivor space configured for the next collection; otherwise the latter value is used as the projected volume. This was introduced to accommodate applications like the database processing benchmark, which presents very small live sets to the nursery collector at steady state (<1% of configured survivor size, which is constrained to be at least 10% of nursery size). The evacuator controller begins to aggressively reduce whitespace tlh allocation sizes when the total volume of material copied into survivor space exceeds 80% of the projected surviving volume. Without this change, evacuator threads involved in this benchmark received maximal tlh allocations right up to the end of the nursery collection and tended to be left with large whitespace remainders at the end.
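A minimal sketch of the projection heuristic and the throttling trigger as described above (names illustrative; controller bookkeeping and thread coordination omitted):

```cpp
#include <algorithm>
#include <cstdint>

/* projected survivor volume for the next collection: last collection's survivor
 * copy volume, capped at the survivor space size configured for the next collection */
static uintptr_t
projectedSurvivorVolume(uintptr_t previousSurvivorCopyVolume, uintptr_t configuredSurvivorSize)
{
	return std::min(previousSurvivorCopyVolume, configuredSurvivorSize);
}

/* the controller starts shrinking whitespace tlh allocations once the copied
 * survivor volume passes 80% of the projection */
static bool
shouldReduceTlhAllocationSize(uintptr_t survivorVolumeCopiedSoFar, uintptr_t projectedVolume)
{
	return survivorVolumeCopiedSoFar > ((projectedVolume * 8) / 10);
}
```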

The enterprise benchmark was run 3x2 times interleaving evacuator and scavenger on lnxem64tvm10-resv.ottawa.ibm.com configured as for the January 31 2019 trials reported above. The other three benchmarks were run 5x2 times interleaving evacuator and scavenger on lnxem64tvm7-resv.ottawa.ibm.com configured as for January 31 2019 trials. This machine has the same hardware architecture (Intel(R) Xeon(R) CPU X5570@2.93GHz) and NUMA configuration as lnxem64tvm6-resv.ottawa.ibm.com (modulo different RAM sizes not material to these trials), which was used for these benchmarks in the January 31 trials.

image

I've included the inside% stats for evacuator. This metric measures the percentage of objects that are copied inside stack frames vs copied to outside copyspaces. Note that improvement in benchmark primary scores (score-A%) tends to be directly, if not linearly, related to evacuator inside% (correlation coefficient 0.83) and inversely related to evacuator cache-64% (correlation coefficient -0.28). This suggests that the relative volume of inside copying is a stronger determinant of benchmark performance than cache-line containment. A likely explanation is that evacuator threads are able to discover new scan work as they recurse into objects copied ahead within stack frames, and so perform a greater proportion of work without drawing from their own or other evacuators' worklists.

I think that a similar process explains the advantage of scavenger's hierarchical copying vs breadth-first copying. Scavenger's deferred scancache acts in a similar manner to evacuator's bottom stack frame. When a scavenger thread runs out of scan work it will pull in the deferred scancache and start scanning where it previously left off, but will likely alias to the scancache that receives the first object that it copies and commence scanning hierarchically after deferring the pulled scancache again. In a nutshell, scavenger threads perform a pseudo-random walk that covers the intersection of the reference graph and evacuation space, discovering new work in the process and pulling from worklists only when the deferred scancache and both (survivor and tenure) local scancaches have been completely scanned.

Scavenger's deferred scancache may be displaced to the worklist if another scancache must be deferred, which means that any scancache can be cycled through the worklist circuit multiple times in the process of being scanned from base to end. The inside% metric is therefore difficult to measure for scavenger, because scancaches can appear more than once on worklists, with less scan work available on each iteration. Where it is reported for previous trials it is overestimated to a greater or lesser extent.

ktbriggs commented 5 years ago

Here are final results from evacuator vs scavenger benchmarking with the most recent builds. The material changes since the last report (Feb 21, above) relate to whitespace management. In these builds I have reverted the changes described in the Feb 21 comment, reduced the tolerance for trimming remainder whitespace from outside copyspaces from 768 to 256 bytes, and permitted objects that overflow outside copyspaces to be redirected into stack whitespace if they fit in the current stack whitespace. I have also introduced 3 flush conditions, which force all objects to be copied to outside copyspaces whenever one or more of the conditions is present. The 3 flush conditions are reevaluated before each object evacuation. They are:

1- presence of one or more stalled evacuators
2- outside copyspace overflow (object size > copyspace remainder > 256 bytes)
3- stack overflow

The last flush condition forces depth-first stack operation while the stack grows and forces flushing to outside copyspaces as the stack winds down. This is intended to help parallelize evacuation of recursive structures with long recursing chains of objects such as might be encountered in, for example, XML/DOM processing contexts.
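Putting that together, the per-object check amounts to something like the following sketch (illustrative names, not the evacuator API):

```cpp
/* the three flush conditions described above, re-evaluated before each object evacuation */
struct FlushConditions {
	bool stalledEvacuators;        /* 1: one or more evacuator threads are stalled for work */
	bool outsideCopyspaceOverflow; /* 2: object size > copyspace remainder > 256 bytes */
	bool stackOverflow;            /* 3: scan stack has grown to its maximum depth */
};

/* if any condition holds, the object is copied to an outside copyspace
 * instead of inside a scan stack frame */
static bool
mustFlushToOutsideCopyspace(const FlushConditions &conditions)
{
	return conditions.stalledEvacuators || conditions.outsideCopyspaceOverflow || conditions.stackOverflow;
}
```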

I report results from 3 full sets of benchmarking runs below. Each set was run on the same machines used in previously reported benchmarking runs (e.g., Feb 21). Each set used a different build, but the changes between these builds were not material to scavenger or evacuator performance -- there were no substantial changes to scavenger, and the changes to evacuator related only to some cleanup work to rename a class and remove a couple of unused artifacts. All of these builds were internal Linux/x86_64 builds for J9 Java VM (80).

Note (cpu%, below): 30 cores/gc threads applies to the enterprise benchmarking runs only; the other benchmarks were run with 8 cores/gc threads.

Build jvmxa6480-408175:

image

Build jvmxa6480-410164:

image

Build jvmxa6480-412811:

image

Build 412811 was produced from the evacuator-wip branches of my personal omr and openj9 repos at https://github.com/ktbriggs/omr/tree/evacuator-wip and https://github.com/ktbriggs/openj9/tree/evacuator-wip. These branches differ from the stable evacuator branches in the respective repos only insofar as the ...-wip branches are instrumented to produce summary output per nursery collection; instrumentation is disabled in the stable evacuator branches. The only source code difference between the evacuator-wip and evacuator branches is that in omr/evacuator-wip

#undef EVACUATOR_DEBUG_ALWAYS

is replaced with

#define EVACUATOR_DEBUG_ALWAYS

in omr/gc/base/Evacuator.hpp. The format and content of instrumentation output are described in #2361.

As things stand evacuator is stable and working as designed in builds for Linux/x86_64 and in 64-bit builds for other platforms. There is a bug in 32-bit builds (all platforms) that I believe relates to evacuator interactions with recent scavenger changes (this bug is not present in eclipse/{openj9-omr,openj9} builds). I will not have an opportunity to debug this or perform further work on OpenJ9/OMR in the foreseeable future.

I am closing this and #2361 as of now.