Check for accidental bottlenecks

danielmitterdorfer commented 8 years ago

We should check for accidental bottlenecks in the benchmark driver, problems with system setup (like #3) etc. We should also check the overhead of different profilers (see #19) and the metrics store (see #8).

This just serves as a reminder ticket and it is expected that lots of related tickets are created after initial analysis.

Some concrete things to check:

[x] Behavior during indexing: Effects of number of indexing threads, what numbers are reported and how are they calculated? how are time intervals measured? What is the effect of different bulk sizes? Where are our bottlenecks?
[X] Latency: What is the overhead for latency measurement? Are we prone to coordinated omission (I almost certainly think so and getting experimental evidence is easier once #62 is in)? What numbers are gathered and reported? This is tracked in the follow-up ticket #64
[x] How are system metrics gathered and reported (e.g. CPU stats)? Can we cross-validate their correctness (e.g. GC times can be cross-validated with Java flight recorder)?
[X] What is the overhead of different profilers in Rally? (Postponed and separately tracked in #66)
[x] Do we account for proper warmup so we only measure a stable system?
Experimental setup

As a first step, I have recorded all HTTP requests and responses that are issued during the benchmark and have mocked the benchmark candidate with an nginx returning static responses (under the assumption that the bottleneck is now the benchmark driver).

The assumption that the benchmark driver is the bottleneck is supported by the result of the following benchmark with wrk against nginx:

dm@io:rally/performance_verification $ wrk -t8 -c8 -d128s -s post.lua http://127.0.0.1:8200/geonames/type/_bulk
Running 2m test @ http://127.0.0.1:8200/geonames/type/_bulk
  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.12ms  427.35us   2.78ms   72.78%
    Req/Sec   174.00    337.51     1.04k    88.89%
  158 requests in 2.13m, 124.35MB read

post.lua represents one bulk request with 5000 documents. So, considering that we reach on average 174 requests / second, this is the equivalent of 174 * 5.000 = 870.000 documents / second. If we reach numbers in this range, we should consider our mock Elasticsearch (nginx) the bottleneck, else Rally.

8 test threads were chosen because this is currently the maximum number of client threads that are used by Rally on the benchmark machine for this ticket (i.e. my notebook).

danielmitterdorfer commented 8 years ago

One thing that I noted is that progress reporting has introduced an accidental bottleneck so I have implemented a --quiet flag turning off all output and silencing Rally's log output too. This measure increased indexing throughput from 67677 docs/second (baseline) to 69269 docs / second on my machine (reported numbers: average across three successive runs for each setting).

Below are the raw numbers from three trial runs (all numbers in docs / second):

baseline	`--quiet`
68035	70259
67151	69379
67846	68170

This change eliminates a 2% difference in the above-described test scenario.

danielmitterdorfer commented 8 years ago

I've also noticed that over time the number of documents / second drops (remember we're not targeting ES itself but an nginx hosting static files). The image below shows how the indexing throughput (y-axis) drops over time (x-axis). The graph represents the average value of 5 trial runs with the upper and lower bound shown in light-gray. We have used 8 indexing threads on a machine with 4 physical cores (8 with hyperthreading).

docs_per_second

raw data

The problem seems to be related to multithreading in Python. A trial run with 1, 2, 4 and 8 indexing threads revealed the following behavior:

docs_per_second_varying_thread_counts

We can clearly see that single-threaded indexing throughput (a) remains stable and (b) is higher than any multi-threaded indexing throughput (raw data).

We can see two effects here:

Indexing throughput does not increase but decrease when using multiple threads. The current hypothesis is that we might be contending for the GIL (to be verified)
The indexing throughput drops over time. This effect is amplified the more indexing threads we use. The current hypothesis is that we're contending for some resource when reading the data file from multiple threads. This hypothesis is backed by the following experiment.

In order to verify hypothesis (2) above, we've replaced the original file reading code in driver.py:

def _read_records(self, documents):
    with open(documents, 'rt') as f:
        yield from f

with:

def static_data_provider(self):
    s = '{"name": "Costa de Xurius", "geonameid": 3038814, "feature_class": "T", "longitude": 1.48333, "timezone": "Europe/Andorra", "feature_code": "SLP", "dem": "1434", "country_code": "AD", "admin1_code": "00", "latitude": 42.5, "asciiname": "Costa de Xurius", "population": 0}'
    for i in range(8647880):
        yield s

def _read_records(self, documents):
    yield from self.static_data_provider()

This yields the same number of documents but it is always the same one and we don't have any file I/O. When we run the benchmark again with 8 indexing threads, indexing throughput stabilizes around 80.000 documents / second (result of 3 successive trials):

docs_per_second_mocked_file_read

raw data

Obviously, this only points to the problem but does not solve it yet. This will be tackled in the follow-up issue #58.

danielmitterdorfer commented 8 years ago

I have also evaluated whether we saturate the network connection (unlikely anyway as we just connect via loopback but I wanted to ensure it). Here are the results:

lo_saturation_bytes lo_saturation_packets

The raw data were gathered using netstat -i 1 -I lo0 during trial runs with Rally. We chose 1 and 8 indexing threads. What is odd is that we cannot detect the same pattern on network level that we experienced above.

I have also evaluated maximum throughput of the loopback interface on my machine using iperf3 by invoking iperf3 -s (server mode) iperf3 -c 127.0.0.1 -t 128 (client mode):

[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-128.00 sec   707 GBytes  47.4 Gbits/sec                  sender
[  4]   0.00-128.00 sec   707 GBytes  47.4 Gbits/sec                  receiver

During the trial I have also run netstat -i 1 -I lo0. On average we received 508617 packets (5909641728 bytes) and sent 508617 packets (5909640028 bytes) with the same sampling frequency as for the other trials runs (once per second). In conclusion we reach a utilization of around 1% during trial runs and are not saturating the network interface at all (as expected).

mikemccand commented 8 years ago

A few quick comments:

Be careful only sending the same doc over and over: yes, this avoid client side IO, but this also is much less work for Lucene since the same terms keep showing up over and over, since compression will work incredibly well on the stored fields, etc. So it's not really a proper A/B test.
It's curious you don't see any gains going from 1 to 2 client side threads ... in the past I've seen sizable gains here (though, on 24 and 72 core box). Since at defaults there are 5 shards, a single client thread is using 5 way concurrency on the server, so maybe 1 thread is plenty for your 4 core (8 w/ HT) box?
I would be very surprised if GIL was the issue here, unless IO is very slow (OS is unable to pre-cache the docs source quickly enough). If you make a python client that simply reads from the file using N threads (and does no indexing), what throughput do you get?
It is expected that ingest rate will taper over time, as merges wake up and steal a good chuck of IO/CPU. However, this effect should stabilize and the ingest rate should flatten. In the 1 thread case it could be you are not saturating the full resources of this box and so the merges look "free" (steal no resources from the 1 (--> 5 way) indexing). But this counters my 2nd bullet ;)

danielmitterdorfer commented 8 years ago

Many thanks for your feedback! :) A few comments:

Be careful only sending the same doc over and over: yes, this avoid client side IO, but this also is much less work for Lucene

This is not a problem in this scenario because I took the benchmark candidate completely out of the system. I have setup an nginx which just returns static responses as fast as it can. This is to stress the benchmark driver (i.e. Rally) so I can find bottlenecks in Rally itself.

It's curious you don't see any gains going from 1 to 2 client side threads

I think this is also due to benchmark setup (stressing the benchmark driver as much as possible). If I'd have just a single Elasticsearch node, it would probably be the bottleneck and thus we might gain something out of increasing the number of threads.

I would be very surprised if GIL was the issue here, unless IO is very slow (OS is unable to pre-cache the docs source quickly enough). If you make a python client that simply reads from the file using N threads (and does no indexing), what throughput do you get?

I am about to do that and also contrast this with multiprocessing (as opposed to multithreading) in preparation of #58.

It is expected that ingest rate will taper over time, as merges wake up and steal a good chuck of IO/CPU. However, this effect should stabilize and the ingest rate should flatten. In the 1 thread case it could be you are not saturating the full resources of this box and so the merges look "free" (steal no resources from the 1 (--> 5 way) indexing). But this counters my 2nd bullet ;)

Thanks! I'll check this once I benchmark again with the benchmark candidate instead of nginx. I want to pin the benchmark candidate to different cores as Rally and look what happens then (but for this I have to switch to Linux as OS X does not seem to allow pinning).

danielmitterdorfer commented 8 years ago

I have measured warmup times with a new JIT profiler (see #43).

jit_hist

The diagram shows the number of JIT compile events (regardless whether it is a C1 or a C2 event). We can clearly see that after around 3 minutes the number of compile events approaches zero and are stable. 11 minutes into the benchmark we have another spike which is caused by switching from indexing to the search benchmark which obviously changes the runtime profile of the benchmark candidate.

From this graph, I'd conclude that the warmup times for the search benchmark is too short but further investigation is needed.

This timeline has been produced as follows:

Run a regular benchmark with the new Rally JIT compiler: esrally --telemetry=jit
Analyze the JIT compiler log with JITWatch: ./launchHeadless.sh -t $JIT_OUT_FILE > es-timeline.txt
Postprocess the output with the Python script below
Load the data in R and create the line chart above

import sys

current_bucket = ""
bucket_count = 0
idx = 0
print("idx,time,num_compile_events")
for line in sys.stdin:
    if line[13:21] == "Compiled":
        bucket = line[:8]
        if current_bucket == bucket:
            bucket_count += 1
        else:
            if not current_bucket == "":
                print("%d,%s,%d" % (idx, current_bucket, bucket_count))
            current_bucket = bucket
            bucket_count = 1
            idx += 1

danielmitterdorfer commented 8 years ago

We should also investigate why we get different segment counts when we run the same benchmark on a specific Elasticsearch version. Here is an example of the variance in index segment counts we get when running the same benchmark five times: 123, 109, 102, 105, 96 (related to #52)

danielmitterdorfer commented 8 years ago

Regarding my previous comment, the issue is related to not doing a force-merge after indexing. We do this now by default (see #52) but leave the user a choice (can be configured in track setup).

danielmitterdorfer commented 8 years ago

Verification of reported GC times

I did a cross-validation of the reported GC times and have checked four data sources:

The GC times as reported by Rally on the command line
The GC times as reported by the node stats API (GET _nodes/stats/_all?level=shards)
Total GC times as reported by Java Flight Recorder
GC times as reported by the GC log

Test setup

I ran an index-only benchmark with esrally --pipeline=benchmark-only --track-setup=4gheap

Also, I started Elasticsearch with the following JVM options

`-XX:+UnlockDiagnosticVMOptions -XX:+UnlockCommercialFeatures -XX:+DebugNonSafepoints -XX:+FlightRecorder -XX:FlightRecorderOptions=defaultrecording=true,disk=true,dumponexit=true,dumponexitpath=/Users/dm/recording.jfr -Xms256m -Xmx1g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=logs/heapdump.hprof -Delasticsearch -ea -Des.path.home=/Users/dm/Projects/elasticsearch/data/master -Des.node.attr=test -Xloggc:es-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime```

Note that contrary to the track setup name I ran with just 1 GB. This is not relevant here; I just needed an index-only benchmark. This would only matter if Rally had provisioned the cluster (which it did not).

Results

Here are the reported numbers:

young generation GC time:

Rally (command line)	Node Stats API	Java Flight Recorder	GC log
79,416 ms	79,452 ms (18,450 collections)	89,003 ms (18,450 collections)	80,853 ms (18,451 collections)

old generation GC time:

Rally (command line)	Node Stats API	Java Flight Recorder	GC log
23,964 ms	23,964 ms (500 collections)	156,630 ms (500 collections)	23,989 ms (500 collections)

Note that Java Flight Recorder is off by around 10 seconds for the total young gen GC time and around 133 seconds for the total old gen GC time. It is not yet clear why the difference is so high. Summing up all individual reported old-gen events from Java Flight Recorder produces 23.288 ms (excluding the CMS initial mark phase) and 24.031 ms (including the CMS initial mark phase) which is much more closer to the other numbers.

Also the cross-validation of the GC log with events from Java Flight Recorder shows similar times:

2016-03-02T15:21:45.462-0100: 3.495: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2882K(174784K)] 9013K(253440K), 0.0009459 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2016-03-02T15:21:45.463-0100: 3.496: Total time for which application threads were stopped: 0.0010216 seconds, Stopping threads took: 0.0000277 seconds
2016-03-02T15:21:45.463-0100: 3.496: [CMS-concurrent-mark-start]
2016-03-02T15:21:45.467-0100: 3.500: [CMS-concurrent-mark: 0.004/0.004 secs] [Times: user=0.02 sys=0.00, real=0.00 secs] 
2016-03-02T15:21:45.467-0100: 3.500: [CMS-concurrent-preclean-start]
2016-03-02T15:21:45.468-0100: 3.500: [CMS-concurrent-preclean: 0.001/0.001 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
2016-03-02T15:21:45.468-0100: 3.500: Application time: 0.0043979 seconds
2016-03-02T15:21:45.468-0100: 3.501: [GC (CMS Final Remark) [YG occupancy: 6130 K (78656 K)]2016-03-02T15:21:45.468-0100: 3.501: [Rescan (parallel) , 0.0011471 secs]2016-03-02T15:21:45.469-0100: 3.502: [weak refs processing, 0.0000350 secs]2016-03-02T15:21:45.469-0100: 3.502: [class unloading, 0.0067130 secs]2016-03-02T15:21:45.476-0100: 3.508: [scrub symbol table, 0.0030474 secs]2016-03-02T15:21:45.479-0100: 3.512: [scrub string table, 0.0002564 secs][1 CMS-remark: 2882K(174784K)] 9013K(253440K), 0.0116463 secs] [Times: user=0.04 sys=0.00, real=0.02 secs] 
2016-03-02T15:21:45.480-0100: 3.512: Total time for which application threads were stopped: 0.0118244 seconds, Stopping threads took: 0.0001361 seconds
2016-03-02T15:21:45.480-0100: 3.512: [CMS-concurrent-sweep-start]
2016-03-02T15:21:45.481-0100: 3.514: [CMS-concurrent-sweep: 0.002/0.002 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
2016-03-02T15:21:45.481-0100: 3.514: [CMS-concurrent-reset-start]
2016-03-02T15:21:45.486-0100: 3.518: [CMS-concurrent-reset: 0.004/0.004 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
2016-03-02T15:21:46.027-0100: 4.060: Application time: 0.5478287 seconds

This corresponds to the following GC event in Java Flight Recorder:

Event Type	Name	Duration
GC Phase Pause	Initial Mark	966.021 µs
GC Phase Pause	Final Mark	11,663.502 µs

So, we have 966 µs (JFR) vs. 946 µs (GC log) for the initial mark phase and 11.663 ms (JFR) vs. 11.824 ms (GC log).

Summary

The gist of this analysis is:

GC times as reported by Rally are accurate enough
Be wary when looking at total GC times in Java Flight Recorder

danielmitterdorfer commented 8 years ago

So I did another run with the serial collector (i.e. -XX:+UseSerialGC) as I suspected that JFR is showing the total time for a concurrent collector (CMS) and indeed JFR's total times are now in line with other data sources (just showing the node stats and JFR here for brevity):

Source	young generation GC time	young generation GC events	old generation GC time	old generation GC events
JFR	123,933 ms	9,929	47,055 ms	228
Node Stats	124,208 ms	9,929	47,246 ms	228

This supports the theory that the total times reported in JFR are rather total GC run time instead of the total time that the application is paused due to GC (i.e. for a concurrent collector, this includes also the time the GC runs concurrently with application threads) or graphically put (red == GC threads, black == application threads):

cms-gc

The total times in JFR include all four phases for CMS, whereas for individual events only the "initial mark" and "remark" phases are reported (which are the only phases where all application threads are halted).

danielmitterdorfer commented 8 years ago

I've talked to the Oracle Mission Control team regarding the differences between Mission Control and the GC log. It turns out that the "Total GC Time" as reported by Oracle Mission Control 5.5 is the difference between the start of a GC run until the end (also including the time the GC is running concurrently with the application).

elastic / rally