One thing that I noted is that progress reporting has introduced an accidental bottleneck, so I have implemented a `--quiet` flag that turns off all output and also silences Rally's log output. This measure increased indexing throughput from 67677 docs/second (baseline) to 69269 docs/second on my machine (reported numbers: average across three successive runs for each setting).
Below are the raw numbers from the three trial runs (all numbers in docs/second):

baseline | `--quiet` |
---|---|
68035 | 70259 |
67151 | 69379 |
67846 | 68170 |
Using `--quiet` thus eliminates an overhead of roughly 2% in the test scenario described above.
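For reference, the averages and the relative improvement can be reproduced from the raw numbers above with plain Python (nothing Rally-specific):

```python
# Averages and relative improvement computed from the raw trial numbers above.
baseline = [68035, 67151, 67846]
quiet = [70259, 69379, 68170]

baseline_avg = sum(baseline) / len(baseline)  # ~67677 docs/second
quiet_avg = sum(quiet) / len(quiet)           # ~69269 docs/second

improvement = (quiet_avg - baseline_avg) / baseline_avg
print("baseline: %.0f docs/s, --quiet: %.0f docs/s, improvement: %.1f%%"
      % (baseline_avg, quiet_avg, improvement * 100))
```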
I've also noticed that over time the number of documents/second drops (remember we're not targeting ES itself but an nginx hosting static files). The image below shows how the indexing throughput (y-axis) drops over time (x-axis). The graph represents the average value of 5 trial runs, with the upper and lower bound shown in light gray. We used 8 indexing threads on a machine with 4 physical cores (8 with hyperthreading).
The problem seems to be related to multithreading in Python. A trial run with 1, 2, 4 and 8 indexing threads revealed the following behavior:
We can clearly see that single-threaded indexing throughput (a) remains stable and (b) is higher than any multi-threaded indexing throughput (raw data).
We can see two effects here:
In order to verify hypothesis (2) above, we've replaced the original file reading code in `driver.py`:
```python
def _read_records(self, documents):
    with open(documents, 'rt') as f:
        yield from f
```
with:
```python
def static_data_provider(self):
    s = '{"name": "Costa de Xurius", "geonameid": 3038814, "feature_class": "T", "longitude": 1.48333, "timezone": "Europe/Andorra", "feature_code": "SLP", "dem": "1434", "country_code": "AD", "admin1_code": "00", "latitude": 42.5, "asciiname": "Costa de Xurius", "population": 0}'
    for i in range(8647880):
        yield s

def _read_records(self, documents):
    yield from self.static_data_provider()
```
This yields the same number of documents, but it is always the same document and we do not have any file I/O. When we run the benchmark again with 8 indexing threads, indexing throughput stabilizes around 80,000 documents/second (result of 3 successive trials):
Obviously, this only points to the problem but does not solve it yet. This will be tackled in the follow-up issue #58.
I have also evaluated whether we saturate the network connection (unlikely anyway, as we just connect via loopback, but I wanted to make sure). Here are the results:
The raw data were gathered using `netstat -i 1 -I lo0` during trial runs with Rally. We chose 1 and 8 indexing threads. What is odd is that we cannot detect the same pattern at the network level that we experienced above.
I have also evaluated the maximum throughput of the loopback interface on my machine using iperf3, by invoking `iperf3 -s` (server mode) and `iperf3 -c 127.0.0.1 -t 128` (client mode):

```
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-128.00 sec   707 GBytes  47.4 Gbits/sec  sender
[  4]   0.00-128.00 sec   707 GBytes  47.4 Gbits/sec  receiver
```
During the iperf3 trial I have also run `netstat -i 1 -I lo0`. On average we received 508617 packets (5909641728 bytes) and sent 508617 packets (5909640028 bytes) per sampling interval, with the same sampling frequency as for the other trial runs (once per second). In conclusion, we reach a utilization of around 1% during trial runs and are not saturating the network interface at all (as expected).
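Just to make the utilization estimate explicit, here is a rough sketch of the calculation. Only the capacity figure comes from the iperf3 run above; the per-second byte count for the Rally trial runs is a hypothetical placeholder (the real samples are not reproduced here):

```python
# Rough loopback utilization estimate.
# Capacity: ~5909641728 bytes/s observed via netstat during the iperf3 run,
# which matches the reported 47.4 Gbit/s.
capacity_bytes_per_s = 5909641728

# Hypothetical per-second byte count sampled with netstat during a Rally trial run.
observed_bytes_per_s = 60000000

print("loopback utilization: %.1f%%"
      % (100.0 * observed_bytes_per_s / capacity_bytes_per_s))  # ~1.0%
```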
A few quick comments:
Many thanks for your feedback! :) A few comments:
> Be careful only sending the same doc over and over: yes, this avoids client-side IO, but this is also much less work for Lucene.

This is not a problem in this scenario because I took the benchmark candidate completely out of the system. I have set up an nginx which just returns static responses as fast as it can. This is to stress the benchmark driver (i.e. Rally) so I can find bottlenecks in Rally itself.
> It's curious you don't see any gains going from 1 to 2 client-side threads.

I think this is also due to the benchmark setup (stressing the benchmark driver as much as possible). If I had just a single Elasticsearch node, it would probably be the bottleneck and thus we might gain something from increasing the number of threads.
> I would be very surprised if the GIL was the issue here, unless IO is very slow (the OS is unable to pre-cache the docs source quickly enough). If you make a Python client that simply reads from the file using N threads (and does no indexing), what throughput do you get?

I am about to do that and will also contrast this with multiprocessing (as opposed to multithreading) in preparation of #58.
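A minimal sketch of what such a read-only throughput test could look like (the file path and worker counts are placeholders; this is not Rally code, and each worker simply reads the whole file):

```python
# Read-only throughput test: read the documents file with N workers and
# report documents/second, once with threads and once with processes.
import concurrent.futures
import time

DOCUMENTS = "documents.json"  # placeholder path to the benchmark's source file


def read_all(path):
    # Read the whole file line by line and count the documents.
    count = 0
    with open(path, "rt") as f:
        for _ in f:
            count += 1
    return count


def measure(executor_cls, workers, path):
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as executor:
        counts = list(executor.map(read_all, [path] * workers))
    elapsed = time.perf_counter() - start
    return sum(counts) / elapsed


if __name__ == "__main__":
    for workers in (1, 2, 4, 8):
        threads = measure(concurrent.futures.ThreadPoolExecutor, workers, DOCUMENTS)
        processes = measure(concurrent.futures.ProcessPoolExecutor, workers, DOCUMENTS)
        print("%d workers: %10.0f docs/s (threads), %10.0f docs/s (processes)"
              % (workers, threads, processes))
```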
> It is expected that the ingest rate will taper over time, as merges wake up and steal a good chunk of IO/CPU. However, this effect should stabilize and the ingest rate should flatten. In the 1-thread case it could be that you are not saturating the full resources of this box and so the merges look "free" (steal no resources from the 1 (--> 5 way) indexing). But this counters my 2nd bullet ;)

Thanks! I'll check this once I benchmark again against the benchmark candidate instead of nginx. I want to pin the benchmark candidate to different cores than Rally and see what happens then (but for this I have to switch to Linux, as OS X does not seem to allow pinning).
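For the Rally side of the pinning, a sketch of what this could look like on Linux (the core set is just an example; the Elasticsearch process itself would rather be pinned externally, e.g. with `taskset`):

```python
# Linux only: restrict the current process (e.g. the benchmark driver) to a
# fixed set of CPU cores. The core set below is only an illustration.
import os

os.sched_setaffinity(0, {0, 1, 2, 3})  # pid 0 means "the calling process"
print("running on cores:", sorted(os.sched_getaffinity(0)))
```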
I have measured warmup times with a new JIT profiler (see #43).
The diagram shows the number of JIT compile events (regardless of whether it is a C1 or a C2 event). We can clearly see that after around 3 minutes the number of compile events approaches zero and stays stable. 11 minutes into the benchmark there is another spike, which is caused by switching from indexing to the search benchmark; this obviously changes the runtime profile of the benchmark candidate.
From this graph, I'd conclude that the warmup time for the search benchmark is too short, but further investigation is needed.
This timeline has been produced as follows:

1. Run Rally with the JIT telemetry device: `esrally --telemetry=jit`
2. Convert the recorded JIT log to a timeline: `./launchHeadless.sh -t $JIT_OUT_FILE > es-timeline.txt`
3. Aggregate the number of compile events per second from `es-timeline.txt` with the script below:
```python
import sys

# Aggregates the JIT timeline (es-timeline.txt, read from stdin) into a CSV
# with one row per second containing the number of compile events.
current_bucket = ""
bucket_count = 0
idx = 0

print("idx,time,num_compile_events")
for line in sys.stdin:
    if line[13:21] == "Compiled":
        # bucket compile events by their timestamp prefix (first 8 characters)
        bucket = line[:8]
        if current_bucket == bucket:
            bucket_count += 1
        else:
            if not current_bucket == "":
                print("%d,%s,%d" % (idx, current_bucket, bucket_count))
            current_bucket = bucket
            bucket_count = 1
            idx += 1
```
We should also investigate why we get different segment counts when we run the same benchmark on a specific Elasticsearch version. Here is an example of the variance in index segment counts we get when running the same benchmark five times: 123, 109, 102, 105, 96 (related to #52)
Regarding my previous comment, the issue is related to not doing a force-merge after indexing. We do this now by default (see #52) but leave the user a choice (can be configured in track setup).
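For reference, a minimal sketch of what such a force-merge amounts to at the REST level (the index name and segment count are placeholders; Rally performs this step itself when it is enabled in the track setup):

```python
# Force-merge an index down to a single segment after indexing so that
# segment counts are comparable across runs. Index name and max_num_segments
# are placeholders.
import requests

r = requests.post("http://localhost:9200/geonames/_forcemerge",
                  params={"max_num_segments": 1})
r.raise_for_status()
print(r.json())
```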
I did a cross-validation of the reported GC times and checked four data sources:

* Rally's summary report (command line)
* the Node Stats API (`GET _nodes/stats/_all?level=shards`)
* Java Flight Recorder
* the GC log

I ran an index-only benchmark with `esrally --pipeline=benchmark-only --track-setup=4gheap`.
Also, I started Elasticsearch with the following JVM options:

```
-XX:+UnlockDiagnosticVMOptions -XX:+UnlockCommercialFeatures -XX:+DebugNonSafepoints -XX:+FlightRecorder -XX:FlightRecorderOptions=defaultrecording=true,disk=true,dumponexit=true,dumponexitpath=/Users/dm/recording.jfr -Xms256m -Xmx1g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=logs/heapdump.hprof -Delasticsearch -ea -Des.path.home=/Users/dm/Projects/elasticsearch/data/master -Des.node.attr=test -Xloggc:es-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime
```
Note that, contrary to the track setup name, I ran with just a 1 GB heap. This is not relevant here; I just needed an index-only benchmark. It would only matter if Rally had provisioned the cluster (which it did not).
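For reference, a sketch of how the young/old generation totals can be pulled out of the Node Stats response (this assumes the standard `jvm.gc.collectors` structure of that API; it is not the exact tooling used here):

```python
# Extract total GC time and collection counts per generation from the
# Node Stats API (same endpoint as mentioned above).
import requests

stats = requests.get("http://localhost:9200/_nodes/stats/_all?level=shards").json()

for node_id, node in stats["nodes"].items():
    collectors = node["jvm"]["gc"]["collectors"]
    for generation in ("young", "old"):
        gc = collectors[generation]
        print("%s / %s gen: %d ms in %d collections"
              % (node.get("name", node_id), generation,
                 gc["collection_time_in_millis"], gc["collection_count"]))
```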
Here are the reported numbers:
young generation GC time:
Rally (command line) | Node Stats API | Java Flight Recorder | GC log |
---|---|---|---|
79,416 ms | 79,452 ms (18,450 collections) | 89,003 ms (18,450 collections) | 80,853 ms (18,451 collections) |
old generation GC time:
Rally (command line) | Node Stats API | Java Flight Recorder | GC log |
---|---|---|---|
23,964 ms | 23,964 ms (500 collections) | 156,630 ms (500 collections) | 23,989 ms (500 collections) |
Note that Java Flight Recorder is off by around 10 seconds for the total young-gen GC time and around 133 seconds for the total old-gen GC time. It is not yet clear why the difference is so large. Summing up all individual old-gen events reported by Java Flight Recorder produces 23,288 ms (excluding the CMS initial mark phase) and 24,031 ms (including the CMS initial mark phase), which is much closer to the other numbers.
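The offsets mentioned above follow directly from the two tables:

```python
# Young gen: JFR total minus Rally's total (both in ms)
print(89003 - 79416)   # 9587 ms, i.e. "around 10 seconds"
# Old gen: JFR total minus Rally's total (both in ms)
print(156630 - 23964)  # 132666 ms, i.e. "around 133 seconds"
```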
Also, cross-validating the GC log with events from Java Flight Recorder shows similar times:
```
2016-03-02T15:21:45.462-0100: 3.495: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2882K(174784K)] 9013K(253440K), 0.0009459 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2016-03-02T15:21:45.463-0100: 3.496: Total time for which application threads were stopped: 0.0010216 seconds, Stopping threads took: 0.0000277 seconds
2016-03-02T15:21:45.463-0100: 3.496: [CMS-concurrent-mark-start]
2016-03-02T15:21:45.467-0100: 3.500: [CMS-concurrent-mark: 0.004/0.004 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
2016-03-02T15:21:45.467-0100: 3.500: [CMS-concurrent-preclean-start]
2016-03-02T15:21:45.468-0100: 3.500: [CMS-concurrent-preclean: 0.001/0.001 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
2016-03-02T15:21:45.468-0100: 3.500: Application time: 0.0043979 seconds
2016-03-02T15:21:45.468-0100: 3.501: [GC (CMS Final Remark) [YG occupancy: 6130 K (78656 K)]2016-03-02T15:21:45.468-0100: 3.501: [Rescan (parallel) , 0.0011471 secs]2016-03-02T15:21:45.469-0100: 3.502: [weak refs processing, 0.0000350 secs]2016-03-02T15:21:45.469-0100: 3.502: [class unloading, 0.0067130 secs]2016-03-02T15:21:45.476-0100: 3.508: [scrub symbol table, 0.0030474 secs]2016-03-02T15:21:45.479-0100: 3.512: [scrub string table, 0.0002564 secs][1 CMS-remark: 2882K(174784K)] 9013K(253440K), 0.0116463 secs] [Times: user=0.04 sys=0.00, real=0.02 secs]
2016-03-02T15:21:45.480-0100: 3.512: Total time for which application threads were stopped: 0.0118244 seconds, Stopping threads took: 0.0001361 seconds
2016-03-02T15:21:45.480-0100: 3.512: [CMS-concurrent-sweep-start]
2016-03-02T15:21:45.481-0100: 3.514: [CMS-concurrent-sweep: 0.002/0.002 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
2016-03-02T15:21:45.481-0100: 3.514: [CMS-concurrent-reset-start]
2016-03-02T15:21:45.486-0100: 3.518: [CMS-concurrent-reset: 0.004/0.004 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
2016-03-02T15:21:46.027-0100: 4.060: Application time: 0.5478287 seconds
```
This corresponds to the following GC event in Java Flight Recorder:
Event Type | Name | Duration |
---|---|---|
GC Phase Pause | Initial Mark | 966.021 µs |
GC Phase Pause | Final Mark | 11,663.502 µs |
So, we have 966 µs (JFR) vs. 946 µs (GC log) for the initial mark phase, and 11.663 ms (JFR) vs. 11.824 ms (GC log) for the final remark phase.
The gist of this analysis is: the individual GC pause events reported by Java Flight Recorder match the GC log closely, but JFR's reported totals for CMS are far higher than those of the other data sources.
So I did another run with the serial collector (i.e. `-XX:+UseSerialGC`), as I suspected that JFR shows the total time for a concurrent collector (CMS). Indeed, JFR's total times are now in line with the other data sources (just showing the node stats and JFR here for brevity):
Source | young generation GC time | young generation GC events | old generation GC time | old generation GC events |
---|---|---|---|---|
JFR | 123,933 ms | 9,929 | 47,055 ms | 228 |
Node Stats | 124,208 ms | 9,929 | 47,246 ms | 228 |
This supports the theory that the total times reported in JFR are really the total GC run time rather than the total time that the application is paused due to GC (i.e. for a concurrent collector this also includes the time the GC runs concurrently with application threads), or graphically put (red == GC threads, black == application threads):
The total times in JFR include all four phases for CMS, whereas for individual events only the "initial mark" and "remark" phases are reported (which are the only phases where all application threads are halted).
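Schematically, this means roughly the following (with the concurrent mark/preclean/sweep/reset work lumped together as "concurrent phases"):

```latex
t_{\mathrm{JFR\ total}} \approx t_{\mathrm{initial\ mark}} + t_{\mathrm{concurrent\ phases}} + t_{\mathrm{remark}},
\qquad
t_{\mathrm{pause}} = t_{\mathrm{initial\ mark}} + t_{\mathrm{remark}}
```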
I've talked to the Oracle Mission Control team regarding the differences between Mission Control and the GC log. It turns out that the "Total GC Time" as reported by Oracle Mission Control 5.5 is the time from the start of a GC run until its end (thus also including the time the GC is running concurrently with the application).
We should check for accidental bottlenecks in the benchmark driver, problems with system setup (like #3), etc. We should also check the overhead of different profilers (see #19) and the metrics store (see #8).
This just serves as a reminder ticket and it is expected that lots of related tickets will be created after the initial analysis.
Some concrete things to check:
Experimental setup
As a first step, I have recorded all HTTP requests and responses that are issued during the benchmark and have mocked the benchmark candidate with an nginx returning static responses (under the assumption that the bottleneck is now the benchmark driver).
The assumption that the benchmark driver is the bottleneck is supported by the result of the following benchmark with wrk against nginx:
post.lua represents one bulk request with 5000 documents. So, considering that we reach on average 174 requests/second, this is the equivalent of 174 * 5,000 = 870,000 documents/second. If Rally reaches numbers in this range, we should consider our mock Elasticsearch (nginx) the bottleneck; otherwise, Rally itself is.
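A quick sanity check of these numbers against the throughput reported earlier in this thread (the ~68,000 docs/second figure is taken from those trial runs):

```python
# nginx alone can absorb far more "documents" than Rally currently drives.
bulk_size = 5000              # documents per bulk request (post.lua)
nginx_requests_per_s = 174    # measured with wrk against nginx
rally_docs_per_s = 68000      # approximate Rally throughput from the trial runs above

nginx_docs_per_s = nginx_requests_per_s * bulk_size
print(nginx_docs_per_s)                     # 870000 docs/second
print(nginx_docs_per_s / rally_docs_per_s)  # nginx sustains roughly 12-13x more
```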
8 test threads were chosen because this is currently the maximum number of client threads that are used by Rally on the benchmark machine for this ticket (i.e. my notebook).