Indexing performance degradation in 6.0.0-beta1

elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine

https://www.elastic.co/products/elasticsearch

Other

70.93k stars 24.9k forks source link

Indexing performance degradation in 6.0.0-beta1 #26339

Closed muralikpbhat closed 7 years ago

muralikpbhat commented 7 years ago

Describe the feature: Indexing performance

Elasticsearch version (bin/elasticsearch --version): 6.0.0-beta1

Plugins installed: none

JVM version (java -version): openjdk version "1.8.0_131"

OS version (uname -a if on a Unix-like system): Linux 4.9.32-15.41.amzn1.x86_64 # 1 SMP Thu Jun 22 06:20:54 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Host: i3.16xl EC2 instance with 31GB heap configured, indexing buffer size of 50%, and data.path spread across 8 disks.

Description of the problem including expected versus actual behavior: Indexing throughput is degraded by up to 50% compared to 5.5.2. Seeing hotspots in LocalCheckpointTracker.margSeqNoAsCompleted(), generetaSeqNo() etc. Following shows the results for different tracks :

ES versions	5.5.2	6.0.0-beta1
genomes	178940 docs/s	88310 docs/s
nyc_taxis	208638 docs/s	150481 docs/s
logging	363691 docs/s	289247 docs/s

Steps to reproduce:

Restrict the rally challenge of each track to the following: { "name": "append-no-conflicts-just-index", "description": "benchmark", "index-settings": { "index.number_of_replicas": 0, "index.number_of_shards": 8, "index.refresh_interval": "100s", "index.translog.flush_threshold_size": "4gb", "index.translog.durability": "async", "index.translog.sync_interval": "240s" }, "schedule": [ { "operation": "index", "warmup-time-period": 240, "clients": 48 } ] }
Run the challenge against 5.5.2 and 6.0.0-beta1 versions. esrally --pipeline benchmark-only --track=nyc_taxis --challenge append-no-conflicts-just-index --target-hosts=XX.XX.XX.XX:9200

bleskes commented 7 years ago

thx for trying 6.0.0 - I added you to the pioneer program.

Seeing hotspots in LocalCheckpointTracker.margSeqNoAsCompleted(), generetaSeqNo()

Those methods are indeed synchronization points as we assumed they would be much lighter than indexing. Can you share the output that make you conclude this?

"index.translog.flush_threshold_size": "4gb", "index.translog.durability": "async", "index.translog.sync_interval": "240s"

This indeed removes a common bottle neck, shifting it to another place. Can you try without these? I want to confirm things become closer then - as our own benchmarks suggest.

coderplay commented 7 years ago

The degradation is caused by this line introduced by this commit whose original purpose was doing refactoring, but it added a lock in the critical path of indexing.

danielmitterdorfer commented 7 years ago

Thanks for reporting this! We've reproduced the issue on our nightly benchmarking hardware with a few modifications:

16GB heap configured (our machines have "only" 32GB)
32 clients

I have used the nyc_taxis track and I've seen the following median throughput:

6.0.0-beta1: 92261 docs/s
5.5.2: 98220 docs/s

While the effect is not as pronounced as in your case, it is still significant. For 6.0.0-beta1, we are seeing lock contention in LocalCheckpointTracker where locks were held for roughly 4s in total during the benchmark (for comparison: locks on TranslogWriter, which is the lock with the highest contention, were held for a total of 49 seconds).

In 80 out of 94 captured traces, the lock was held via LocalCheckpointTracker#generateSeqNo(). I'll implement this method in a lock-free manner.

LocalCheckpointTracker.markSeqNoAsCompleted(long) were held 14 times for a total of 580ms. As this method is (a) a lot trickier to implement lock-free and (b) contributing less to contention (in our tests), I'll only reduce the scope of the lock a bit.

danielmitterdorfer commented 7 years ago

Repeating my comment on the related PR #26362 here:

After running benchmarks in our nightly benchmarking environment, we see the following median throughput for the nyc_taxis track (24 clients):

5.5.2: 91773 docs/s
With #26362: 82125 docs/s
Without #26362: 82258 docs/s

In summary: This did not improve the situation.

We do, however, see that with this PR the monitor lock of org.elasticsearch.index.seqno.LocalCheckpointTracker is now blocked for 10.3 seconds during the benchmark (before ~ 4 seconds). In the majority of cases, the affected method is #markSeqNoAsCompleted(long). Hence, we'll look into improving #markSeqNoAsCompleted(long) too.

coderplay commented 7 years ago

@danielmitterdorfer Try more powerful machine. More powerful machine will gain better parallelism if there is no lock contention. The throughput difference will be larger.

danielmitterdorfer commented 7 years ago

Try more powerful machine. More powerful machine will gain better parallelism if there is no lock contention. The throughput difference will be larger.

I totally agree to this statement but for the time being, it's just important that we see a measurable difference due to the same hotspots that you are describing. When we're confident that the problem is fixed, I think it makes sense to run another benchmark on a more powerful machine.

danielmitterdorfer commented 7 years ago

To summarize the work so far:

I've reimplemented LocalCheckpointTracker so that it is completely lock-free. In the default case, it even avoids notification if it is certain that there are no waiting threads anyway (see LocalCheckpointTracker#waitForOpsToComplete(long)).

I've ran the nyc_taxis track with the following challenge:

{
  "name": "index-contention",
  "description": "Indexes the whole document corpus with more clients",
  "index-settings": {
    "index.number_of_replicas": 0,
    "index.number_of_shards": 8,
    "index.refresh_interval": "100s",
    "index.translog.flush_threshold_size": "4gb",
    "index.translog.durability": "async",
    "index.translog.sync_interval": "240s"
  },
  "schedule": [
    {
      "operation": "index",
      "warmup-time-period": 240,
      "clients": 24
    }
  ]
}

Note: So far I could never finish any benchmark with the originally stated number of 48 clients without running into bulk rejections and I consider a benchmark with an error rate > 0 invalid. Hence, I've reduced the number of clients to 24.

On our nightly benchmarking hardware (4 cores, 32GB) with a heap size of 16GB and indices.memory.index_buffer_size: 50%, we see the following results (median indexing throughput):

5.5.2: 90844 docs/s
6.0.0-beta2-SNAPSHOT (commit hash elastic/elasticsearch@41675c1): 83611 docs/s
6.0.0-beta2-SNAPSHOT + lock-free changes: 83208 docs/s

So for this configuration we do not see any noticeable difference between different implementations on 6.0.0-beta2 that cannot be attributed to run-to-run variance.

One significant contributor is contention in TranslogWriter due to frequent flushing of checkpoints. Therefore I also ran a benchmark with the index setting index.translog.generation_threshold_size: "2gb" (default is 64mb) and we see indeed an improvement. I only tested the configuration "6.0.0-beta2-SNAPSHOT + lock free changes" which resulted in a median indexing throughput of 86937 docs/s. Still not there but much closer. We also see less contention in TranslogWriter as checkpoints are flushed less often.

Finally, I also ran a benchmark on the same hardware as stated originally, i.e. i3.16xl with 8 data paths (formatted with ext4), a heap size of 31GB and indices.memory.index_buffer_size: 50%. The load generator (Rally) was on a c3.8xlarge in the same availability zone. I've removed all data and trimmed the disks before each run.

5.5.2: 240115 docs/s
6.0.0-beta2-SNAPSHOT (commit hash elastic/elasticsearch@41675c1): 144277 docs/s
6.0.0-beta2-SNAPSHOT + lock-free changes: 156933 docs/s

In this case, we can see a noticeable improvement due the lock-free implementation of LocalCheckpointTracker. An additional run with index.translog.generation_threshold_size: "2gb" did not show any improvement though (median indexing throughput 153709 docs/s).

As far as I can tell at the moment, after resolving problems in LocalCheckpointTracker the main bottleneck seems to be caused by heavy contention in TranslogWriter but this is subject to further investigation.

danielmitterdorfer commented 7 years ago

Further experiments have shown that a major contributor to higher contention in 6.0.0 is caused by a higher number threads in the system due to #20874. A i3.16xl has 64 cores which result in 64 bulk indexing threads on 6.0.0 whereas it was bounded to 32 bulk indexing threads on 5.x. Thus setting processors: 32 in config/elasticsearch.yml we see now a much saner behavior w.r.t. to lock contention.

With the same setup as before (same instance types but different instances) we get the following median indexing throughput:

5.5.2: 251121 docs/s
6.0.0-beta2-SNAPSHOT + lock-free changes with default processor settings: 152846 docs/s
6.0.0-beta2-SNAPSHOT + lock-free changes with processors: 32: 136893 docs/s

The top-contended lock in 6.0.0 is the monitor lock on TranslogWriter. Across the three benchmark candidates, this lock was contended the following number of times during the benchmark:

5.5.2: 1,050
6.0.0-beta2-SNAPSHOT + lock-free changes with default processor settings: 143,030
6.0.0-beta2-SNAPSHOT + lock-free changes with processors: 32: 13,708

While this measure has drastically reduced the number of times this lock is contended, we still see a significant difference in indexing throughput for these settings and hardware.

danielmitterdorfer commented 7 years ago

After further benchmarks have hinted that the cause may be Lucene, I've started bisecting. Indeed, we see a major drop in median indexing throughput after upgrading to Lucene 7 (measured on an i3.16xlarge):

last revision with Lucene 6.x (revision elastic/elasticsearch@f217eb8ad8): 241561 docs/s
first revision with Lucene 7.0 (revision elastic/elasticsearch@4632661bc7): 128667 docs/s

Here is the achieved median indexing throughput on a 4 core machine for comparison:

last revision with Lucene 6.x: 83898 docs/s
first revision with Lucene 7.0 : 82558 docs/s

So the drop we see is significantly lower on machines with lower number of cores.

The cause is not yet clear; two possibilities are JIT compiler behavior or lock contention.

danielmitterdorfer commented 7 years ago

The performance drop is caused by a write to a(n unused) variable in Lucene. We've seen that a significant portion of time is spent in related code. Here is a snippet of a JMH microbenchmark with the perfasm profiler:

....[Hottest Region 1]..............................................................................
C2, level 4, org.apache.lucene.index.DefaultIndexingChain$PerField::invert, version 1231 (792 bytes)

                          0x00007f52d56b026a: xor    %r10d,%r10d        ;*invokevirtual isAssignableFrom
                                                                        ; - org.apache.lucene.util.AttributeSource::addAttribute@28 (line 207)
                                                                        ; - org.apache.lucene.document.Field$BinaryTokenStream::&lt;init&gt;@8 (line 512)
                                                                        ; - org.apache.lucene.document.Field::tokenStream@82 (line 491)
                                                                        ; - org.apache.lucene.index.DefaultIndexingChain$PerField::invert@99 (line 729)
                          0x00007f52d56b026d: mov    $0x8,%r11d         ;*invokeinterface iterator
                                                                        ; - org.apache.lucene.util.AttributeSource::getCurrentState@46 (line 254)
                                                                        ; - org.apache.lucene.util.AttributeSource::clearAttributes@1 (line 269)
                                                                        ; - org.apache.lucene.document.Field$BinaryTokenStream::incrementToken@10 (line 532)
                                                                        ; - org.apache.lucene.index.DefaultIndexingChain$PerField::invert@153 (line 736)
  0.00%    0.02%       ↗  0x00007f52d56b0273: test   %r10,%r10
                  ╭    │  0x00007f52d56b0276: je     0x00007f52d56b0292  ;*getfield fieldsData
                  │    │                                                ; - org.apache.lucene.document.Field::binaryValue@1 (line 441)
                  │    │                                                ; - org.apache.lucene.document.Field::tokenStream@65 (line 487)
                  │    │                                                ; - org.apache.lucene.index.DefaultIndexingChain$PerField::invert@99 (line 729)
  0.00%    0.00%  │    │  0x00007f52d56b0278: mov    (%r11),%rsi        ;*getfield next
                  │    │                                                ; - java.util.HashMap::getNode@137 (line 580)
                  │    │                                                ; - java.util.LinkedHashMap::get@6 (line 440)
                  │    │                                                ; - org.apache.lucene.util.AttributeSource::getAttribute@6 (line 245)
                  │    │                                                ; - org.apache.lucene.index.DefaultIndexingChain$PerField::invert@143 (line 734)
  0.09%    0.51%  │    │  0x00007f52d56b027b: mov    0x18(%rsi),%r8
 23.70%    3.54%  │    │  0x00007f52d56b027f: mov    $0x7f4926a81d88,%rcx  ;   {metadata(&apos;org/apache/lucene/analysis/tokenattributes/CharTermAttribute&apos;)}

I've created a custom version of Lucene (and Elasticsearch) where the only difference is that I've eliminated this variable. Here are the results of the nyc_taxis benchmark (again same benchmark setup as always):

Baseline: last revision with Lucene 6.x (revision elastic/elasticsearch@f217eb8ad8): 218361 docs/s
first revision with Lucene 7.0 (revision elastic/elasticsearch@4632661bc7): 125418 docs/s
first revision with Lucene 7.0 (revision elastic/elasticsearch@4632661bc7) + unused variable removed: 221237 docs/s

I've raised LUCENE-7963 together with a patch. As soon as we have a new Lucene snapshot build which contains my patch, we can iterate here. The next step is then to see whether we still need a lock-free implementation of LocalCheckpointTracker (i.e. #26362).

nitsanw commented 7 years ago

@danielmitterdorfer If I read this correctly (and there's always a chance to misread assembly) the line blame here is falsely accused. At least the assembly line (not arguing about the unused local var). The line: 0x00007f52d56b027f: mov $0x7f4926a81d88,%rcx I setting the RCX register to a constant, so costs nothing and cannot be the issue. It's likely that the issue is in fact one of the lines preceding it:

  0.00%    0.00%  │    │  0x00007f52d56b0278: mov    (%r11),%rsi        ;*getfield next
                  │    │                                                ; - java.util.HashMap::getNode@137 (line 580)
                  │    │                                                ; - java.util.LinkedHashMap::get@6 (line 440)
                  │    │                                                ; - org.apache.lucene.util.AttributeSource::getAttribute@6 (line 245)
                  │    │                                                ; - org.apache.lucene.index.DefaultIndexingChain$PerField::invert@143 (line 734)
  0.09%    0.51%  │    │  0x00007f52d56b027b: mov    0x18(%rsi),%r8
 23.70%    3.54%  │    │  0x00007f52d56b027f: mov    $0x7f4926a81d88,%rcx  ;   {metadata(&apos;org/apache/lucene/analysis/tokenattributes/CharTermAttribute&apos;)}
  0.00%    0.00%  │    │  0x00007f52d56b0289: cmp    %rcx,%r8
                  │    │  0x00007f52d56b028c: jne    0x00007f52d56b0949  ;*instanceof
                  │    │                                                ; - org.apache.lucene.document.Field::binaryValue@4 (line 441)
                  │    │                                                ; - org.apache.lucene.document.Field::tokenStream@65 (line 487)
                  │    │                                                ; - org.apache.lucene.index.DefaultIndexingChain$PerField::invert@99 (line 729)

The cost would be the code in here: https://github.com/apache/lucene-solr/blob/e2521b2a8baabdaf43b92192588f51e042d21e97/lucene/core/src/java/org/apache/lucene/util/AttributeSource.java#L244 Which is a HashMap lookup, which is probably a costly cache miss. The cache miss theory is supported by the poor CPI (blamed on the wrong instruction) of roughly 6.5 cycles per instruction. I would hazard a guess the hash lookup is a loop (see here) , and this is not one cache miss but many that get blamed on the wrong instruction.

The line of code, while initialising an unused local variable, cannot be easily eliminated by the JVM as the lookup code may have a side effect (e.g. return the wrong type and trigger a ClassCastException), and it is hard to prove that a complex piece of code such as the one triggered to init the variable has no side effects.

In any case, removing the unused variable is (as backed by your results) the right solution to the issue :-)

danielmitterdorfer commented 7 years ago

The cost would be the code in here: https://github.com/apache/lucene-solr/blob/e2521b2a8baabdaf43b92192588f51e042d21e97/lucene/core/src/java/org/apache/lucene/util/AttributeSource.java#L244 [...] which is a HashMap lookup, which is probably a costly cache miss.

I have to admit, I misread this instruction (reading assembly is not something I do every day ;) ). Your explanation is much more sound to me as to why we have such a huge cost.

The line of code, while initialising an unused local variable, cannot be easily eliminated by the JVM as the lookup code may have a side effect (e.g. return the wrong type and trigger a ClassCastException), and it is hard to prove that a complex piece of code such as the one triggered to init the variable has no side effects.

That makes perfect sense to me and also solves the second part of the puzzle. Thanks a lot for your explanations @nitsanw. I really appreciate it.

danielmitterdorfer commented 7 years ago

I've run final benchmarks now. It's the same benchmarking scenario as always but I'll summarize it here.

Benchmark Setup and Methodology

Load driver machine: c3.8xlarge, Ubuntu 16.04
Target machine: i3.16xlarge, Ubuntu 16.04

Elasticsearch is configured to use 8 data paths (ext4-formatted) and 31 GB heap size. Before each trial run, we wipe the disks and run TRIM.

config/elasticsearch.yml:

network.host: 0.0.0.0
path.data: [/mnt/data0, /mnt/data1, /mnt/data2, /mnt/data3, /mnt/data4, /mnt/data5, /mnt/data6, /mnt/data7]
indices.memory.index_buffer_size: 50%

config/jvm.options: out of the box configuration, except that we set:

-Xms31G
-Xmx31G

We use Rally and the nyc_taxis track with the following challenge:

{
  "name": "index-contention",
  "description": "Indexes the whole document corpus with more clients",
  "index-settings": {
    "index.number_of_replicas": 0,
    "index.number_of_shards": 8,
    "index.refresh_interval": "100s",
    "index.translog.flush_threshold_size": "4gb",
    "index.translog.durability": "async",
    "index.translog.sync_interval": "240s"
  },
  "schedule": [
    {
      "operation": "index",
      "warmup-time-period": 240,
      "clients": 24
    }
  ]
}

Results

Version	`processors` setting	Min Indexing Throughput	Median Indexing Throughput	Max Indexing Throughput
5.5.2	"default" (was bounded to 32)	229286 docs/s	239590 docs/s	242889 docs/s
6.0.0-beta2*	32	230524 docs/s	236368 docs/s	242602 docs/s
6.0.0-beta2*	"default" (unbounded, i.e. 64)	227845 docs/s	238880 docs/s	246613 docs/s
6.0.0-beta2* + https://github.com/elastic/elasticsearch/pull/26362	32	228910 docs/s	234151 docs/s	240606 docs/s
6.0.0-beta2* + https://github.com/elastic/elasticsearch/pull/26362	"default" (unbounded, i.e. 64)	221387 docs/s	237433 docs/s	242800 docs/s

*) includes the patch from LUCENE-7963

Analysis

We see now comparable performance between 5.5.2 and 6.0.0-beta2. In JFR (Java flight recorder) we can see that the new unbounded default for processors causes significantly more contention than with 32 processors. This is also the reason we are not seeing much improvement between setting processors to 32 and keeping the default value. While this is an unfortunate situation I think it is out of the scope of this ticket to fix these contention issues.

Comparing the lock-free implementation of LocalCheckpointTracker (see https://github.com/elastic/elasticsearch/pull/26362) with the current implementation does not yield any benefits and that's why I will also close the PR without merging it. Flight recordings show that we just put more pressure on the monitor lock on TranslogWriter. So while it may be beneficial in the future to implement LocalCheckpointTracker lock-free for machines with a high core count, I think we first need to address other issues before we can see a benefit.

I'll leave this ticket open until we've actually upgraded to a Lucene version that includes the patch from LUCENE-7963.

Thanks @muralikpbhat for bringing this up.

muralikpbhat commented 7 years ago

Thanks @danielmitterdorfer for taking this to a nice conclusion. Very interesting find of getAttribute() issue in LUCENE-7963

In JFR (Java flight recorder) we can see that the new unbounded default for processors causes significantly more contention than with 32 processors. This is also the reason we are not seeing much improvement between setting processors to 32 and keeping the default value.

I think we first need to address other issues before we can see a benefit.

Surprising that your lock-free implementation is not giving better performance with 64cores. Are you saying Translogwriter monitor lock needs to be fixed before we make this lock free or are there more issues than the Translogwriter?

coderplay commented 7 years ago

My recent test on nyc_taxis with 5.5.2 is about 288K, but with different configuration. Will make another comparison later.

danielmitterdorfer commented 7 years ago

Surprising that your lock-free implementation is not giving better performance with 64cores. Are you saying Translogwriter monitor lock needs to be fixed before we make this lock free or are there more issues than the Translogwriter?

As far as I can see from the most recent flight recordings, the reason is that LocalCheckpointTracker is not the bottleneck. Although there is measurable contention in the current implementation, making it lock-free just increases contention in TranslogWriter.

IMHO, Elasticsearch's approach to scalability is to scale horizontally rather than vertically (i.e. many small nodes instead of a few large nodes). As a consequence of that design decision, we see contention on machines with a high number of cores. Or more specifically: The root cause is that multiple writers attempt to write concurrently to the translog and guarding it with a lock does not scale to machines with high core counts. A more scalable solution could be to apply the single-writer principle.

s1monw commented 7 years ago

IMHO, Elasticsearch's approach to scalability is to scale horizontally rather than vertically (i.e. many small nodes instead of a few large nodes). As a consequence of that design decision, we see contention on machines with a high number of cores. Or more specifically: The root cause is that multiple writers attempt to write concurrently to the translog and guarding it with a lock does not scale to machines with high core counts. A more scalable solution could be to apply the single-writer principle.

I had multiple attempts to this in the past that didn't yield massive speed improvements compared to the complexity it added. Lets say you'd use a ringbuffer to do this you'd need to copy the payload for every document which in-turn adds a reasonable complexity in terms of memory consumption and GC. Today the impl is basically writing to a buffered stream which is reusing it's buffer so we only pay the price of copying the bytes. It's all hidden behind the stream interface which is nice. IMO there needs to be a significant speed improvement to justify a Single Writer refactoring.

coderplay commented 7 years ago

Configurations

Load driver machine: i3.16xlarge, Amazon Linux
Target machine: i3.16xlarge, Amazon Linux, 128GB heap configured for ES, G1 GC, MaxGCPauseMillis=1000, indexing buffer size of 50%, data.path spread across 8 disks, 32 netty workers, 64 bulk workers.

Challenge settings:

      "index-settings": {
        "index.number_of_shards": 24,
        "index.codec": "best_compression",
        "index.number_of_replicas": 0,
        "index.refresh_interval": "100s",
        "index.translog.flush_threshold_size": "4g",
        "index.translog.durability": "async",
        "index.translog.sync_interval": "240s"
      },
      "schedule": [
        {
          "operation": "index",
          "warmup-time-period": 240,
          "clients": 48
        }
      ]

Results

Version	Min Indexing Throughput	Median Indexing Throughput	Max Indexing Throughput
5.5.2	268327 docs/s	281017 docs/s	288709 docs/s
6.0.0-beta2	166966 docs/s	168839 docs/s	171388 docs/s
6.0.0-beta2*	271438 docs/s	277353 docs/s	285558 docs/s

5.5.2 loosed max threads number limit.

@danielmitterdorfer Still a nice work!

The TranslogWriter lock contention problem I believe has been around for a long time. 5.5.2 also has that problem. Writings on the same index shard share the same WAL writer. Single writer principle is the ideal solution. In reality, we usually use group commit to solve this problem.

Hbase uses disruptor ringbuffer achieved group commit. We probably can learn from it.

@s1monw I don't think we need to copy the payload, since TransLog already created a new byte array in ReleasableBytesStreamOutput.

  public Location add(final Operation operation) throws IOException {
        final ReleasableBytesStreamOutput out = new ReleasableBytesStreamOutput(bigArrays);
...
            final ReleasablePagedBytesReference bytes = out.bytes();
...

just keep the reference object in the buffer and dump them to the WAL file in a batch should work.

Regarding to the GC, elasticsearch already had serious YGC issue. The JVM allocates ~9.11GB objects per second in nyc_taxis benchmark, most of them are allocated by org.elasticsearch.index.mapper.DocumentParser.internalParseDocument(Mapping, ParseContext$InternalParseContext, XContentParser).
The nyc_taxis indexing takes ~630 secs in 5.5.2, times 9.11 GB/s , that's 5.74TB memory allocated just for a size of 74.3 GB dataset. The GC couldn't be worse even we copy payloads.

jpountz commented 7 years ago

Fixed via #26744