elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
1.2k stars 24.84k forks source link

Excessive GC times #28227

Closed motherhubbard closed 6 years ago

motherhubbard commented 6 years ago

Elasticsearch version (bin/elasticsearch --version): Version: 5.5.1, Build: 19c13d0/2017-07-18T20:44:24.823Z, JVM: 1.8.0_141

JVM version (java -version): openjdk version "1.8.0_141" OpenJDK Runtime Environment (build 1.8.0_141-b16) OpenJDK 64-Bit Server VM (build 25.141-b16, mixed mode)

OS version (uname -a if on a Unix-like system): Linux kyc-demo 2.6.32-696.10.1.el6.x86_64 #1 SMP Tue Aug 22 18:51:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

We are seeing excessive gc warning messages in the elastic logs along the lines of: [2018-01-15T15:39:27,626][INFO ][o.e.m.j.JvmGcMonitorService] [main-single-node] [gc][255485] overhead, spent [447ms] collecting in the last [1s]

We are seeing this on a couple of our elastic installations now (performing different tasks). I've managed to isolate one of the causes of the warning message to being when we issue a bulk request containing a number of updates (by id) (which include a groovy script).

After seeing https://github.com/elastic/elasticsearch/issues/11371 we have rewritten the the script in painless but gc warnings are still seen.

In the test elastic install we have approximately 300k documents over 15 shards on 1 ES node running with: -Xms5g -Xmx5g

The total bulk request length is 3.6MB with 60 update items within the batch.

A sample call takes 2.7 seconds. During this call the following lines are seen in the elastic log: [2018-01-15T16:49:11,595][INFO ][o.e.m.j.JvmGcMonitorService] [main-single-node] [gc][259607] overhead, spent [425ms] collecting in the last [1s] [2018-01-15T16:49:12,595][INFO ][o.e.m.j.JvmGcMonitorService] [main-single-node] [gc][259608] overhead, spent [327ms] collecting in the last [1s]

Its worth saying we have a lot of nested documents in our schema. The sanitised mapping is included below: { "ourindex": { "mappings": { "ourtype": { "dynamic": "strict", "properties": { "a1": { "type": "boolean" }, "a2": { "type": "nested", "properties": { "a3": { "type": "keyword" }, "a4": { "type": "keyword" }, "a5": { "type": "keyword" }, "a6": { "type": "long" }, "a7": { "type": "keyword" }, "a8": { "type": "keyword" }, "a9": { "dynamic": "true", "properties": { "a10": { "type": "nested", "properties": { "a11": { "type": "keyword" }, "a12": { "type": "keyword" }, "a13": { "type": "keyword" }, "a14": { "type": "keyword" } } }, "a15": { "type": "keyword" }, "a16": { "type": "keyword" }, "a17": { "type": "keyword" }, "a18": { "type": "boolean" } } }, "a19": { "type": "keyword" }, "a20": { "type": "keyword" } } }, "a21": { "type": "nested", "properties": { "a22": { "type": "keyword" }, "a23": { "type": "keyword" }, "a24": { "type": "keyword" }, "a25": { "type": "long" }, "a26": { "type": "keyword" }, "a27": { "type": "keyword" }, "a28": { "dynamic": "true", "properties": { "a29": { "type": "nested", "properties": { "a30": { "type": "keyword" }, "a31": { "type": "keyword" }, "a32": { "type": "keyword" }, "a33": { "type": "keyword" } } }, "a34": { "type": "keyword" }, "a35": { "type": "keyword" }, "a36": { "type": "keyword" }, "a37": { "type": "boolean" } } }, "a38": { "type": "keyword" }, "a39": { "type": "keyword" } } }, "a40": { "type": "keyword" }, "a41": { "type": "long" }, "a42": { "type": "long" }, "a43": { "type": "date", "format": "epoch_millis||strict_date_optional_time" }, "a44": { "type": "nested", "properties": { "a45": { "type": "date", "format": "epoch_millis||strict_date_optional_time" }, "a46": { "properties": { "a47": { "type": "keyword" }, "a48": { "type": "keyword" }, "a49": { "type": "keyword" }, "a50": { "type": "long" }, "a51": { "type": "keyword" }, "a52": { "type": "keyword" }, "a53": { "dynamic": "true", "properties": { "a54": { "type": "nested", "properties": { "a55": { "type": "keyword" }, "a56": { "type": "keyword" }, "a57": { "type": "keyword" }, "a58": { "type": "keyword" } } }, "a59": { "type": "keyword" }, "a60": { "type": "keyword" }, "a61": { "type": "keyword" }, "a62": { "type": "boolean" } } }, "a63": { "type": "keyword" }, "a64": { "type": "keyword" } } }, "a65": { "type": "keyword" } } }, "a66": { "type": "date", "format": "epoch_millis||strict_date_optional_time" }, "a67": { "properties": { "a68": { "properties": { "a69": { "type": "date", "format": "epoch_millis||strict_date_optional_time" }, "a70": { "properties": { "a71": { "type": "keyword" }, "a72": { "type": "keyword" }, "a73": { "type": "keyword" }, "a74": { "type": "long" }, "a75": { "type": "keyword" }, "a76": { "type": "keyword" }, "a77": { "dynamic": "true", "properties": { "a78": { "type": "nested", "properties": { "a79": { "type": "keyword" }, "a80": { "type": "keyword" }, "a81": { "type": "keyword" }, "a82": { "type": "keyword" } } }, "a83": { "type": "keyword" }, "a84": { "type": "keyword" }, "a85": { "type": "keyword" }, "a86": { "type": "boolean" } } }, "a87": { "type": "keyword" }, "a88": { "type": "keyword" } } }, "a89": { "type": "keyword" } } }, "a90": { "type": "integer" }, "a91": { "type": "integer" }, "a92": { "type": "integer" }, "a93": { "type": "date", "format": "epoch_millis||strict_date_optional_time" } } }, "a94": { "type": "keyword" }, "a95": { "dynamic": "strict", "properties": { "a96": { "type": "text", "analyzer": "lowercase_analyzer" }, "a97": { "type": "text", "analyzer": "lowercase_analyzer" }, "a98": { "dynamic": "strict", "properties": { "a99": { "type": "keyword" }, "a100": { "type": "keyword" }, "a101": { "type": "keyword" }, "a102": { "type": "keyword" }, "a103": { "type": "keyword" }, "a104": { "type": "keyword" }, "a105": { "type": "keyword" }, "a106": { "type": "keyword" }, "a107": { "type": "keyword" }, "a108": { "type": "keyword" }, "a109": { "type": "keyword" } } }, "a110": { "dynamic": "strict", "properties": { "a111": { "type": "keyword" }, "a112": { "type": "keyword" }, "a113": { "type": "keyword" }, "a114": { "type": "keyword" }, "a115": { "type": "keyword" }, "a116": { "type": "keyword" }, "a117": { "type": "keyword" }, "a118": { "type": "keyword" } } }, "a119": { "type": "keyword" }, "a120": { "type": "keyword" }, "a121": { "type": "keyword" }, "a122": { "type": "keyword" }, "a123": { "type": "keyword" }, "a124": { "type": "keyword" }, "a125": { "type": "keyword" }, "a126": { "type": "date" }, "a127": { "type": "keyword" }, "a128": { "type": "keyword" }, "a129": { "type": "keyword" }, "a130": { "dynamic": "strict", "properties": { "a131": { "type": "keyword" }, "a132": { "type": "keyword" }, "a133": { "type": "keyword" } } }, "a134": { "type": "keyword" }, "a135": { "type": "keyword" }, "a136": { "dynamic": "strict", "properties": { "a137": { "type": "object", "enabled": false }, "a138": { "type": "object", "enabled": false } } }, "a139": { "type": "keyword" }, "a140": { "type": "keyword" }, "a141": { "type": "keyword" }, "a142": { "type": "keyword" }, "a143": { "type": "date" }, "a144": { "type": "keyword" }, "a145": { "type": "keyword" }, "a146": { "type": "date" }, "a147": { "type": "keyword" }, "a148": { "type": "text", "analyzer": "lowercase_analyzer" }, "a149": { "type": "text", "analyzer": "lowercase_analyzer" }, "a150": { "type": "text", "analyzer": "lowercase_analyzer" }, "a151": { "type": "text", "analyzer": "lowercase_analyzer" }, "a152": { "type": "text", "analyzer": "lowercase_analyzer" }, "a153": { "type": "keyword" }, "a154": { "type": "keyword" }, "a155": { "type": "keyword" }, "a156": { "type": "keyword" }, "a157": { "dynamic": "strict", "properties": { "a158": { "type": "keyword" }, "a159": { "type": "keyword" }, "a160": { "type": "keyword" }, "a161": { "type": "date" }, "a162": { "type": "keyword" }, "a163": { "type": "text" }, "a164": { "type": "date" }, "a165": { "type": "keyword" } } }, "a166": { "type": "keyword" }, "a167": { "dynamic": "true", "properties": { "a168": { "type": "keyword" }, "a169": { "type": "keyword" }, "a170": { "type": "keyword" }, "a171": { "type": "keyword" }, "a172": { "type": "keyword" }, "a173": { "type": "nested", "properties": { "a174": { "type": "keyword" }, "a175": { "type": "keyword" }, "a176": { "type": "keyword" } } }, "a177": { "type": "nested", "properties": { "a178": { "type": "keyword" }, "a179": { "type": "keyword" }, "a180": { "type": "keyword" } } }, "a181": { "type": "nested", "properties": { "a182": { "type": "keyword" }, "a183": { "type": "keyword" }, "a184": { "type": "keyword" } } }, "a185": { "type": "boolean" }, "a186": { "type": "keyword" }, "a187": { "type": "keyword" }, "a188": { "type": "keyword" }, "a189": { "type": "keyword" }, "a190": { "type": "keyword" }, "a191": { "type": "keyword" }, "a192": { "type": "keyword" }, "a193": { "type": "keyword" }, "a194": { "type": "keyword" }, "a195": { "type": "keyword" } } }, "a196": { "type": "keyword" }, "a197": { "type": "keyword" }, "a198": { "type": "keyword", "normalizer": "lowercase_normalizer" }, "a199": { "type": "keyword", "normalizer": "lowercase_normalizer" }, "a200": { "type": "keyword", "normalizer": "lowercase_normalizer" }, "a201": { "type": "keyword", "normalizer": "lowercase_normalizer" }, "a202": { "type": "keyword", "normalizer": "lowercase_normalizer" }, "a203": { "type": "keyword", "normalizer": "lowercase_normalizer" }, "a204": { "type": "keyword", "normalizer": "lowercase_normalizer" }, "a205": { "type": "keyword", "normalizer": "lowercase_normalizer" }, "a206": { "type": "keyword", "normalizer": "lowercase_normalizer" }, "a207": { "type": "keyword", "normalizer": "lowercase_normalizer" }, "a208": { "type": "keyword", "normalizer": "lowercase_normalizer" }, "a209": { "type": "keyword", "normalizer": "lowercase_normalizer" }, "a210": { "type": "keyword", "normalizer": "lowercase_normalizer" }, "a211": { "type": "keyword" }, "a212": { "type": "keyword" }, "a213": { "type": "keyword" } } }, "a214": { "type": "keyword" }, "a215": { "type": "keyword" }, "a216": { "type": "date", "format": "epoch_millis||strict_date_optional_time" }, "a217": { "type": "date", "format": "epoch_millis||strict_date_optional_time" }, "a218": { "type": "date", "format": "epoch_millis||strict_date_optional_time" }, "a219": { "type": "nested", "properties": { "a220": { "type": "boolean" }, "a221": { "type": "keyword" }, "a222": { "type": "text" }, "a223": { "type": "long" }, "a224": { "type": "integer" }, "a225": { "type": "keyword" }, "a226": { "type": "nested", "properties": { "a227": { "type": "text", "index": false }, "a228": { "type": "long" }, "a229": { "type": "keyword" }, "a230": { "type": "text" }, "a231": { "properties": { "a232": { "properties": { "a233": { "type": "keyword" }, "a234": { "type": "keyword" }, "a235": { "type": "keyword" } } } } }, "a236": { "type": "long" }, "a237": { "type": "keyword" }, "a238": { "type": "keyword" }, "a239": { "type": "keyword" }, "a240": { "type": "long" } } }, "a241": { "type": "long" }, "a242": { "type": "long" }, "a243": { "type": "long" }, "a244": { "type": "date", "format": "epoch_millis||strict_date_optional_time" }, "a245": { "type": "date", "format": "epoch_millis||strict_date_optional_time" }, "a246": { "type": "boolean" } } }, "a247": { "type": "nested", "properties": { "a248": { "type": "keyword" }, "a249": { "type": "date" } } }, "a250": { "dynamic": "true", "properties": { "a251": { "type": "keyword" }, "a252": { "type": "keyword" }, "a253": { "type": "keyword" }, "a254": { "type": "keyword" }, "a255": { "type": "keyword" }, "a256": { "type": "nested", "properties": { "a257": { "type": "keyword" }, "a258": { "type": "keyword" }, "a259": { "type": "keyword" } } }, "a260": { "type": "nested", "properties": { "a261": { "type": "keyword" }, "a262": { "type": "keyword" }, "a263": { "type": "keyword" } } }, "a264": { "type": "nested", "properties": { "a265": { "type": "keyword" }, "a266": { "type": "keyword" }, "a267": { "type": "keyword" } } }, "a268": { "type": "boolean" }, "a269": { "type": "keyword" }, "a270": { "type": "keyword" }, "a271": { "type": "keyword" }, "a272": { "type": "keyword" }, "a273": { "type": "keyword" }, "a274": { "type": "keyword" }, "a275": { "type": "keyword" }, "a276": { "type": "keyword" }, "a277": { "type": "keyword" }, "a278": { "type": "keyword" } } }, "a279": { "type": "integer" }, "a280": { "type": "nested", "properties": { "a281": { "properties": { "a282": { "type": "date", "format": "epoch_millis||strict_date_optional_time" }, "a283": { "properties": { "a284": { "type": "keyword" }, "a285": { "type": "keyword" }, "a286": { "type": "keyword" }, "a287": { "type": "long" }, "a288": { "type": "keyword" }, "a289": { "type": "keyword" }, "a290": { "dynamic": "true", "properties": { "a291": { "type": "nested", "properties": { "a292": { "type": "keyword" }, "a293": { "type": "keyword" }, "a294": { "type": "keyword" }, "a295": { "type": "keyword" } } }, "a296": { "type": "keyword" }, "a297": { "type": "keyword" }, "a298": { "type": "keyword" }, "a299": { "type": "boolean" } } }, "a300": { "type": "keyword" }, "a301": { "type": "keyword" } } }, "a302": { "type": "keyword" } } }, "a303": { "type": "integer" }, "a304": { "type": "long" }, "a305": { "type": "integer" }, "a306": { "type": "integer" }, "a307": { "type": "date", "format": "epoch_millis||strict_date_optional_time" } } }, "a308": { "type": "nested", "properties": { "a309": { "properties": { "a310": { "type": "date", "format": "epoch_millis||strict_date_optional_time" }, "a311": { "properties": { "a312": { "type": "keyword" }, "a313": { "type": "keyword" }, "a314": { "type": "keyword" }, "a315": { "type": "long" }, "a316": { "type": "keyword" }, "a317": { "type": "keyword" }, "a318": { "dynamic": "true", "properties": { "a319": { "type": "nested", "properties": { "a320": { "type": "keyword" }, "a321": { "type": "keyword" }, "a322": { "type": "keyword" }, "a323": { "type": "keyword" } } }, "a324": { "type": "keyword" }, "a325": { "type": "keyword" }, "a326": { "type": "keyword" }, "a327": { "type": "boolean" } } }, "a328": { "type": "keyword" }, "a329": { "type": "keyword" } } }, "a330": { "type": "keyword" } } }, "a331": { "type": "integer" }, "a332": { "type": "keyword" }, "a333": { "type": "keyword" }, "a334": { "type": "date" }, "a335": { "type": "keyword" }, "a336": { "type": "long" }, "a337": { "properties": { "a338": { "type": "keyword" }, "a339": { "type": "keyword" }, "a340": { "type": "keyword" }, "a341": { "type": "long" }, "a342": { "type": "keyword" }, "a343": { "type": "keyword" }, "a344": { "dynamic": "true", "properties": { "a345": { "type": "nested", "properties": { "a346": { "type": "keyword" }, "a347": { "type": "keyword" }, "a348": { "type": "keyword" }, "a349": { "type": "keyword" } } }, "a350": { "type": "keyword" }, "a351": { "type": "keyword" }, "a352": { "type": "keyword" }, "a353": { "type": "boolean" } } }, "a354": { "type": "keyword" }, "a355": { "type": "keyword" } } }, "a356": { "type": "integer" }, "a357": { "type": "integer" }, "a358": { "type": "long" }, "a359": { "type": "long" }, "a360": { "type": "date", "format": "epoch_millis||strict_date_optional_time" }, "a361": { "type": "nested", "properties": { "a362": { "type": "keyword" }, "a363": { "type": "keyword" }, "a364": { "type": "keyword" }, "a365": { "type": "long" }, "a366": { "type": "keyword" }, "a367": { "type": "keyword" }, "a368": { "dynamic": "true", "properties": { "a369": { "type": "nested", "properties": { "a370": { "type": "keyword" }, "a371": { "type": "keyword" }, "a372": { "type": "keyword" }, "a373": { "type": "keyword" } } }, "a374": { "type": "keyword" }, "a375": { "type": "keyword" }, "a376": { "type": "keyword" }, "a377": { "type": "boolean" } } }, "a378": { "type": "keyword" }, "a379": { "type": "keyword" } } }, "a380": { "type": "long" }, "a381": { "type": "keyword" } } }, "a382": { "type": "integer" } } } } } }

We have tried rolling the 60 separate update requests into a single call but this did not help the timings and again the gc warnings were seen.

We have benchmarked our script and it typically takes around 1ms to run.

Each update will update up to one document (so in the batch we will update up to 60 documents).

On our other cluster where the gc warnings are also seen the schema is simpler (in that there are a lot less nested documents) but we do still use the scripted update.

I'm intrigued to know why this seems to be so slow.

Thanks in advance.

motherhubbard commented 6 years ago

I've started recording timings based on the number of items in the batch:

Items Request size Elastic Duration (Range in ms)
1 75273 40 - 90
2 160498 100 - 200
3 233460 100 - 200
4 271235 100 - 200
5 352107 355 - 600
6 391510 355 - 600
10 624568 355 - 600
20 1230996 700 - 1200
50 3031171 1600 - 2500 (I start seeing GC warning messages at this point)
60 3643834 2300 - 7470 ( the outlier had a very big GC pause: [2018-01-16T10:18:35,469][WARN ][o.e.m.j.JvmGcMonitorService] [main-single-node] [gc][322550] overhead, spent [4.2s] collecting in the last [4.7s])

I am calculating the range over 10 requests for each row in the table.

There is no other load on the box at the same time. I have 8 CPUs on the server.

Are these big GCs to be expected?

motherhubbard commented 6 years ago

Looking at the JVM stats, For our 3.6M input message we seem to be adding about 1G to the heap (excluding anything that may have been collected during the process): Before request:

          "heap_used_in_bytes" : 1706120912,
          "heap_used_percent" : 32,
          "heap_committed_in_bytes" : 5298978816,
          "heap_max_in_bytes" : 5298978816,
          "non_heap_used_in_bytes" : 168750512,
          "non_heap_committed_in_bytes" : 220000256,
          "pools" : {
            "young" : {
              "used_in_bytes" : 75764152,
              "max_in_bytes" : 558432256,
              "peak_used_in_bytes" : 558432256,
              "peak_max_in_bytes" : 558432256
            },
            "survivor" : {
              "used_in_bytes" : 44046728,
              "max_in_bytes" : 69730304,
              "peak_used_in_bytes" : 69730304,
              "peak_max_in_bytes" : 69730304
            },
            "old" : {
              "used_in_bytes" : 1586310032,
              "max_in_bytes" : 4670816256,
              "peak_used_in_bytes" : 4645066960,
              "peak_max_in_bytes" : 4670816256
            }
          }
        },
        "threads" : {
          "count" : 90,
          "peak_count" : 102
        },
        "gc" : {
          "collectors" : {
            "young" : {
              "collection_count" : 25418,
              "collection_time_in_millis" : 3737009
            },
            "old" : {
              "collection_count" : 367,
              "collection_time_in_millis" : 121855
            }
          }
        }, 

after request:

          "heap_used_in_bytes" : 2642199544,
          "heap_used_percent" : 49,
          "heap_committed_in_bytes" : 5298978816,
          "heap_max_in_bytes" : 5298978816,
          "non_heap_used_in_bytes" : 168800432,
          "non_heap_committed_in_bytes" : 220000256,
          "pools" : {
            "young" : {
              "used_in_bytes" : 151948208,
              "max_in_bytes" : 558432256,
              "peak_used_in_bytes" : 558432256,
              "peak_max_in_bytes" : 558432256
            },
            "survivor" : {
              "used_in_bytes" : 26625440,
              "max_in_bytes" : 69730304,
              "peak_used_in_bytes" : 69730304,
              "peak_max_in_bytes" : 69730304
            },
            "old" : {
              "used_in_bytes" : 2463625896,
              "max_in_bytes" : 4670816256,
              "peak_used_in_bytes" : 4645066960,
              "peak_max_in_bytes" : 4670816256
            }
          }
        },
        "threads" : {
          "count" : 87,
          "peak_count" : 102
        },
        "gc" : {
          "collectors" : {
            "young" : {
              "collection_count" : 25426,
              "collection_time_in_millis" : 3737708
            },
            "old" : {
              "collection_count" : 367,
              "collection_time_in_millis" : 121855
            }
          }
        },  

Is this expected?

jpountz commented 6 years ago

What does the scripted update?

motherhubbard commented 6 years ago

@jpountz Thanks for responding to me.

The script updates the document in a number of places (its not easy for me to share the script) based on the params that are passed to it.

Theres a couple simple updates to the top level fields and the rest are pushes to some of the arrays in the document (which will relate to the nested fields in the mappings). If you need me to convert the script to the obfuscated names I can try and do that.

Sorry for being a little vague, is there something specific you are thinking of?

The script typically runs in 1ms.

DaveCTurner commented 6 years ago

Hi @motherhubbard, I think this question is better suited to the support forums over at https://discuss.elastic.co. We prefer to use Github issues only for bug reports and feature requests, and we think it's more likely this is a question than a bug report. I'm closing this as an issue here, but encourage you to ask this question in the forum instead. Please add a link to the forum discussion below so others can follow the thread. Many thanks.

It will be quite difficult to provide much help without being able to reproduce this behaviour. Please can you try and get things into a state that you can share, including your script and some example data that shows the same problem?