elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

ES 1.4.2 random node disconnect #9212

Closed dragosrosculete closed 8 years ago

dragosrosculete commented 9 years ago

Hey,

I have been having trouble for a while now: I am getting random node disconnects and I cannot explain why. There is no increase in traffic (search or index) when it happens, so it feels completely random to me. I first thought it could be the AWS cloud plugin, so I removed it, switched to unicast, and pointed directly at my nodes' IPs, but that didn't seem to be the problem. I changed the instance type (now m3.2xlarge), added more instances, and made many modifications to the ES YAML config, and still nothing. I changed Oracle Java from 1.7 to 1.8 and switched the CMS collector to G1GC, and still nothing.

I am out of ideas... how can I get more info on what is going on?

Here are the logs I can see from master node and the data node http://pastebin.com/GhKfRkaa

faxm0dem commented 9 years ago

You're saying having larger indices (in my case e.g. weekly or monthly) would help? (I already have 100GB+ indices)

clintongormley commented 9 years ago

@faxm0dem as always, "it depends" :) Your use case, hardware, etc. are all so specific that it is really hard to give good general advice. Have a look at this for a good approach to figuring it out yourself: http://www.elastic.co/guide/en/elasticsearch/guide/current/capacity-planning.html

That said, just a couple of data points which will help:

faxm0dem commented 9 years ago

Thanks @clintongormley for this information. I was afraid that monthly indices would leave me with shards that are too large, since reallocation would take ages over a 1G link, so I'm not sure how to handle this. That being said, some more info: I upgraded to 1.5.1 and it happened again on a 1.5.1 node (the master was still 1.4.2).

faxm0dem commented 9 years ago

Note: my indices.memory.index_buffer_size is set to 30%

faxm0dem commented 9 years ago

Any pointers on how I can grab more details on the Java front, e.g. jstack, heap dumps (jmap), etc.?

I collected the following so far on a deadlocked instance:

Update: while I was playing with the above commands, the node "unlocked" itself after half an hour and came back up with a GC message:

[2015-04-17 12:05:57,407][WARN ][monitor.jvm              ] [ccsvli83] [gc][young][140139][3373] duration [29.5m], collections [1]/[29.5m], total [29.5m]/[31.8m], memory [10.1gb]->[8.7gb]/[19.8gb], all_pools {[young] [1.4gb]->[40.8mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [8.4gb]->[8.5gb]/[18.1gb]}
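
(For readers following along: a minimal sketch of how such JVM diagnostics are typically captured on a stuck node. The process-lookup pattern and output paths are assumptions, not the commands the commenter actually ran, and the tools must run as the Elasticsearch JVM's user or as root.)

# Find the Elasticsearch PID (the bootstrap class name below is the usual one for ES 1.x)
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch | head -n1)

# Thread dump; -F forces it via the attach mechanism if the JVM is unresponsive
jstack "$ES_PID" > /tmp/es-threads.txt 2>&1 || jstack -F "$ES_PID" > /tmp/es-threads.txt 2>&1

# Heap histogram and GC utilisation, sampled every 5 seconds, 10 times
jmap -histo "$ES_PID" > /tmp/es-histo.txt
jstat -gcutil "$ES_PID" 5000 10 > /tmp/es-gcstats.txt
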
clintongormley commented 9 years ago

@faxm0dem Hmmm, the only time I've seen long young gen GCs on a regular basis was with Ubuntu 10.4. It was a kernel bug. I have no idea about Scientific Linux, but you may want to try (a) upgrading your JVM and/or (b) trying a different distro.

faxm0dem commented 9 years ago

I'm hitting the issue again. Strange thing is, running jstack -F $pid will reproducibly unlock elasticsearch. Anyone already seen something like this?

BTW, I'm hitting the issue regardless of the jvm (tried openjdk 1.7, 1.8 and oracle 1.8)

gurvindersingh commented 9 years ago

Yes, and we are doing the same (jstack -F $pid) to bring the node back into the cluster. We have also upgraded to 1.4.5, since it has the mentioned fix, but that does not solve the problem. The main cause seems to be a large number of shards per node.

faxm0dem commented 9 years ago

Well, we are running 1.5.1. I found a few helpful answers on Stack Overflow as to why running jstack would revive Elasticsearch; most of them point to thread starvation and/or the GC. I still don't understand why only three out of nine nodes exhibit the issue.

faxm0dem commented 9 years ago

I downgraded the kernel from 2.6.32-504.12.2.el6 to 2.6.32-431.17.1.el6 and the problem disappeared. One of the nodes running 2.6.32-504.8.1.el6 seems unaffected, so I guess it's a change introduced between 504.8.1 and 504.12.2.

@gurvindersingh @Revan007 @schlitzered what are your kernel versions?
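
(A quick way to answer that, assuming an RPM-based system for the changelog query; the commands are generic rather than specific to this thread:)

uname -r
rpm -q --changelog kernel | head -n 60   # EL6 changelog, excerpted below
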

EL6 kernel changelog:

* Fri Jan 30 2015 Radomir Vrbovsky <rvrbovsk@redhat.com> [2.6.32-504.12.1.el6]
- [fs] splice: perform generic write checks (Eric Sandeen) [1163798 1155900] {CVE-2014-7822}

* Tue Jan 27 2015 Radomir Vrbovsky <rvrbovsk@redhat.com> [2.6.32-504.11.1.el6]
- [virt] kvm: excessive pages un-pinning in kvm_iommu_map error path (Jacob Tanenbaum) [1156520 1156521] {CVE-2014-8369}
- [x86] crypto: Add support for 192 & 256 bit keys to AESNI RFC4106 (Jarod Wilson) [1184332 1176211]
- [block] nvme: Clear QUEUE_FLAG_STACKABLE (David Milburn) [1180555 1155715]
- [net] netfilter: conntrack: disable generic tracking for known protocols (Daniel Borkmann) [1182071 1114697] {CVE-2014-8160}
- [xen] pvhvm: Fix vcpu hotplugging hanging (Vitaly Kuznetsov) [1179343 1164278]
- [xen] pvhvm: Don't point per_cpu(xen_vpcu, 33 and larger) to shared_info (Vitaly Kuznetsov) [1179343 1164278]
- [xen] enable PVHVM VCPU placement when using more than 32 CPUs (Vitaly Kuznetsov) [1179343 1164278]
- [xen] support large numbers of CPUs with vcpu info placement (Vitaly Kuznetsov) [1179343 1164278]

* Thu Jan 22 2015 Radomir Vrbovsky <rvrbovsk@redhat.com> [2.6.32-504.10.1.el6]
- [netdrv] tg3: Change nvram command timeout value to 50ms (Ivan Vecera) [1182903 1176230]

* Thu Jan 08 2015 Radomir Vrbovsky <rvrbovsk@redhat.com> [2.6.32-504.9.1.el6]
- [net] ipv6: increase ip6_rt_max_size to 16384 (Hannes Frederic Sowa) [1177581 1112946]
- [net] ipv6: don't set DST_NOCOUNT for remotely added routes (Hannes Frederic Sowa) [1177581 1112946]
- [net] ipv6: don't count addrconf generated routes against gc limit (Hannes Frederic Sowa) [1177581 1112946]
- [net] ipv6: Don't put artificial limit on routing table size (Hannes Frederic Sowa) [1177581 1112946]
- [scsi] bnx2fc: fix tgt spinlock locking (Maurizio Lombardi) [1179098 1079656]

* Fri Dec 19 2014 Radomir Vrbovsky <rvrbovsk@redhat.com> [2.6.32-504.8.1.el6]
- [crypto] crc32c: Kill pointless CRYPTO_CRC32C_X86_64 option (Jarod Wilson) [1175509 1036212]
- [crypto] testmgr: add larger crc32c test vector to test FPU path in crc32c_intel (Jarod Wilson) [1175509 1036212]
- [crypto] tcrypt: Added speed test in tcrypt for crc32c (Jarod Wilson) [1175509 1036212]
- [crypto] crc32c: Optimize CRC32C calculation with PCLMULQDQ instruction (Jarod Wilson) [1175509 1036212]
- [crypto] crc32c: Rename crc32c-intel.c to crc32c-intel_glue.c (Jarod Wilson) [1175509 1036212]
gurvindersingh commented 9 years ago

I am using kernel 3.16.5.

drax68 commented 9 years ago

Also seeing this bug on Ubuntu Precise 12.04 with kernel 3.2.0-83, ES 1.5.2, Java 1.7.0_80-b15, even with only 2 GB of index data.

faxm0dem commented 9 years ago

Sidenote: jconsole can't connect to the JVM when it happens.

sdklein commented 9 years ago

We attempted to upgrade our Elasticsearch cluster from CentOS 6.5 to CentOS 7.1 and ran into this same problem on Elasticsearch 1.5.2. The provisioning of the instances is puppetized, so I am reasonably confident the only differences were the CentOS version and kernel.

We downgraded back to CentOS 6.5, the errors disappeared, and the cluster became stable again.

Some details:

- Elasticsearch version: 1.5.2
- CentOS 6 kernel: 2.6.32-504.1.3.el6.x86_64 (stable for us)
- CentOS 7 kernel: 3.10.0-229.4.2.el7.x86_64 (unstable for us)

faxm0dem commented 9 years ago

@sdklein thanks a lot, this really narrows things down.

So the breaking change seems to have been introduced after (and excluding) 2.6.32-504.8.1.

diranged commented 9 years ago

I believe we're seeing the same issue. We're using Ubuntu 14.04, ES 1.4.4, and the AWS plugin. We get these random failures every few weeks and have to restart our cluster:

Caused by: org.elasticsearch.transport.NodeNotConnectedException: [prod-flume-vpc-es-useast1-58-i-1ff961e3-flume-elasticsearch-production_vpc-useast1][inet[/10.48.49.100:9300]] Node not connected

Note that this cluster previously ran the 0.90 release with the ZooKeeper plugin, and these failures did not happen. It is a pretty large cluster (~30 days of logs * 10 shards * 2x redundancy, with ~200-250 GB of logs per day), but I don't think the machines are overtaxed.

faxm0dem commented 9 years ago

Any chance this might be related to https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64 ?

gurvindersingh commented 9 years ago

Well, it looks like it might be. We will test it on our cluster in the next couple of days and report back with the status.

wilb commented 9 years ago

I have a suspicion that I am seeing this issue on a Debian Wheezy based ES 1.4.5 cluster running on AWS with the AWS plugin. Wheezy's kernel is 3.2.0-4-amd64, though, so it predates the futex issue.

I've not done enough diagnosis to be sure, but I'm seeing a lot of the symptoms described above. I'll report back with any further findings.

On my master, a GET to _nodes/_local/stats is currently taking between 15 and 25 seconds to return.
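
(A rough way to reproduce that measurement with curl; the host and port are assumptions:)

curl -s -o /dev/null -w 'time_total: %{time_total}s\n' 'http://localhost:9200/_nodes/_local/stats'
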

gurvindersingh commented 9 years ago

At least for us, moving to the 3.18 kernel branch seems to resolve it. We had been seeing at least one random node disconnect per week, and it has now been 8 days without any. Before I can fully confirm I would like to wait maybe two more weeks, but I can already say that updating the kernel to the latest stable branch might help.

pires commented 9 years ago

I was having this issue too until I upgraded kernel. Running Ubuntu 14.04 here.

diranged commented 9 years ago

What kernel did you upgrade to?


holgr commented 9 years ago

Is anyone here using CentOS 7.1(.1503) and seeing this issue as well? What kernel works for you?

dragosrosculete commented 9 years ago

Wow, I can't believe this topic is still going on. I found the problem months ago (no thanks to the ES guys, though). I tried everything and finally received help from the Amazon team, who pointed me to a networking bug at the kernel level.

After investigating, I found out that this bug was also the reason for my node disconnects.

I will post the link to the kernel bug and a workaround later today.

faxm0dem commented 9 years ago

@Revan007 a workaround would be awesome

dragosrosculete commented 9 years ago

OK, here is the bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1317811. To check whether you are hitting it, look in the syslog for the error: xen_netfront: xennet: skb rides the rocket: 19 slots. It was fixed in kernel 3.13.0-46.75, so if you can update your kernel, do it. This doesn't happen on enhanced-networking instances like the VPC r, c, and i types, since they use the "ixgbevf" driver, unlike the PV instances.

If you can't upgrade your kernel right now, you can run: sudo ethtool -K eth0 sg off. This disables "scatter-gather" and may decrease your network performance (it hasn't in my case).
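
(A sketch of checking for the symptom and applying the workaround described above; the log path and interface name are assumptions, and the ethtool change does not persist across reboots:)

# Look for the xen_netfront error (syslog path varies by distribution; dmesg works too)
grep 'skb rides the rocket' /var/log/syslog
dmesg | grep 'rides the rocket'

# Temporarily disable scatter-gather on the affected interface
sudo ethtool -K eth0 sg off

# Confirm the setting took effect
ethtool -k eth0 | grep -i scatter
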

Good luck!

faxm0dem commented 9 years ago

Thanks a lot @Revan007. While this may have been the reason in your case, I think it's unrelated to our symptoms: we use bare-metal nodes and the mentioned kernel message doesn't show up. I think these are at least two different kernel bugs.

wilb commented 9 years ago

Agree. Given the vagueness of the issue I think there could be a few bugs at play and we may be experiencing different ones that manifest themselves in a similar way. Or I could just have a badly configured cluster... :stuck_out_tongue_winking_eye:

faxm0dem commented 9 years ago

@wilb the issue's description might be vague and the various comments may have different causes, but I know for a fact that our issue is worked around by using a specific kernel.

wilb commented 9 years ago

@faxm0dem I don't want you to think I was having a dig there - I just meant that this feels like one of those issues that could have multiple root causes which all manifest themselves in similar ways.

roder commented 9 years ago

I am also having this issue and have submitted a support ticket, but have yet to resolve it.

We're running ES 1.4.4 on Ubuntu 14.04 on GCE. We did not have the "rides the rocket" log lines in syslog and have tested on 3.16 and 3.19 kernels. No resolution yet.

diranged commented 9 years ago

We upgraded to Ubuntu 14.04.02, and saw the same issue again after just about a week of runtime. :(

wilb commented 9 years ago

I got to the bottom of what my particular issue was. Not sure it's likely to help others, but worth detailing...

I use Chef to configure our cluster, and one of the things that gets configured is the S3 repository for backups. Unfortunately I wasn't doing this in an idempotent manner, which resulted in the repository registration call being made every 30 minutes. This wasn't causing any problems initially, but the call verifies the repository by default (this can be disabled with a query-string parameter, which I wasn't using).

Over time this appears to have become a more expensive operation (I assume because it's an active backup repo full of data), and it looks to have been causing hosts to briefly lock up and temporarily drop out of the cluster while the verify was taking place. Disabling the verify made things behave in a much saner manner, so I rewrote the logic to actually be idempotent.
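
(For reference, a hedged sketch of the kind of repository registration involved, assuming the cloud-aws plugin's s3 repository type and a hypothetical bucket; ?verify=false is the query-string parameter referred to above, and verification can still be triggered explicitly when wanted:)

curl -XPUT 'http://localhost:9200/_snapshot/my_s3_backups?verify=false' -d '{
  "type": "s3",
  "settings": {
    "bucket": "my-backup-bucket",
    "region": "us-east-1"
  }
}'

# Run the (expensive) verification on demand instead of on every registration
curl -XPOST 'http://localhost:9200/_snapshot/my_s3_backups/_verify'
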

On top of this, there appears to be a point of degradation from which there is no return unless you restart your master. As I mentioned above, calls to the master's local node stats endpoint start to take 20-30 seconds to return, and at that point the cluster generally struggles along with many more disconnects until you take action.

gurvindersingh commented 9 years ago

At least for us, upgrading to the 3.18 kernel fixed the issue. Our cluster has now been running for 3 weeks without any node disconnect, whereas earlier we had a disconnect at least once a week.

It does seem that there are multiple issues which can cause node disconnects. Maybe the improvement in 1.6.0 that makes cluster state changes async will help.

tlrx commented 9 years ago

@wilb Thanks for letting us know. As far as I can see you are running version 1.4.5 of Elasticsearch, and an issue very similar to yours was reported in #10344 and fixed in #10366 for 1.5+ versions. You may be interested in reading those issues and comparing them with yours.

wilb commented 9 years ago

Thank you - that looks exactly like the issue. I'd held off on upgrading to 1.5.x, but it looks like I should definitely do so.

clintongormley commented 9 years ago

It sounds like we've arrived at the root cause of this problem, so I'm going to close the issue. If anybody disagrees, ping me to reopen.

faxm0dem commented 9 years ago

Not really: @wilb had an unrelated issue

faxm0dem commented 9 years ago

Ping @clintongormley

clintongormley commented 9 years ago

@faxm0dem is the issue not to do with the "scatter-gather" bug? @wilb had an unrelated issue, which he has resolved. What else is left here?

faxm0dem commented 9 years ago

Nope, no kernel messages pointing in that direction. It's definitely something that changed between the kernel versions I mentioned above.

clintongormley commented 9 years ago

OK, reopening

bradmac commented 9 years ago

Just FYI, we're seeing this pretty continuously with 1.6.0 on Ubuntu 14.04. Going to try some of the suggestions above.

faxm0dem commented 9 years ago

@bradmac downgrading the kernel should work if jstack unfreezes ES

bradmac commented 9 years ago

What I'm seeing is that the server node is not stuck; most requests to it are processed successfully. It's just that some of the threads in the client-side thread pool are apparently unable to connect to it. This is under lightly loaded conditions.

faxm0dem commented 9 years ago

@bradmac different issue I'd say

diranged commented 9 years ago

Ugh, we just had this issue happen twice in 3 hours. The first time it happened to one node in our cluster, the second time to a different node. In both cases, 1 of the 4 nodes went into a garbage-collection loop. When this happened, the other 3 nodes were disconnected from the GC-looping node and the cluster became almost entirely unresponsive.

Node 75's GC loop -- this stopped on its own

Jul  8 19:19:26.717305 prod-flume-es-useast1-75-i-bc086b43.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-75-i-bc086b43-flume-elasticsearch-production_vpc-useast1] [gc][old][1039423][81040] duration [31.2s], collections [1]/[31.5s], total [31.2s]/[5.2h], memory [89.3gb]->[87.1gb]/[89.9gb], all_pools {[young] [841.6mb]->[51.4mb]/[865.3mb]}{[survivor] [108.1mb]->[0b]/[108.1mb]}{[old] [88.4gb]->[87gb]/[88.9gb]}
Jul  8 19:20:21.630203 prod-flume-es-useast1-75-i-bc086b43.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-75-i-bc086b43-flume-elasticsearch-production_vpc-useast1] [gc][old][1039448][81042] duration [29.8s], collections [1]/[30.9s], total [29.8s]/[5.2h], memory [89.4gb]->[87.3gb]/[89.9gb], all_pools {[young] [710.8mb]->[135.1mb]/[865.3mb]}{[survivor] [108.1mb]->[0b]/[108.1mb]}{[old] [88.6gb]->[87.1gb]/[88.9gb]}
Jul  8 19:21:30.190669 prod-flume-es-useast1-75-i-bc086b43.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-75-i-bc086b43-flume-elasticsearch-production_vpc-useast1] [gc][old][1039485][81044] duration [30.2s], collections [1]/[31s], total [30.2s]/[5.2h], memory [89.4gb]->[87.6gb]/[89.9gb], all_pools {[young] [432.7mb]->[161.5mb]/[865.3mb]}{[survivor] [108.1mb]->[0b]/[108.1mb]}{[old] [88.9gb]->[87.4gb]/[88.9gb]}
Jul  8 19:22:35.299828 prod-flume-es-useast1-75-i-bc086b43.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-75-i-bc086b43-flume-elasticsearch-production_vpc-useast1] [gc][old][1039520][81046] duration [30s], collections [1]/[30.6s], total [30s]/[5.2h], memory [89.7gb]->[87.5gb]/[89.9gb], all_pools {[young] [721.2mb]->[93.4mb]/[865.3mb]}{[survivor] [108.1mb]->[0b]/[108.1mb]}{[old] [88.9gb]->[87.4gb]/[88.9gb]}
Jul  8 19:23:46.081212 prod-flume-es-useast1-75-i-bc086b43.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-75-i-bc086b43-flume-elasticsearch-production_vpc-useast1] [gc][old][1039559][81049] duration [31.1s], collections [2]/[31.9s], total [31.1s]/[5.2h], memory [89.4gb]->[87.5gb]/[89.9gb], all_pools {[young] [503.9mb]->[129.7mb]/[865.3mb]}{[survivor] [108.1mb]->[0b]/[108.1mb]}{[old] [88.8gb]->[87.4gb]/[88.9gb]}
Jul  8 19:32:25.870668 prod-flume-es-useast1-75-i-bc086b43.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-75-i-bc086b43-flume-elasticsearch-production_vpc-useast1] [gc][old][1040045][81115] duration [30.6s], collections [2]/[31.3s], total [30.6s]/[5.2h], memory [89.8gb]->[87.5gb]/[89.9gb], all_pools {[young] [808mb]->[52.8mb]/[865.3mb]}{[survivor] [108.1mb]->[0b]/[108.1mb]}{[old] [88.9gb]->[87.5gb]/[88.9gb]}
Jul  8 19:33:17.784320 prod-flume-es-useast1-75-i-bc086b43.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-75-i-bc086b43-flume-elasticsearch-production_vpc-useast1] [gc][old][1040067][81117] duration [30.3s], collections [2]/[30.7s], total [30.3s]/[5.2h], memory [89.6gb]->[88gb]/[89.9gb], all_pools {[young] [649.7mb]->[264.2mb]/[865.3mb]}{[survivor] [108.1mb]->[0b]/[108.1mb]}{[old] [88.8gb]->[87.7gb]/[88.9gb]}

Node 75's GC loop an hour later -- went on for an hour or more

Jul  8 20:28:19.741881 prod-flume-es-useast1-75-i-bc086b43.XXX  ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-75-i-bc086b43-flume-elasticsearch-production_vpc-useast1] [gc][old][1043322][81546] duration [29.8s], collections [1]/[30.4s], total [29.8s]/[5.3h], memory [88.8gb]->[88gb]/[89.9gb], all_pools {[young] [811.3mb]->[14.1mb]/[865.3mb]}{[survivor] [108.1mb]->[0b]/[108.1mb]}{[old] [87.9gb]->[88gb]/[88.9gb]}
Jul  8 21:08:07.168273 prod-flume-es-useast1-75-i-bc086b43.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-75-i-bc086b43-flume-elasticsearch-production_vpc-useast1] [gc][old][1045672][81818] duration [29.8s], collections [1]/[30.2s], total [29.8s]/[5.3h], memory [89.6gb]->[88.5gb]/[89.9gb], all_pools {[young] [704.2mb]->[45.8mb]/[865.3mb]}{[survivor] [108.1mb]->[0b]/[108.1mb]}{[old] [88.8gb]->[88.5gb]/[88.9gb]}
Jul  8 21:08:41.352528 prod-flume-es-useast1-75-i-bc086b43.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-75-i-bc086b43-flume-elasticsearch-production_vpc-useast1] [gc][old][1045679][81819] duration [27.2s], collections [1]/[28.1s], total [27.2s]/[5.3h], memory [89.3gb]->[88.6gb]/[89.9gb], all_pools {[young] [463.8mb]->[56.9mb]/[865.3mb]}{[survivor] [108.1mb]->[0b]/[108.1mb]}{[old] [88.7gb]->[88.5gb]/[88.9gb]}
Jul  8 21:09:12.617369 prod-flume-es-useast1-75-i-bc086b43.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-75-i-bc086b43-flume-elasticsearch-production_vpc-useast1] [gc][old][1045685][81820] duration [25.5s], collections [1]/[25.7s], total [25.5s]/[5.3h], memory [89.6gb]->[88.7gb]/[89.9gb], all_pools {[young] [723.7mb]->[6.6mb]/[865.3mb]}{[survivor] [108.1mb]->[0b]/[108.1mb]}{[old] [88.8gb]->[88.7gb]/[88.9gb]}
Jul  8 21:09:48.365129 prod-flume-es-useast1-75-i-bc086b43.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-75-i-bc086b43-flume-elasticsearch-production_vpc-useast1] [gc][old][1045692][81821] duration [28.6s], collections [1]/[29.7s], total [28.6s]/[5.3h], memory [89.4gb]->[88.9gb]/[89.9gb], all_pools {[young] [416.9mb]->[164.5mb]/[865.3mb]}{[survivor] [108.1mb]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.7gb]/[88.9gb]}
Jul  8 21:10:20.292452 prod-flume-es-useast1-75-i-bc086b43.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-75-i-bc086b43-flume-elasticsearch-production_vpc-useast1] [gc][old][1045698][81822] duration [26.4s], collections [1]/[26.9s], total [26.4s]/[5.3h], memory [89.5gb]->[88.9gb]/[89.9gb], all_pools {[young] [519.8mb]->[124.1mb]/[865.3mb]}{[survivor] [108.1mb]->[0b]/[108.1mb]}{[old] [88.8gb]->[88.8gb]/[88.9gb]}
Jul  8 21:10:52.069332 prod-flume-es-useast1-75-i-bc086b43.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-75-i-bc086b43-flume-elasticsearch-production_vpc-useast1] [gc][old][1045703][81823] duration [27.4s], collections [1]/[27.6s], total [27.4s]/[5.3h], memory [89.7gb]->[88.8gb]/[89.9gb], all_pools {[young] [719mb]->[3.9mb]/[865.3mb]}{[survivor] [108.1mb]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.8gb]/[88.9gb]}

... goes on for over an hour

Node 77's GC loop -- stopped when we restarted ES

Jul  8 23:16:54.165901 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053381][74364] duration [32.1s], collections [1]/[33.3s], total [32.1s]/[4.4h], memory [88.9gb]->[88.6gb]/[89.9gb], all_pools {[young] [21.9mb]->[7.5mb]/[865.3mb]}{[survivor] [108.1mb]->[0b]/[108.1mb]}{[old] [88.8gb]->[88.6gb]/[88.9gb]}
Jul  8 23:20:23.469300 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053557][74388] duration [31.1s], collections [2]/[32.3s], total [31.1s]/[4.4h], memory [89gb]->[89gb]/[89.9gb], all_pools {[young] [29.7mb]->[186.6mb]/[865.3mb]}{[survivor] [106mb]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.8gb]/[88.9gb]}
Jul  8 23:20:51.901650 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053560][74389] duration [25.8s], collections [1]/[26.4s], total [25.8s]/[4.4h], memory [89.5gb]->[88.9gb]/[89.9gb], all_pools {[young] [469mb]->[9.3mb]/[865.3mb]}{[survivor] [107.1mb]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:21:14.087709 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053561][74390] duration [21.6s], collections [1]/[22.1s], total [21.6s]/[4.4h], memory [88.9gb]->[88.9gb]/[89.9gb], all_pools {[young] [9.3mb]->[893.1kb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:21:38.369926 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053563][74391] duration [23.1s], collections [1]/[23.2s], total [23.1s]/[4.4h], memory [89.7gb]->[89gb]/[89.9gb], all_pools {[young] [780.3mb]->[28.3mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:22:03.804087 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053565][74392] duration [24.1s], collections [1]/[24.4s], total [24.1s]/[4.4h], memory [89.8gb]->[89gb]/[89.9gb], all_pools {[young] [865.3mb]->[59.4mb]/[865.3mb]}{[survivor] [2.2mb]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:22:28.031757 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053567][74393] duration [23.1s], collections [1]/[23.2s], total [23.1s]/[4.4h], memory [89.9gb]->[89gb]/[89.9gb], all_pools {[young] [865.3mb]->[33.2mb]/[865.3mb]}{[survivor] [103.6mb]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:22:53.483730 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053569][74394] duration [24s], collections [1]/[24.4s], total [24s]/[4.4h], memory [89.8gb]->[89gb]/[89.9gb], all_pools {[young] [854.5mb]->[37.3mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:23:19.178259 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053571][74395] duration [24.6s], collections [1]/[24.6s], total [24.6s]/[4.4h], memory [89.9gb]->[89gb]/[89.9gb], all_pools {[young] [865.3mb]->[32.7mb]/[865.3mb]}{[survivor] [83.7mb]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:23:42.444754 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053573][74396] duration [21.8s], collections [1]/[22.2s], total [21.8s]/[4.4h], memory [89.8gb]->[89gb]/[89.9gb], all_pools {[young] [839mb]->[66.6mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:24:07.569372 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053574][74397] duration [24.1s], collections [1]/[25.1s], total [24.1s]/[4.4h], memory [89gb]->[89.1gb]/[89.9gb], all_pools {[young] [66.6mb]->[182.4mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:24:30.145261 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053576][74398] duration [21.4s], collections [1]/[21.5s], total [21.4s]/[4.4h], memory [89.9gb]->[89gb]/[89.9gb], all_pools {[young] [865.3mb]->[34.2mb]/[865.3mb]}{[survivor] [73.5mb]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:24:54.803107 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053578][74399] duration [23.4s], collections [1]/[23.6s], total [23.4s]/[4.5h], memory [89.8gb]->[89gb]/[89.9gb], all_pools {[young] [865.3mb]->[57.8mb]/[865.3mb]}{[survivor] [49.2mb]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:25:17.380932 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053580][74400] duration [21.3s], collections [1]/[21.5s], total [21.3s]/[4.5h], memory [89.8gb]->[89gb]/[89.9gb], all_pools {[young] [865.3mb]->[66.1mb]/[865.3mb]}{[survivor] [42.8mb]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:25:43.662986 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053582][74401] duration [25.2s], collections [1]/[25.2s], total [25.2s]/[4.5h], memory [89.9gb]->[89gb]/[89.9gb], all_pools {[young] [865.3mb]->[82.3mb]/[865.3mb]}{[survivor] [94.4mb]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:26:08.731447 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053583][74402] duration [24s], collections [1]/[25s], total [24s]/[4.5h], memory [89gb]->[89gb]/[89.9gb], all_pools {[young] [82.3mb]->[44.3mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:26:32.904273 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053585][74403] duration [22.8s], collections [1]/[23.1s], total [22.8s]/[4.5h], memory [89.8gb]->[89gb]/[89.9gb], all_pools {[young] [865.3mb]->[55.6mb]/[865.3mb]}{[survivor] [31.9mb]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:26:58.115853 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053587][74404] duration [24.1s], collections [1]/[24.2s], total [24.1s]/[4.5h], memory [89.9gb]->[89.1gb]/[89.9gb], all_pools {[young] [865.3mb]->[118.2mb]/[865.3mb]}{[survivor] [81.7mb]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:27:24.618880 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053589][74405] duration [25.4s], collections [1]/[25.5s], total [25.4s]/[4.5h], memory [89.9gb]->[89gb]/[89.9gb], all_pools {[young] [865.3mb]->[73.2mb]/[865.3mb]}{[survivor] [89.1mb]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:27:48.702525 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053590][74406] duration [23.2s], collections [1]/[24s], total [23.2s]/[4.5h], memory [89gb]->[89gb]/[89.9gb], all_pools {[young] [73.2mb]->[18.3mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:28:14.687780 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053592][74407] duration [24.7s], collections [1]/[24.9s], total [24.7s]/[4.5h], memory [89.8gb]->[89gb]/[89.9gb], all_pools {[young] [865.3mb]->[29.7mb]/[865.3mb]}{[survivor] [21.6mb]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:28:36.859798 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053593][74408] duration [21.3s], collections [1]/[22.1s], total [21.3s]/[4.5h], memory [89gb]->[89gb]/[89.9gb], all_pools {[young] [29.7mb]->[7.8mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:29:01.279278 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053595][74409] duration [22.7s], collections [1]/[23.4s], total [22.7s]/[4.5h], memory [89.6gb]->[89gb]/[89.9gb], all_pools {[young] [713.9mb]->[4.2mb]/[865.3mb]}{[survivor] [0b]->[69.9mb]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:29:25.303766 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053596][74410] duration [23.1s], collections [1]/[24s], total [23.1s]/[4.5h], memory [89gb]->[88.9gb]/[89.9gb], all_pools {[young] [4.2mb]->[1.5mb]/[865.3mb]}{[survivor] [69.9mb]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
Jul  8 23:29:47.080598 prod-flume-es-useast1-77-i-cfbeb230.XXX ElasticSearch: [WARN ][monitor.jvm              ] [prod-flume-es-useast1-77-i-cfbeb230-flume-elasticsearch-production_vpc-useast1] [gc][old][1053597][74411] duration [20.8s], collections [1]/[21.7s], total [20.8s]/[4.5h], memory [88.9gb]->[88.9gb]/[89.9gb], all_pools {[young] [1.5mb]->[6.1mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [88.9gb]->[88.9gb]/[88.9gb]}
clintongormley commented 9 years ago

@diranged that's not surprising: you have enormous heaps (90 GB!) and they're full. You need some tuning advice; I suggest asking about that on the forum: http://discuss.elastic.co/
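
(As general background rather than advice specific to this cluster: Elasticsearch guidance at the time was to keep the heap at or below roughly 31 GB so the JVM keeps using compressed object pointers, leaving the rest of RAM to the filesystem cache. A sketch, assuming the stock packaging's ES_HEAP_SIZE variable:)

# /etc/default/elasticsearch (Debian/Ubuntu) or /etc/sysconfig/elasticsearch (RHEL/CentOS)
# Keep the heap at or below ~31g; give the remaining RAM to the OS page cache.
ES_HEAP_SIZE=30g
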

pavanIntel commented 9 years ago

If anybody is facing an issue with the following error:

exception caught on transport layer [[id: 0x32cd6d09, /172.31.6.91:38524 => /172.31.18.78:9300]], closing connectionjava.io.StreamCorruptedException: invalid internal transport message format, got (50,4f,53,54)

Port 9300 is used for inter-node communication and speaks an internal binary protocol, so you can't use it from a browser. To use port 9300 you have to use the Java API, either the node client or the transport client, both of which understand the internal binary protocol.

From the browser you should only use the 9200 port, which is the one that exposes the REST API.

If you are using an Amazon load balancer to access your cluster, you should change the instance port in the listener settings to 9200.
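
(A quick illustration of the difference; the host name is hypothetical. The REST port answers HTTP, while the transport port only speaks the binary protocol, so an HTTP request against it triggers the StreamCorruptedException above - the bytes 50,4f,53,54 are ASCII for "POST".)

# REST API on port 9200: returns node/cluster info as JSON
curl http://es-node.example.com:9200/

# Transport port 9300: the node logs the StreamCorruptedException instead of answering
curl -XPOST http://es-node.example.com:9300/ -d '{}'
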

faxm0dem commented 9 years ago

@pavanIntel and how is this relevant?