elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

ES 1.4.2 random node disconnect #9212

Closed dragosrosculete closed 8 years ago

dragosrosculete commented 9 years ago

Hey,

I have been having trouble for a while. I am getting random node disconnects and I cannot explain why. There is no increase in traffic (search or index) when this happens; it feels completely random to me. I first thought it could be the AWS cloud plugin, so I removed it, switched to unicast, and pointed directly at my nodes' IPs, but that didn't seem to be the problem. I changed the instance type (now m3.2xlarge), added more instances, made many modifications to the ES yml config, and still nothing. I changed Oracle Java from 1.7 to 1.8, switched the CMS collector to G1GC, and still nothing.

I am out of ideas... how can I get more info on what is going on?

Here are the logs I can see from the master node and the data node: http://pastebin.com/GhKfRkaa

markwalkom commented 9 years ago

This may be a more suitable question for the mailing list(?).

How big is your cluster; node count, index count, shards and replicas, total data size, heap size, ES version? What is your config?
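
If it helps in gathering those numbers, the _cat APIs show most of them in one place (a rough sketch, assuming the default HTTP port; adjust the host as needed):

curl 'localhost:9200/_cat/health?v'    # cluster status, node and shard counts
curl 'localhost:9200/_cat/indices?v'   # per-index shard/replica counts, doc counts and sizes
curl 'localhost:9200/_cat/nodes?v'     # per-node heap/RAM usage and roles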

dragosrosculete commented 9 years ago

Right now:
- 6 m3.2xlarge instances: 1 master, 5 data nodes
- 414 indices (one index per day), 7372 shards total
- 9 shards, 1 replica per index
- 208 million documents, 430 GB
- 15 GB heap allocated per node
- ES 1.4.2

Current yml config here : http://pastebin.com/Nmdr7F6J

TinLe commented 9 years ago

That's a lot of shards. You might be running into resource limits. Check your ulimit on filehandles.
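
One way to check what the running process actually got, rather than the current shell's ulimit (a sketch; the pgrep pattern is an assumption about how the ES process shows up on your boxes):

# effective limits of the running Elasticsearch JVM
cat /proc/$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)/limits | grep 'open files'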

dragosrosculete commented 9 years ago

The file limit is not the problem; it is set to:
elasticsearch - nofile 65535
elasticsearch - memlock unlimited

I have another cluster on an older ES version, with over 9000 shards on 3 nodes, and my nodes don't get random disconnects there.

markwalkom commented 9 years ago

Can you try closing some old indices for a while and seeing if it helps?

dragosrosculete commented 9 years ago

Ok, I will try that with 1/3 of my total shards and see what happens. I will do it Monday, so have a nice weekend till then :)

bleskes commented 9 years ago

@Revan007 did you see any long GCs on the node?

Can you set the transport.netty log to DEBUG (we may see more there as to why it disconnected)? It would also be great if you could post node stats + info (_nodes and _nodes/stats) so we can get more info about the resources of the nodes.
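
Something like this would dump both to files you can paste (a sketch, assuming the default HTTP port on one of the nodes):

curl -s 'localhost:9200/_nodes?pretty'       > nodes_info.json
curl -s 'localhost:9200/_nodes/stats?pretty' > nodes_stats.json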

rangagopalan commented 9 years ago

Hi,

I am also seeing the same issue on our 3 node cluster - I have posted a lot of details in the elasticsearch users group here: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/elasticsearch/fQkhRJ6Md9c

I've messed around with lowering tcp keep alive, setting the fd.ping_timeout etc. and the issue is still happening.

The root cause seems to be a Netty-level timeout:

[2015-01-08 17:32:18,216][TRACE][transport.netty ] [es1] close connection exception caught on transport layer [[id: 0xc4e4b9a1, /10.152.16.37:59038 => /10.109.172.201:9300]], disconnecting from relevant node
java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

The last change I tried was to increase the fd.ping_timeout by setting the following last night; the issue is still happening frequently and causing failures on the client side in our application.

discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 10

dragosrosculete commented 9 years ago

How do I enable transport.netty debug logging?

I have also read this https://github.com/foundit/elasticsearch-transport-module - "Recommended tweaks to existing settings" - how can I modify these values in ES?

rangagopalan commented 9 years ago

You can do something like the following. (The following works - I wasn't sure whether logger.transport or logger.org.elasticsearch.transport was the right one, so I tried both, and this works.)

curl -XPUT :9200/_cluster/settings -d '
{
  "transient" : {
    "logger.transport" : "TRACE",
    "logger.org.elasticsearch.transport" : "TRACE"
  }
}'

dragosrosculete commented 9 years ago

Thank you @rangagopalan .

I guess I will wait for another node failure to see the results. Here is the _nodes info: http://pastebin.com/LaThtSBV

and here is the _nodes/stats info http://pastebin.com/mq0CuVM2

dragosrosculete commented 9 years ago

My node failed... these are the logs, in addition to the ones posted before:

http://pastebin.com/BkfXUw7K

rangagopalan commented 9 years ago

@Revan007 and @bleskes - This issue seemed to be faced by others as well - See https://groups.google.com/forum/#!searchin/elasticsearch/ReceiveTimeoutTransportException/elasticsearch/EvDNO_vALNg/mVzurNj1KRAJ

The solution there was to reduce the shards/indices/replicas count, since that might help the cluster:monitor/nodes/stats API return within the hard-coded timeout of 15 seconds (assuming that is the cause of the connection failures).

I am trying it here on my 3-node cluster (we recently switched to time-based indexes and at least temporarily for testing I was able to close out a bunch of older time-based indices to reduce the total shards+replicas I have from 1200 to about 450). I will post again on whether this helped eliminate / reduce the timeouts or not in a few hours.

If this works, perhaps we can also request an enhancement to make this 15 second timeout configurable for use in clusters where there are a larger number of shards / indices. (I believe the hard-coded limit is in the following code - but I could be mistaken - https://github.com/elasticsearch/elasticsearch/blob/3712d979519db5453bea49c34642a391c51d88b3/src/main/java/org/elasticsearch/cluster/InternalClusterInfoService.java)

bleskes commented 9 years ago

@Revan007 - it seems something is causing new-elastic-data3 to shut down, which triggered the disconnect - I assume this is not what you mean?

2015-01-09 22:55:06,608][INFO ][node                     ] [new-elastic-data3] stopped
[2015-01-09 22:55:06,609][INFO ][node                     ] [new-elastic-data3] closing ...
[2015-01-09 22:55:07,310][INFO ][node                     ] [new-elastic-data3] closed
[2015-01-09 22:55:06,599][TRACE][transport.netty          ] [new-elastic-master] disconnected from [[new-elastic-data3][IGz2I5e4ToSho_y6Le6cLA][new-elastic-data3][inet[new-elastic-data3/10.33.181.140:9300]]{master=false}], channel closed even

@rangagopalan these stats should be very fast (pending some old solved bugs - which version are you on?). Any reason why you point at the InternalClusterInfoService, which is not used for client calls but rather for the disk threshold allocation decider? Do you have issues there?

rangagopalan commented 9 years ago

Am using 1.4.2 - more details of my env are in the post here: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/elasticsearch/fQkhRJ6Md9c

No, I wasn't using the internal class in any way - I was just theorizing that, since the timeout was seen in the nodes/stats call, perhaps this 15-second timeout set in the internal cluster info service class was applicable. The test I am trying is to reduce the time for any cluster statistics calls by reducing the number of indices/replicas/shards (what I am doing may not be applicable if the stats APIs always return quickly and aren't dependent on the number of indexes). I am trying out what was posted in the other link I referred to, to see if the timeouts / disconnects stop when I reduce the total number of indexes/shards.

dragosrosculete commented 9 years ago

@bleskes That was when I did a manual restart of that node. If I let it recover on its own, it takes like 10 minutes.

Here is the data node log when I let it recover on its own http://pastebin.com/wwyvnGJT

bleskes commented 9 years ago

@Revan007 what does take 10 minutes exactly? I want to make sure I understand the sequence of events.

For what it's worth, it may be that these are genuine network issues. The last error in your logs shows a socket read timeout:

[2015-01-10 16:52:37,969][TRACE][transport.netty          ] [new-elastic-data4] close connection exception caught on transport layer [[id: 0x45972790, /10.97.131.39:35885 => /10.193.3.95:9300]], disconnecting from relevant node
java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
dragosrosculete commented 9 years ago

Hmm... it has been almost 20 hours now and no node disconnect so far. I am not sure if my change fixed it. I will wait a little longer till I can say for sure .

rangagopalan commented 9 years ago

An update - it's been about 18 hours since I closed a bunch of older indexes to reduce the total number of shards to below 500 from about 1100. No timeouts since then.

dragosrosculete commented 9 years ago

Ok, it did happen again, but after 22 hours this time. The change that I added yesterday:

/sbin/sysctl -w net.ipv4.tcp_keepalive_time=200 net.ipv4.tcp_keepalive_intvl=200 net.ipv4.tcp_keepalive_probes=5
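
(Side note in case anyone copies that command: sysctl -w does not survive a reboot. Persisting the values would look roughly like this - a sketch:)

cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.tcp_keepalive_time = 200
net.ipv4.tcp_keepalive_intvl = 200
net.ipv4.tcp_keepalive_probes = 5
EOF
sysctl -p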

It took ~17 minutes to disconnect the node, go to yellow state, and recover. That means that for 17 minutes, when I tried to run a search query against the master, it wouldn't give me any response.

Here is the complete log from Master and Data Node : http://pastebin.com/3twDqdxG

dragosrosculete commented 9 years ago

Damn, I have received 2 more node timeouts since, so I guess I was just lucky before that it took more time.

@rangagopalan still going ok for you after reducing the number of open shards?

I don't consider this a solution because I am using 1 index per day with 9 shards and I have months of data; I will be over 3000 shards whatever I do...

TinLe commented 9 years ago

I used to run into similar issues until I reduced the number of active shards. I was using 15 shards per day, and am up to 30 shards per day now as I increased my nodes. The way to reduce the total number of shards is to close off old indices. I only leave two weeks' worth open; the rest I close unless my users request access to them.
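
Closing them is just the close index API - roughly like this (a sketch; logstash-2014.12.* is only an example pattern, adjust it to your index naming, and if wildcards are blocked for destructive actions on your cluster you may need to list the indices explicitly):

# close all indices for December 2014, for example
curl -XPOST 'localhost:9200/logstash-2014.12.*/_close'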


rangagopalan commented 9 years ago

@Revan007 - Yeah - things look fine here since the reduction of indexes/shards - no timeouts (going for more than 26 hours now).

bleskes commented 9 years ago

@Revan007 @rangagopalan there are two things at play here - timeouts on indices / node stats, and network-level disconnects. They are two different things.

On a slow file system, indices stats (and to a lesser degree node stats) can take longer to complete when you have many shards on a node. In extreme cases this can cause a timeout. This is, however, very different from network disconnects, which should not be affected by the number of shards.

That said, it is possible that the shards put too much load on the node, causing it not to respond in time to pings from the master, which in turn causes the node to be thrown off the cluster and the network channels to be closed. That is however not the case here, as you can see the connection is closed because of a socket-level timeout (still possible under load, but less likely):

2015-01-11 14:27:49,966][TRACE][transport.netty          ] [new-elastic-master] close connection exception caught on transport layer [[id: 0x4ceb2806, /10.97.131.39:39658 => new-elastic-data3/10.33.181.140:9300]], disconnecting from relevant node
java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)

@Revan007: what do you mean exactly when you say "I tried to query search the master it wouldn't give me any response"? Did search requests to the master not return?

A couple of side notes:

dragosrosculete commented 9 years ago

@bleskes Hey, I have added a balancer node (master: false, data: false) and query against it now. This is good because now, even if the indices stats problem starts and it takes 17 minutes to disconnect the failed node, queries still work fine. So at least now I have a functional cluster.

I will reduce the number of shards per index to 5 soon, but I cannot reduce the number of indexes; they have to be per day because of the amount of data. So in 1 year I will still have 365 x 3 = 1095 indices (3 types of indices).

The indices stats timeout of 15000 ms has to be made configurable somehow in a future ES update/fix.
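
A note for anyone else dropping the shard count for future daily indices: an index template is one way to do it (a sketch; the template name and index pattern here are made up and need to match your naming):

curl -XPUT 'localhost:9200/_template/daily_logs' -d '
{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}'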

bleskes commented 9 years ago

@Revan007 I still think there is some confusion here, as the Indices stats API doesn't have a timeout. That 15s timeout reference (which I agree should be configurable) is to internal disk free space monitoring calls issued from the master.

I'm sorry - but it is not clear to me exactly what you mean when you say "takes 17 minutes to d/c the failed node the query still works fine". Node stats / indices stats should never be in the way of search calls - they use another thread pool. I think something else is going on, and it all seems to point at networking issues or general node load - do you monitor these?

I will reduce the number of shards to 5 soon per index but I cannot reduce the number of indexes, they have to be per day because of the amount of data. So in 1 year I will still have 365 x 3 = 1095 indices. ( 3 types of indices)

That's OK - just make sure you need 5 shards - you might be surprised how much you can get out of a single one.

dragosrosculete commented 9 years ago

@bleskes

I'm sorry - but it is not clear to me what you exactly you mean when you say " takes 17 minutes to d/c the failed node the query still works fine." - node stats / indices stats should never be in the way of Search calls - they use another thread pool

Well, before, when I was querying the master node (doing a search, curl -XGET bla bla bla) while a node started to fail:

[2015-01-11 14:15:01,014][DEBUG][action.admin.cluster.node.stats] [new-elastic-master] failed to execute on node [f_K6HsHZQdKaiGsgFMqjRQ] org.elasticsearch.transport.ReceiveTimeoutTransportException: [new-elastic-data3][inet[new-elastic-data3/10.33.181.140:9300]][cluster:monitor/nodes/stats[n]] request_id [135582894] timed out after [15000ms]

I would get no result; nothing would return. It takes like 17 minutes for the master to start spitting tons of errors and disconnect that node; then it starts to recover the shards once the node is reconnected, and I could query it and receive results again.

Now I am querying the balancer node, and even if a node starts to fail, at least I am able to query during that time.

That's OK - just make sure you need 5 shards - you might be surprised how much you can get out of a single one.

So because I have 5 nodes, I should preferably have 5 shards, 1 per node. Should I understand that if I now start another 4 nodes and have 9 nodes, with 1 shard per node, I should not see those disconnects anymore?

bleskes commented 9 years ago

It really sounds to me like there are some networking issues causing both symptoms. Can you try monitoring that outside of ES? If you don't have any monitoring infra in place, maybe start a ping process or something between the nodes.
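
A minimal sketch of such a check (the hostname data3.internal and the log path are just stand-ins for one of the other nodes):

# log a timestamped ping result every 5 seconds so gaps line up with the ES logs
while true; do
  echo "$(date -u '+%F %T') $(ping -c 1 -W 2 data3.internal | grep 'time=' || echo 'PING FAILED')"
  sleep 5
done >> /var/log/es-ping-data3.log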

So because I have 5 nodes I should preferably have 5 shards, 1 per node.

Remember you have replicas and also more indices (one per day) that should be spread around. I would try 1 shard and see if you have enough indexing / search capacity based on your data volumes. That will also give you an idea of how much a shard can take, and based on that you can decide how many shards you need.

Should I understand that If I now start another 4 nodes and have 9 nodes, 1 shard per node I should not see those disconnects anymore ?

I still don't see any correlation between the number of shards and disconnects.

rangagopalan commented 9 years ago

Just an update confirming that there is definitely some kind of problem related to the number of shards per node -

We had a separate testing system that was working fine - a single node that had about 647 shards/replicas (set up with 1 replica, so there were 647 unassigned shards). After a short while I started seeing the same timeouts and the nodes disconnected/reconnected:

[2015-01-16 19:34:55,117][DEBUG][action.admin.cluster.node.stats] [es2] failed to execute on node [wZDxvQoKSP2z3N7xGoiYew]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [es3][inet[/10.169.231.15:9300]][cluster:monitor/nodes/stats[n]] request_id [109468] timed out after [15000ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
[2015-01-16 19:37:25,209][DEBUG][action.admin.cluster.node.stats] [es2] failed to execute on node [wZDxvQoKSP2z3N7xGoiYew]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [es3][inet[/10.169.231.15:9300]][cluster:monitor/nodes/stats[n]] request_id [111394] timed out after [15000ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

I believe (as suggested here too: https://groups.google.com/d/msg/elasticsearch/EvDNO_vALNg/BPK5yYSUFeQJ) that there is some kind of problem within Elasticsearch node communications/monitoring when there is a multi-node cluster and the number of shards/replicas per node goes above a certain number. From our experience we can say that 647 shards/replicas per node surely causes it (on our testing server), and on the production cluster I believe we saw the issue at about 400 or 450 shards/replicas per node.

I think at the very least the Elasticsearch documentation should be updated to provide recommendations/guidelines on cluster sizing to avoid this kind of issue.

bleskes commented 9 years ago

@rangagopalan sorry for the late response, I was travelling

a short while I started seeing the same timeouts and the nodes disconnected/reconnected

The log you pasted only shows the timeouts, so I presume this is what you mean.

From our experience we can say that 647 shards/replicas per node surely causes it (on our testing server)

The timeout suggests the node stats take more than 15 seconds to complete. The stats call walks all the shards on the node and collects information about them (like number of docs etc.). In theory this does mean that there is an upper bound to the number of shards it can handle in 15s, but that bound is high under normal circumstances. It is still interesting that things are so slow in your instance. What kind of disks are you using, and just to confirm, which ES version are you on?

rangagopalan commented 9 years ago

We are using 1.4.2 - We have this deployed on AWS - the nodes are in different availability zones in the US east region - and we are using 4x1TB EBS disks in RAID-0 arrays on each node

bleskes commented 9 years ago

@rangagopalan I see. It might be that EBS is slower at times. To exclude the network component, can you try running GET _nodes/_local/stats periodically on each node and measure how long it takes? It will not time out as it is local and doesn't have a timeout.
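
Something like this loop would do for the measurement (a sketch, assuming curl and the default HTTP port on each node):

# print how long the local node stats call takes, once a minute
while true; do
  echo "$(date -u '+%F %T') $(curl -s -o /dev/null -w '%{time_total}s' 'localhost:9200/_nodes/_local/stats')"
  sleep 60
done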

Also - just to be extra clear on this - these timeouts come from the internal disk usage collection of the Disk Threshold Allocator, which is in charge of making sure shards are not allocated to nodes that don't have disk space for them. A hiccup in this check is not a big deal; it will just pick up again 30s later.
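
For reference, the collection interval itself is a cluster setting; something like the following should widen it (a sketch - the setting name cluster.info.update.interval is the one used in the disk-based allocation docs, and I have not verified that it is dynamically updatable on 1.4.2):

# how often the master collects the disk usage info whose timeouts show up above (default 30s)
curl -XPUT 'localhost:9200/_cluster/settings' -d '
{
  "transient": {
    "cluster.info.update.interval": "60s"
  }
}'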

rangagopalan commented 9 years ago

Hi @bleskes - Looks like the local _nodes/_local/stats is very fast - milliseconds - when I run it on any of the nodes.

I suspect there is some other issue that is causing contention / blocking on some internal Elasticsearch threads/resources when the number of shards/replicas goes above a certain number - Perhaps the monitor/nodes/stats timeouts seen in the logs are a symptom, but the root cause could be elsewhere.

Even if the monitor/nodes/stats timeouts are harmless, it is a fact that there are disconnects between the master node and the data nodes regularly/frequently when the number of shards/replicas per node goes above a certain number - and once we reduce the number of shards/replicas per node without making any other changes, the master node -> datanode disconnects are never seen.

nicktgr15 commented 9 years ago

Hi,

Seeing a similar issue after updating from 1.1.1 to 1.4.2 on the master node. The cluster is running on AWS with 3 master, 3 search/client, and 9 data nodes. The following error message appeared on the newly elected master node after the previous master node was killed by Chaos Monkey.

I had to manually restart the elasticsearch service on the master node to restore data flow to the cluster.

[2015-02-09 16:38:43,835][DEBUG][action.admin.cluster.node.stats] [Frey] failed to execute on node [CtXB6C67Tmyw770IPsiVng]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [MN-E (Ultraverse)][inet[/xx.xx.xx.xx:9300]][cluster:monitor/nodes/stats[n]] request_id [255429] timed out after [15001ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Update

In an attempt to reproduce the issue I manually killed the elected master; afterwards an eligible master node was elected as the new master, so the cluster was in a green state. However, there was no data flow, and I saw similar log messages in the master node logs.

[2015-02-10 10:18:48,634][DEBUG][action.admin.cluster.node.stats] [Mad Jack] failed to execute on node [PJpqL98gSM-QcagYUUGHmw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [Mist Mistress][inet[/xx.xx.xx.xx:9300]][cluster:monitor/nodes/stats[n]] request_id [51775] timed out after [15001ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
[2015-02-10 10:18:48,634][DEBUG][action.admin.cluster.node.stats] [Mad Jack] failed to execute on node [hlXZBBV6T26p-Swj5HZ1Bg]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [Tutinax the Mountain-Mover][inet[/xx.xx.xx.xx:9300]][cluster:monitor/nodes/stats[n]] request_id [51779] timed out after [15000ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

I will not restart the elasticsearch service this time, as I want to see whether the cluster will be able to recover on its own.

After 35 minutes, and after the elected master hit 100% heap space usage and 100% CPU usage, it started printing the following in the logs and a new master was elected. However, [Mad Jack] was never dropped from the cluster.

[2015-02-10 10:50:55,373][WARN ][transport                ] [Mad Jack] Received response for a request that has timed out, sent [139664ms] ago, timed out [362ms] ago, action [internal:cluster/nodes/indices/shard/store[n]], node [[Mist Mistress][PJpqL98gSM-QcagYUUGHmw][ip-xx-xx-xx-xx.eu-west-1.compute.internal][inet[/xx.xx.xx.xx:9300]]{master=false}], id [286982]
[2015-02-10 10:50:55,385][WARN ][transport                ] [Mad Jack] Received response for a request that has timed out, sent [137489ms] ago, timed out [409ms] ago, action [cluster:monitor/nodes/stats[n]], node [[MN-E (Ultraverse)][CtXB6C67Tmyw770IPsiVng][ip-xx-xx-xx-xx.eu-west-1.compute.internal][inet[/xx.xx.xx.xx:9300]]{master=false}], id [286993]
[2015-02-10 10:50:55,412][WARN ][transport                ] [Mad Jack] Received response for a request that has timed out, sent [137516ms] ago, timed out [507ms] ago, action [cluster:monitor/nodes/stats[n]], node [[Fin Fang Foom][lEsdGeowQLOmk13iaNbUnw][ip-xx-xx-xx-xx.eu-west-1.compute.internal][inet[/xx.xx.xx.xx:9300]]{master=false}], id [286987]
[2015-02-10 10:50:55,449][WARN ][transport                ] [Mad Jack] Received response for a request that has timed out, sent [295777ms] ago, timed out [89025ms] ago, action [internal:discovery/zen/fd/ping], node [[Shatterfist][mDKyBNBlQ0q5LQ_uJTg5iQ][ip-xx-xx-xx-xx.eu-west-1.compute.internal][inet[/xx.xx.xx.xx:9300]]{data=false, master=true}], id [286820]
gurvindersingh commented 9 years ago

We have seen similar behavior, and it is related to a large number of shards allocated per node. So if you reduce the number of shards, the issue will go away. But I still think it is an issue with ES, as having a large number of shards should not simply force a node to disconnect. I do get the point that a single shard is like a single Lucene index. It was working before, as people reported, in 1.1.x or 1.2.0; what has changed since then that we got this regression?

bleskes commented 9 years ago

@gurvindersingh we spent quite a bit of time and found the source of the regression (see https://github.com/elasticsearch/elasticsearch/pull/9683#issuecomment-74345290). We used to cache an in-memory representation of the file system. That caching was removed to simplify the code base (considerably), but a side effect was that FS metadata calls have to be made on each stats call (like total file size etc.). This ended up slowing things down on some file systems. Since then we have added a stats-level cache, which is much simpler to do (#9709). We are also busy reducing the FS metadata calls to the required minimum.

gurvindersingh commented 9 years ago

@bleskes Thanks for the information. I can see that the upcoming patches might solve this problem. Is there any plan to port those changes back to the 1.3 or 1.4 branch? It seems to be targeted at 1.5 and 2.0 mainly.

bleskes commented 9 years ago

@gurvindersingh FYI I back ported the change to 1.4

gurvindersingh commented 9 years ago

Thanks @bleskes

sagarl commented 9 years ago

@bleskes Just wanted to confirm: you backported the changes to 1.4.5, and the patch won't be available for prior 1.4.x versions (e.g. 1.4.1) - is that correct? If yes, what's the timeline for releasing 1.4.5? How about #9683 - will it be backported to 1.4.5 too? I have filed #10003, and it seems related to this and #9683.

gurvindersingh commented 9 years ago

#9709 fixes #9683 too. At least the comments on #9709 show that. #9683 was closed in favor of #9709. Check out both of them; maybe I am wrong ;)

schlitzered commented 9 years ago

I also have this issue with ES 1.4.4. Here is a call to _nodes/_local/stats:

[root@ip-10-0-5-102 logs]# time curl 'localhost:9200/_nodes/_local/stats' {"cluster_name":"syslog","nodes":{"2MfX3MGIQB-aIvFJZ1U4QA":{"timestamp":1426616443943,"name":"i-3e2c42d8","transport_address":"inet[ip-10-0-5-102.eu-west-1.compute.internal/10.0.5.102:9300]","host":"ip-10-0-5-102","ip":["inet[ip-10-0-5-102.eu-west-1.compute.internal/10.0.5.102:9300]","NONE"],"attributes":{"rack":"eu-west-1c"},"indices":{"docs":{"count":84450103,"deleted":7},"store":{"size_in_bytes":64526935028,"throttle_time_in_millis":819748},"indexing":{"index_total":11590569,"index_time_in_millis":8028168,"index_current":10616,"delete_total":0,"delete_time_in_millis":0,"delete_current":0,"noop_update_total":0,"is_throttled":false,"throttle_time_in_millis":0},"get":{"total":31,"time_in_millis":17,"exists_total":31,"exists_time_in_millis":17,"missing_total":0,"missing_time_in_millis":0,"current":0},"search":{"open_contexts":2,"query_total":58441,"query_time_in_millis":2133601,"query_current":0,"fetch_total":11083,"fetch_time_in_millis":32747,"fetch_current":0},"merges":{"current":1,"current_docs":48532,"current_size_in_bytes":17748683,"total":101910,"total_time_in_millis":39865843,"total_docs":763114841,"total_size_in_bytes":326663914252},"refresh":{"total":934696,"total_time_in_millis":9583115},"flush":{"total":1341,"total_time_in_millis":174631},"warmer":{"current":0,"total":377049,"total_time_in_millis":214590},"filter_cache":{"memory_size_in_bytes":130343904,"evictions":0},"id_cache":{"memory_size_in_bytes":0},"fielddata":{"memory_size_in_bytes":429796620,"evictions":0},"percolate":{"total":0,"time_in_millis":0,"current":0,"memory_size_in_bytes":-1,"memory_size":"-1b","queries":0},"completion":{"size_in_bytes":0},"segments":{"count":2720,"memory_in_bytes":552585548,"index_writer_memory_in_bytes":460744,"index_writer_max_memory_in_bytes":945826196,"version_map_memory_in_bytes":13920,"fixed_bit_set_memory_in_bytes":0},"translog":{"operations":18836,"size_in_bytes":17},"suggest":{"total":0,"time_in_millis":0,"current":0},"query_cache":{"memory_size_in_bytes":0,"evictions":0,"hit_count":0,"miss_count":0}},"os":{"timestamp":1426616445040,"uptime_in_millis":365218,"load_average":[0.8,0.73,0.78],"cpu":{"sys":40,"user":58,"idle":0,"usage":98,"stolen":0},"mem":{"free_in_bytes":521428992,"used_in_bytes":15083851776,"free_percent":34,"used_percent":65,"actual_free_in_bytes":5342273536,"actual_used_in_bytes":10263007232},"swap":{"used_in_bytes":0,"free_in_bytes":0}},"process":{"timestamp":1426616445040,"open_file_descriptors":9483,"cpu":{"percent":28,"sys_in_millis":39366270,"user_in_millis":95162100,"total_in_millis":134528370},"mem":{"resident_in_bytes":9735311360,"share_in_bytes":259915776,"total_virtual_in_bytes":29860155392}},"jvm":{"timestamp":1426616445047,"uptime_in_millis":240902797,"mem":{"heap_used_in_bytes":3531707288,"heap_used_percent":41,"heap_committed_in_bytes":8572502016,"heap_max_in_bytes":8572502016,"non_heap_used_in_bytes":79473584,"non_heap_committed_in_bytes":119545856,"pools":{"young":{"used_in_bytes":35675264,"max_in_bytes":139591680,"peak_used_in_bytes":139591680,"peak_max_in_bytes":139591680},"survivor":{"used_in_bytes":11690072,"max_in_bytes":17432576,"peak_used_in_bytes":17432576,"peak_max_in_bytes":17432576},"old":{"used_in_bytes":3484341952,"max_in_bytes":8415477760,"peak_used_in_bytes":6757907936,"peak_max_in_bytes":8415477760}}},"threads":{"count":55,"peak_count":17849},"gc":{"collectors":{"young":{"collection_count":167128,"collection_time_in_millis":7525529},"old":{"collection_count":33,"co
llection_time_in_millis":6107}}},"buffer_pools":{"direct":{"count":4774,"used_in_bytes":85086885,"total_capacity_in_bytes":85086885},"mapped":{"count":4106,"used_in_bytes":19142984579,"total_capacity_in_bytes":19142984579}}},"thread_pool":{"generic":{"threads":2,"queue":0,"active":0,"rejected":0,"largest":17785,"completed":2379948},"index":{"threads":2,"queue":0,"active":0,"rejected":0,"largest":2,"completed":5},"bench":{"threads":0,"queue":0,"active":0,"rejected":0,"largest":0,"completed":0},"get":{"threads":2,"queue":0,"active":0,"rejected":0,"largest":2,"completed":31},"snapshot":{"threads":0,"queue":0,"active":0,"rejected":0,"largest":0,"completed":0},"merge":{"threads":1,"queue":0,"active":0,"rejected":0,"largest":1,"completed":1438988},"suggest":{"threads":0,"queue":0,"active":0,"rejected":0,"largest":0,"completed":0},"bulk":{"threads":2,"queue":0,"active":0,"rejected":0,"largest":2,"completed":4997737},"optimize":{"threads":0,"queue":0,"active":0,"rejected":0,"largest":0,"completed":0},"warmer":{"threads":1,"queue":0,"active":0,"rejected":0,"largest":1,"completed":1020397},"flush":{"threads":1,"queue":0,"active":0,"rejected":0,"largest":1,"completed":375564},"search":{"threads":6,"queue":0,"active":0,"rejected":0,"largest":6,"completed":104908},"listener":{"threads":1,"queue":0,"active":0,"rejected":0,"largest":1,"completed":7414},"percolate":{"threads":0,"queue":0,"active":0,"rejected":0,"largest":0,"completed":0},"management":{"threads":5,"queue":1221,"active":5,"rejected":0,"largest":5,"completed":12301305},"refresh":{"threads":1,"queue":0,"active":1,"rejected":0,"largest":1,"completed":934691}},"network":{"tcp":{"active_opens":42156,"passive_opens":9742,"curr_estab":137,"in_segs":47202218,"out_segs":47240353,"retrans_segs":5884,"estab_resets":262,"attempt_fails":683,"in_errs":0,"out_rsts":4109}},"fs":{"timestamp":1426616445048,"total":{"total_in_bytes":214643507200,"free_in_bytes":146611564544,"available_in_bytes":146611564544,"disk_reads":505620,"disk_writes":5957779,"disk_io_op":6463399,"disk_read_size_in_bytes":33000827392,"disk_write_size_in_bytes":368532620288,"disk_io_size_in_bytes":401533447680,"disk_queue":"0","disk_service_time":"0"},"data":[{"path":"/mnt/xvdb/elasticsearch/data/syslog/nodes/0","mount":"/mnt/xvdb","dev":"/dev/xvdb","total_in_bytes":214643507200,"free_in_bytes":146611564544,"available_in_bytes":146611564544,"disk_reads":505620,"disk_writes":5957779,"disk_io_op":6463399,"disk_read_size_in_bytes":33000827392,"disk_write_size_in_bytes":368532620288,"disk_io_size_in_bytes":401533447680,"disk_queue":"0","disk_service_time":"0"}]},"transport":{"server_open":65,"rx_count":21365099,"rx_size_in_bytes":22635080095,"tx_count":11759575,"tx_size_in_bytes":86814045445},"http":{"current_open":3,"total_opened":781},"breakers":{"fielddata":{"limit_size_in_bytes":5143501209,"limit_size":"4.7gb","estimated_size_in_bytes":429796620,"estimated_size":"409.8mb","overhead":1.03,"tripped":0},"request":{"limit_size_in_bytes":3429000806,"limit_size":"3.1gb","estimated_size_in_bytes":166816680,"estimated_size":"159mb","overhead":1.0,"tripped":0},"parent":{"limit_size_in_bytes":6000751411,"limit_size":"5.5gb","estimated_size_in_bytes":596613300,"estimated_size":"568.9mb","overhead":1.0,"tripped":0}}}}} real 0m48.550s user 0m0.002s sys 0m0.005s

Is the latest 1.3.x release also affected by this bug? If not, I will downgrade my cluster to that release.
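
(A side note for anyone digging into a slow local stats call like the one above: the hot threads API can show what the busy management threads are actually doing - a quick sketch, assuming the default port:)

curl 'localhost:9200/_nodes/_local/hot_threads?threads=10'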

rookie7799 commented 9 years ago

We are having the same issue on 1.4.4 as well; in our case, however, it is being triggered by a long-running aggregation query (which triggers fielddata population).

faxm0dem commented 9 years ago

I'm having an issue on 1.4.2 which looks awfully like this one. Everything went smoothly until I added 3 data nodes to my 6-node cluster. I am seeing the new nodes fail (the old ones never do): they seem to deadlock, and I can't even run jstack on them. Are you guys seeing deadlocks too on the failed node? (The API times out.)

clintongormley commented 9 years ago

@faxm0dem are you using Ubuntu? I'm wondering if this kernel bug might be at fault? https://github.com/elastic/elasticsearch/issues/10447#issuecomment-90303768

faxm0dem commented 9 years ago

@clintongormley no, Scientific Linux 6 actually, using openjdk-1.7.0.

faxm0dem commented 9 years ago

FWIW, deleting a couple of indices and closing others stabilized my cluster. Can you tell us exactly what the limit of indices and/or shards per node is below which this bug won't be triggered? Also, will upgrading to 1.5.1 help?

clintongormley commented 9 years ago

@faxm0dem there is no hard limit. It depends on many factors such as your hardware, the size of your mappings, the config (e.g. are you absolutely sure you have swap disabled?), etc. And things will change again once we add cluster state diffs (https://github.com/elastic/elasticsearch/pull/10212).

faxm0dem commented 9 years ago

Yes, swap is disabled and the message about ES failing to lock memory is not present. I have had no failed nodes since closing indices. I now have about 1340 shards per node. I'll report back when nodes start failing again.
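
(For anyone wanting to check their own shards-per-node count, the cat allocation API shows it directly - a quick sketch:)

# shards, disk usage and host per node
curl 'localhost:9200/_cat/allocation?v'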

clintongormley commented 9 years ago

@faxm0dem 1340 shards per node is silly high, by any count. It's difficult to say "no more than X" because there are so many factors involved, but I'd say that you're asking for trouble with so many. Either you have too many nearly empty shards and you should use fewer shards per time period (or make your time period longer), or you should be looking at adding hardware.