BroadSoft-Xtended / BroadWorks-Dashboards-and-Discovery

This repository contains the BroadWorks Dashboards and Discovery components to extend BroadWorks data mining, reporting, and analysis capabilities.

Troubleshooting Receiver/Sender bottleneck #46

Open discotab311 opened 6 years ago

discotab311 commented 6 years ago

I have the following setup:

4 ES Ingest Nodes running the Receiver, 3 Master Nodes, and 3 Data Nodes. I've noticed the Sender is lagging behind, quite a bit at times. We are on ES 6.2.4 with the 1020 apps.

I'm looking for the best way to identify whether the bottleneck is IO, the receiver, or the sender. In the receiver logs, all I get is:

2018-05-17_18:51:06.727 [LogProcessor #10] INFO c.b.e.ElasticLogIndexerZeroThreadImpl - hit max number of requests 4

senderreceiverusessl:false
logprocessorqueuesize:200
logprocessornumthreads:16
jvmheapsize:1024m
JAVA_PATH:/usr/bin
ESAUTHUSER:NOAUTH
ESAUTHPASS:NOAUTH
kafkaserver:None
kafkaserverport:9092
kafkastopicname:None
kafkasgroupname:None
usekafka:false

Each ingest and data node has 18 CPUs and 48 GB of RAM, with 50% allocated to the ES heap.

Thanks!

bchibbard commented 6 years ago
  1. How many senders do you have, and how many receivers do you have?
  2. Are the receivers running on the same host machines as your ES cluster?
  3. About how often do you see that "hit max requests" log?
  4. How would you characterize your mix of logs? For example, 40% AS logs, 60% XSP logs.
  5. When it falls behind, do you know how far behind it falls? In other words, if you are looking for the most recent log for some server, what is the difference between that timestamp and the wall clock time? (There's a quick way to check this just after the list.)
  6. Does it catch up? How long does that take?
  7. Are you using SSDs or HDDs for your ES cluster?
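For #5, if it's easier, you can just ask ES for its newest document and compare its timestamp to the clock. Something like this works, assuming the indices use a @timestamp field (substitute whatever index pattern and timestamp field your deployment actually has):

    # newest document across all indices, sorted by timestamp
    curl -s 'http://localhost:9200/_search?size=1&sort=@timestamp:desc&pretty'

The gap between that document's timestamp and the wall clock is roughly how far behind the pipeline is.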

Sorry for so many questions!

discotab311 commented 6 years ago

I love questions!

  1. Each ingest node is running a receiver. In this case, there are two primary AS's and one NS talking to the receiver.
  2. Yes
  3. Constant
  4. 90% AS logs, 10% NS logs
  5. Varies. Right now it is about 40 min, but yesterday it was close to 2 hours at one point.
  6. Unknown at the moment. We generate a 30 MB XS log every 40 seconds.
  7. HDDs in a VM environment.

Thanks!

bchibbard commented 6 years ago
  1. Just to clarify, you have 2 AS and 1 NS per receiver, is that right?
  2. What system load averages do you see across your ES cluster nodes when it is falling behind?
  3. Also, do you see some logs in the receiver with text like "...onSuccess response from ES... - took ..."? The number after "took" is how long the ES bulk index request took in milliseconds. Obviously, this number fluctuates quite a bit, and gets large if ES is struggling. When things are at their worst, what are typical numbers you see there?
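For #2 and #3, a quick way to gather both (the receiver log path here is only a placeholder; use wherever your receiver actually writes its log):

    # 1-, 5-, and 15-minute load averages on each ES / receiver node
    uptime

    # bulk-index durations reported by the receiver; the last number on each line is milliseconds
    grep 'onSuccess response from ES' /path/to/receiver.log | tail -20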

From what you've said so far, I think your bottleneck is at the receivers. If that is correct, here are some things you can try with the receivers:

Add "numsimultaneousrequests:8" to your receiver props file. This allows more transactions to be in flight to ES. By default it is 4. You can bump it up to a maximum of 10 (which is tied to an un-configurable thread pool size used by the low-level REST client). Under load, you can expect to lose data with values over 10. Also, when you do this, ES will definitely get busier.

If that doesn't fix things, you can add "useindexthread:true". True was supposed to be the default. I think it will give you some additional improvement, although I suspect the biggest gain would come if you had a lot of XSP logs to parse.
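For reference, with both of those changes your receiver props file would end up with lines like these alongside the settings you posted earlier (exact file location depends on your install):

    numsimultaneousrequests:8
    useindexthread:true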

If that still doesn't do it, one thing we're looking into recommending in an upcoming scaling guide is to migrate the log receiver(s) out of the ES cluster. Collocation is fine for a minimal deployment (which is what our documentation is assuming), but a heavy load exacerbates the fact that they compete for CPU.

I hope that helps!

discotab311 commented 6 years ago

Those updates have helped a ton! One other thing I did was to increase my log size from 30 MB to 100 MB.

It seems to be keeping up much better now. I'll really know in a few hours when we hit peak.

One other question: I am seeing this new error in the receiver log:

2018-05-21_09:53:49.963 [I/O dispatcher 5] INFO c.b.e.ElasticLogIndexerThreadownerImpl - onFailure exception - numreqs= 5 after 30118
java.net.SocketTimeoutException: null
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:375)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:263)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:492)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:213)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
    at java.lang.Thread.run(Thread.java:748)
2018-05-21_09:53:50.830 [I/O dispatcher 6] INFO c.b.e.ElasticLogIndexerThreadownerImpl - onFailure exception - numreqs= 5 after 30179
java.net.SocketTimeoutException: null
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:375)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:263)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:492)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:213)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
    at java.lang.Thread.run(Thread.java:748)
2018-05-21_09:54:01.710 [I/O dispatcher 8] INFO c.b.e.ElasticLogIndexerThreadownerImpl - onSuccess response from ES numreqs= 5 indexcount 25000 - took 13141
2018-05-21_09:54:02.811 [I/O dispatcher 7] INFO c.b.e.ElasticLogIndexerThreadownerImpl - onSuccess response from ES numreqs= 5 indexcount 25000 - took 14731
2018-05-21_09:54:04.259 [I/O dispatcher 9] INFO c.b.e.ElasticLogIndexerThreadownerImpl - onSuccess response from ES numreqs= 5 indexcount 25000 - took 15255
2018-05-21_09:54:07.992 [I/O dispatcher 12] INFO c.b.e.ElasticLogIndexerThreadownerImpl - onSuccess response from ES numreqs= 5 indexcount 25000 - took 17162

bchibbard commented 6 years ago

Under the circumstances, it looks like it is complaining about the bulk transaction occasionally taking too long. How often does this happen? Those are some pretty high transaction times, and thirty seconds is definitely over the default timeout value (which is not currently configurable).

What system load do you see in ES? This may sound like a crazy question, but when things are at their worst, do you see a lot of CPU usage by the "kjournald" process on your ES machines?
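If you want to check, a quick snapshot on one of the ES data nodes while things are slow will show it (kjournald is the ext3 journaling kernel thread; on ext4 the equivalent is jbd2):

    top -b -n 1 | head -20                     # busiest processes right now
    top -b -n 1 | grep -i 'kjournald\|jbd2'    # the journaling threads specifically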

discotab311 commented 6 years ago

I don't see any crazy usage on kjournald. (Attached Cacti graphs: cacti_graph_49000, cacti_graph_49001, cacti_graph_49247, cacti_graph_49244, cacti_graph_49245.)

The HTTP timeout error is pretty consistent across all of the nodes the receiver lives on. There don't seem to be any spikes in JVM or system memory or CPU.

bchibbard commented 6 years ago

I might not be reading these right. Are all of these your ingest nodes? I'm fairly unfamiliar with ingest nodes, but if I understand them correctly, I wouldn't expect them to appear all that busy under the circumstances unless you have assigned pipeline transformations to them.

How busy are the data nodes that are doing the indexing? If the data nodes' CPUs are not under pressure, but they are taking 15 seconds to index 25K documents, it sounds like you might be having network or disk contention. Do you know what your total indexing rate is for your cluster? How many documents per minute?
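If you don't have that number handy, a rough way to get it is to sample ES's cumulative indexing counter twice and subtract (run against any node; adjust host/port for your setup):

    # "index_total" under "_all"/"primaries" is the cumulative count of documents indexed
    curl -s 'http://localhost:9200/_stats/indexing?pretty' | head -40
    sleep 60
    curl -s 'http://localhost:9200/_stats/indexing?pretty' | head -40
    # the difference between the two index_total values is your documents per minute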

When you deployed to a multi-node cluster, did you change the number of shards and replicas in the index templates?
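If you're not sure, you can dump what the installed templates actually specify (I don't recall the exact template names the apps install, so just list them all):

    # shard / replica counts from every installed index template
    curl -s 'http://localhost:9200/_template?pretty' | grep -E '"(number_of_shards|number_of_replicas)"'

With three data nodes, a single primary shard per index puts all of that index's indexing on one node, so raising the shard count in the templates is usually the first knob to look at.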

On another note, what VM hosting software are you using, and how do you have your guest VMs deployed across your VM hosts?

On a single-node cluster we got significant performance gains by adding physical disks (in our case, even rotational disks did well) to reduce disk contention as well as increase storage size. ES will allocate indices across the different paths configured in path.data.

If you want to try the same thing in a virtual environment, you would need to be sure your virtual disks are distributed across multiple physical disks.
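For reference, the multi-path layout is just a list under path.data in elasticsearch.yml; the paths below are examples only, and each one should live on its own physical (or separately backed virtual) disk to actually relieve contention:

    path.data:
      - /data/disk1/elasticsearch
      - /data/disk2/elasticsearch
      - /data/disk3/elasticsearch

ES allocates each shard to one of those paths, so the load spreads across the disks as shards accumulate.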