Open discotab311 opened 6 years ago
Sorry for so many questions!
I love questions!
Thanks!
From what you've said so far, I think your bottleneck is at the receivers. If that is correct, here are some things you can try with the receivers:
Add "numsimultaneousrequests:8" to your receiver props file. This allows more transactions to be in flight to ES. By default it is 4. You can bump it up to a maximum of 10 (which is tied to an un-configurable thread pool size used by the low-level REST client). Under load, you can expect to lose data with values over 10. Also, when you do this, ES will definitely get busier.
If that doesn't fix things, you can also add "useindexthread:true" (true was supposed to be the default). I think it will give you some additional improvement, although I suspect the biggest gain would come if you had a lot of XSP logs to parse.
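For reference, here's a minimal sketch of those two lines as they might sit in the receiver props file, following the key:value style shown later in this thread (keep your existing entries alongside them):

    numsimultaneousrequests:8
    useindexthread:true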
If that still doesn't do it, one thing we're looking into recommending in an upcoming scaling guide is to migrate the log receiver(s) out of the ES cluster. Co-locating them is fine for a minimal deployment (which is what our documentation assumes), but under heavy load they compete with ES for CPU.
I hope that helps!
Those updates have helped a ton! One other thing I did was to increase my log size from 30 MB to 100 MB.
It seems to be keeping up much better now. I'll really know in a few hours when we hit peak.
One other question is this new error I am seeing in the receiver log:
2018-05-21_09:53:49.963 [I/O dispatcher 5] INFO c.b.e.ElasticLogIndexerThreadownerImpl - onFailure exception - numreqs= 5 after 30118
java.net.SocketTimeoutException: null
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:375)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:263)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:492)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:213)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
    at java.lang.Thread.run(Thread.java:748)
2018-05-21_09:53:50.830 [I/O dispatcher 6] INFO c.b.e.ElasticLogIndexerThreadownerImpl - onFailure exception - numreqs= 5 after 30179
java.net.SocketTimeoutException: null
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:375)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:263)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:492)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:213)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
    at java.lang.Thread.run(Thread.java:748)
2018-05-21_09:54:01.710 [I/O dispatcher 8] INFO c.b.e.ElasticLogIndexerThreadownerImpl - onSuccess response from ES numreqs= 5 indexcount 25000 - took 13141
2018-05-21_09:54:02.811 [I/O dispatcher 7] INFO c.b.e.ElasticLogIndexerThreadownerImpl - onSuccess response from ES numreqs= 5 indexcount 25000 - took 14731
2018-05-21_09:54:04.259 [I/O dispatcher 9] INFO c.b.e.ElasticLogIndexerThreadownerImpl - onSuccess response from ES numreqs= 5 indexcount 25000 - took 15255
2018-05-21_09:54:07.992 [I/O dispatcher 12] INFO c.b.e.ElasticLogIndexerThreadownerImpl - onSuccess response from ES numreqs= 5 indexcount 25000 - took 17162
Under the circumstances, it looks like it is complaining about the bulk transaction occasionally taking too long. How often does this happen? Those are some pretty high transaction times. Thirty seconds is definitely over the default timeout value (which we don't currently have as configurable).
What system load do you see in ES? This may sound like a crazy question, but when things are at their worst, do you see a lot of CPU usage by the "kjournald" process on your ES machines?
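A quick way to spot-check that from a shell on the ES machines (just a sketch, assuming a standard top; the grep target is only the process name mentioned above) is something like:

    top -b -n 1 | grep kjournald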
I don't see any crazy usage on kjournald.
The HTTP error is pretty consistent across all of the nodes the receiver lives on. There don't seem to be any spikes in JVM or system memory or CPU.
I might not be reading these right. Are all of these your ingest nodes? I'm fairly unfamiliar with ingest nodes, but if I understand them correctly, I wouldn't expect them to look all that busy under the circumstances unless you have assigned pipeline transformations to them.
How busy are the data nodes that are doing the indexing? If the data nodes' CPUs are not under pressure, but they are taking 15 seconds to index 25K documents, it sounds like you might be having network or disk contention. Do you know what your total indexing rate is for your cluster? How many documents per minute?
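If it helps, here's one way to sample that (a sketch, assuming curl is available on one of the nodes and ES is on the stock HTTP port 9200; "bulk" is the thread pool name in ES 6.2):

    # Cumulative indexed-document counters per node; run it twice, a minute apart,
    # and subtract to get documents per minute.
    curl -s 'http://localhost:9200/_nodes/stats/indices/indexing?pretty' | grep '"index_total"'

    # Bulk thread pool pressure on the data nodes; a growing queue or rejected
    # count points at the data nodes rather than the receivers.
    curl -s 'http://localhost:9200/_cat/thread_pool/bulk?v&h=node_name,active,queue,rejected'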
When you deployed to a multi-node cluster, did you change the number of shards and replicas in the index templates?
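For example, something along these lines (a sketch only; the template name and index pattern are placeholders for whatever the receiver's templates actually use, and a PUT replaces the whole template, so carry over the existing mappings as well):

    # Inspect the current templates and their shard/replica settings
    curl -s 'http://localhost:9200/_template?pretty'

    # Example of bumping a template to 3 primary shards and 1 replica
    curl -s -X PUT 'http://localhost:9200/_template/logs_template' \
      -H 'Content-Type: application/json' -d '
    {
      "index_patterns": ["logs-*"],
      "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1
      }
    }'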
On another note, what VM hosting software are you using, and how do you have your guest VMs deployed across your VM hosts?
On a single-node cluster we got significant performance gains by adding physical disks (in our case, even rotational disks did well) to reduce disk contention as well as increase storage size. ES will allocate indices across the different paths configured in path.data.
If you want to try the same thing in a virtual environment, you would need to be sure your virtual disks are distributed across multiple physical disks.
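As a sketch, multiple data paths go into elasticsearch.yml like this (the mount points are just placeholders):

    path.data:
      - /mnt/disk1/elasticsearch
      - /mnt/disk2/elasticsearch
      - /mnt/disk3/elasticsearch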
I have the following setup:
4 ES ingest nodes running the Receiver, 3 master nodes, and 3 data nodes. I've noticed the Sender lagging behind, quite a bit at times. ES 6.2.4 with the 1020 apps.
I'm looking for the best way to identify whether the bottleneck is I/O, the receiver, or the sender. In the receiver logs, all I get is:
2018-05-17_18:51:06.727 [LogProcessor #10] INFO c.b.e.ElasticLogIndexerZeroThreadImpl - hit max number of requests 4
senderreceiverusessl:false
logprocessorqueuesize:200
logprocessornumthreads:16
jvmheapsize:1024m
JAVA_PATH:/usr/bin
ESAUTHUSER:NOAUTH
ESAUTHPASS:NOAUTH
kafkaserver:None
kafkaserverport:9092
kafkastopicname:None
kafkasgroupname:None
usekafka:false
Each ingest and data node has 18 CPUs and 48 GB of RAM, with 50% allocated to the ES heap.
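In case it matters, that split works out to roughly the following in each ES node's jvm.options (assuming the heap is set there):

    -Xms24g
    -Xmx24g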
Thanks!