log-courier 2.x: strange latency times?

sysmonk commented 8 years ago

Hi,

Another thing i noticed is the strange latency times reported by log-courier:

2016/03/14 16:31:43.660511 Timeout timer reset - due at 2016-03-14 16:46:43.660468572 +0000 UTC
2016/03/14 16:31:43.660586 Registrar received offsets for 3 log entries
2016/03/14 16:32:13.658183 Spooler flushing 3 events due to spool timeout exceeded
2016/03/14 16:32:13.658355 [logstash4.example.com:9001] 3 new events, sending to endpoint
2016/03/14 16:32:13.658427 [logstash4.example.com:9001] Sending payload de3d5481cdfa531267aa655aa5ab0117
2016/03/14 16:32:13.662671 Timeout timer reset - due at 2016-03-14 16:33:13.66262503 +0000 UTC
2016/03/14 16:32:13.663208 [logstash4.example.com:9001] Send is now ready, awaiting new events
2016/03/14 16:32:13.663965 [logstash4.example.com:9001] Acknowledgement received for payload de3d5481cdfa531267aa655aa5ab0117 sequence 3
2016/03/14 16:32:13.664057 [logstash4.example.com:9001] Average latency per event: 3821602192617030144.000000
2016/03/14 16:32:13.664119 Timeout timer reset - due at 2016-03-14 16:47:13.664099511 +0000 UTC
2016/03/14 16:32:13.664251 Registrar received offsets for 3 log entries
2016/03/14 16:32:28.659121 Spooler flushing 1 events due to spool timeout exceeded
2016/03/14 16:32:28.659289 [logstash4.example.com:9001] 1 new events, sending to endpoint
2016/03/14 16:32:28.659365 [logstash4.example.com:9001] Sending payload ce6bf8b0858d8953de5425676ad71230
2016/03/14 16:32:28.660674 Timeout timer reset - due at 2016-03-14 16:33:28.660641691 +0000 UTC
2016/03/14 16:32:28.660900 [logstash4.example.com:9001] Send is now ready, awaiting new events
2016/03/14 16:32:28.693236 [logstash4.example.com:9001] Acknowledgement received for payload ce6bf8b0858d8953de5425676ad71230 sequence 1
2016/03/14 16:32:28.693333 [logstash4.example.com:9001] Average latency per event: 4800776944328073216.000000
2016/03/14 16:32:28.693410 Timeout timer reset - due at 2016-03-14 16:47:28.693388133 +0000 UTC
2016/03/14 16:32:28.693497 Registrar received offsets for 1 log entries
2016/03/14 16:32:43.660079 Spooler flushing 3 events due to spool timeout exceeded
2016/03/14 16:32:43.660302 [logstash4.example.com:9001] 3 new events, sending to endpoint
2016/03/14 16:32:43.660376 [logstash4.example.com:9001] Sending payload f7670b9a26ad566153f124d9e10d935d
2016/03/14 16:32:43.662853 Timeout timer reset - due at 2016-03-14 16:33:43.662797137 +0000 UTC
2016/03/14 16:32:43.663145 [logstash4.example.com:9001] Send is now ready, awaiting new events
2016/03/14 16:32:43.664203 [logstash4.example.com:9001] Acknowledgement received for payload f7670b9a26ad566153f124d9e10d935d sequence 3
2016/03/14 16:32:43.664641 [logstash4.example.com:9001] Average latency per event: 4487848290723224576.000000
2016/03/14 16:32:43.664907 Timeout timer reset - due at 2016-03-14 16:47:43.664879752 +0000 UTC
2016/03/14 16:32:43.665115 Registrar received offsets for 3 log entries

I.e. event f7670b9a26ad566153f124d9e10d935d sent at 16:32:43.660376, acked at 16:32:43.664203. That's 3827 ms apart, ant latency reported is 4487848290723224576.000000. What is it measured in?

driskell commented 8 years ago

It's harmless as unused at moment but it should be a sane figure. I'll look into it.

sysmonk commented 8 years ago

When you say it's unused, you mean the 'load-balance' method doesn't work yet ? :(

driskell commented 8 years ago

It works mostly the same as ZeroMQ did. So there's just the problematic scenario with a single blocked Logstash causing the pipeline to intermittently pause while it times out. Also means a single slow Logstash could slow pipeline overall.

Simple back off will help massively. Then using latency stops more edge cases like a 10x slow Logstash from slowing entire pipeline and also allow payload splitting so it can just send smaller loads to slower servers. If you have same servers for Logstash it'll work great now. The latency will be for where servers differ in spec (mine wildly vary as I'm massively limited in capacity) so we need to send less overall to the lower spec servers.

driskell commented 8 years ago

So if all your Ligstash same spec, latency would be same across them anyway and it's the back off that'll help the most :) and v2 plugin I hope to get in status polling so it doesn't even try sending to blocked pipelines.

sysmonk commented 8 years ago

All of my servers are same-ish in spec (per DC), but i'm not using zmq due to the issues with it (only using it on like 1% of hosts). And when using the tcp(tls) method, the connections aren't split evenly through logstash nodes.

Also, I'm in situation where out of 4000 nodes, around 30 generate 80% of the traffic, so there's a big possibility of those 30 nodes being connected to same-ish logstash nodes.

I'm planning to add even more and more logs ( doing over 80k events/s to ES now, planning to tripple/quadrupple it), so i'm really really waiting for a good load-balancing solution in this case.

driskell / log-courier

log-courier 2.x: strange latency times? #290