sysmonk closed 8 years ago
It's harmless as it's unused at the moment, but it should be a sane figure. I'll look into it.
When you say it's unused, do you mean the 'load-balance' method doesn't work yet? :(
It works mostly the same as ZeroMQ did. So there's just the problematic scenario where a single blocked Logstash causes the pipeline to pause intermittently while it times out. It also means a single slow Logstash could slow the pipeline overall.
A simple back off will help massively. Using latency then covers more edge cases, like a 10x slower Logstash slowing the entire pipeline, and also allows payload splitting so it can just send smaller loads to slower servers. If your Logstash servers are all the same, it'll work great now. The latency is for where servers differ in spec (mine vary wildly as I'm massively limited in capacity), so we need to send less overall to the lower spec servers.
So if all your Logstash servers are the same spec, latency would be the same across them anyway and it's the back off that'll help the most :) In the v2 plugin I hope to get status polling in, so it doesn't even try sending to blocked pipelines.
All of my servers are same-ish in spec (per DC), but I'm not using zmq due to the issues with it (only using it on like 1% of hosts). And when using the tcp (tls) method, the connections aren't split evenly across the Logstash nodes.
Also, I'm in a situation where, out of 4000 nodes, around 30 generate 80% of the traffic, so there's a big possibility of those 30 nodes being connected to the same-ish Logstash nodes.
I'm planning to add more and more logs (doing over 80k events/s to ES now, planning to triple/quadruple it), so I'm really, really waiting for a good load-balancing solution in this case.
Hi,
Another thing I noticed is the strange latency times reported by log-courier:
E.g. event f7670b9a26ad566153f124d9e10d935d sent at 16:32:43.660376, acked at 16:32:43.664203. That's 3827 µs (about 3.8 ms) apart, and the latency reported is 4487848290723224576.000000. What is it measured in?
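For reference, the gap between the two timestamps can be checked with a few lines of Go. This is only a sketch for verifying the arithmetic; `elapsedMicros` and `stampLayout` are hypothetical names, and it says nothing about how log-courier computes the figure it reports.

```go
package main

import (
	"fmt"
	"time"
)

// stampLayout matches log timestamps of the form HH:MM:SS.micros.
const stampLayout = "15:04:05.000000"

// elapsedMicros parses two same-day timestamps and returns the time
// between them in microseconds. (Hypothetical helper for this check.)
func elapsedMicros(sent, acked string) (int64, error) {
	s, err := time.Parse(stampLayout, sent)
	if err != nil {
		return 0, err
	}
	a, err := time.Parse(stampLayout, acked)
	if err != nil {
		return 0, err
	}
	return a.Sub(s).Microseconds(), nil
}

func main() {
	us, err := elapsedMicros("16:32:43.660376", "16:32:43.664203")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d µs\n", us) // prints "3827 µs"
}
```

So the real round trip is a few milliseconds, and the reported value is far too large to be that interval in seconds, milliseconds, microseconds or nanoseconds.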