avito-tech / bioyino

High performance and high-precision multithreaded StatsD server
The Unlicense

One node sends to carbon less often when another node is physically turned off. #39

Closed lexore closed 3 years ago

lexore commented 5 years ago

Hello. We have a cluster of three nodes: node1, node2 and node3. Let's say statsd metrics are sent only to node1. When we power off the node2 server, node1 starts sending to carbon much less often, about once every 1-2 minutes. When I commented out the "nodes" option in the "[network]" block on node1, everything went back to normal. I think the problem is related to TCP timeouts when node1 tries to communicate with node2 to exchange (or aggregate) metrics. node1, node2 and node3 are in different networks, so the TCP timeouts are not cut short by ARP request timeouts.

Here is a screenshot from graphite. The blue dots are all that is left of the line. In the first period (11:50 - 12:05) I was trying to understand what was happening. In the second (12:11 - 12:17) I commented out "nodes" and was testing the idea.

[Screenshot: firefox_2019-01-22_16-35-16]

Config:

verbosity = "warn"
n-threads = 8
w-threads = 8
task-queue-size = 1024
start-as-leader = false
stats-interval = 10000
stats-prefix = "resources.monitoring.bioyino"
consensus = "internal"
[metrics]
count-updates = true
update-counter-prefix = "resources.monitoring.bioyino.updates"
update-counter-suffix = ""
update-counter-threshold = 200
fast-aggregation = false
[carbon]
address = "127.0.0.1:2003"
interval = 10000
connect-delay = 250
connect-delay-multiplier = 2
connect-delay-max = 10000
send-retries = 30
[network]
listen = "0.0.0.0:8125"
peer-listen = "0.0.0.0:8136"
mgmt-listen = "0.0.0.0:8137"
bufsize = 1500
multimessage = true
mm-packets = 100
mm-async = false
buffer-flush-time = 3000
buffer-flush-length = 65536
greens = 4
async-sockets = 4
nodes = ["192.168.2.2:8136", "192.168.3.3:8136"]
snapshot-interval = 1000
[raft]
start-delay = 5000
this-node = "192.168.1.1:8138"
nodes = {"192.168.1.1:8138" = 1, "192.168.2.2:8138" = 2, "192.168.3.3:8138" = 3}
[consul]
start-as = "disabled"
agent = "127.0.0.1:8500"
session-ttl = 11000
renew-time = 1000
key-name = "service/bioyino/lock"
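
For reference, the change tested in the second period can be sketched as the fragment below. It only shows the "nodes" line of the "[network]" block on node1 commented out; all other options stay as posted above. Presumably this also stops node1 from sending its data to the other nodes, so it is meant as a diagnostic step rather than a fix.

```toml
[network]
# Temporarily disabled on node1 while node2 is powered off (diagnostic step only).
# nodes = ["192.168.2.2:8136", "192.168.3.3:8136"]
```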

Albibek commented 5 years ago

Could you please try this on the new version to check whether the problem still exists?

Albibek commented 3 years ago

I believe this has been fixed in 0.7.0. Feel free to open a new issue if this persists.