I think we can remove the every-minute logging of gossip connectivity. The other logging only occurs when gossip connectivity changes. It shouldn't be changing frequently, and we should figure out why it is.
Yeah, I'll check out what's going on with gossip after the tpc-c runs are done, or more likely I'll just spin up a cluster using smaller nodes later on to focus on the gossip in a more cost-effective way.
However, there will always be a lot of logging triggered every time one or more nodes get restarted as the gossip network adjusts to the change(s) and stabilizes, which will certainly add up during an event like a rolling upgrade.
On a 3 node cluster we gossip at ~2KB/sec. On a 9 node cluster we gossip at ~12KB/sec. On a 27 node cluster we gossip at ~55KB/sec. None of these numbers scale up to 37GB gossiped over 2 hours. I think there is something to investigate there.
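Back-of-the-envelope: even at the 27-node rate, 55KB/sec for 2 hours is 55KB × 7200 ≈ 400MB, roughly two orders of magnitude less than the 37GB observed.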
We both gossip the cluster-id and include it in every gossip.Request. That seems excessive, though I doubt it has a significant impact.
On a 3 node cluster, over the course of 1 minute we send/receive the following keys via gossip:
key                 count   bytes
gossip-clients:1      138    6348
cluster-id             57    3363
cluster-id             53    3339
first-range            35    2870
sentinel               38    2394
gossip-clients:3       45    2115
store:1                 6    2100
gossip-clients:2       35    1680
liveness:2             15     975
liveness:3             13     845
liveness:1             11     715
store:3                 2     708
store:2                 2     708
The second column is the number of times the key was sent/received. The third column is the size of the gossiped value times the number of times it was gossiped (providing an indication of what fraction of the gossiped bytes it is consuming).
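A per-key counter along these lines (a simplified sketch, not the instrumentation actually used) is enough to produce a table like this:

package main

import (
	"fmt"
	"sort"
)

// keyStats tracks, per gossip key, how many times it was sent/received
// and the total bytes those copies accounted for.
type keyStats struct {
	count int
	bytes int
}

func main() {
	stats := map[string]*keyStats{}
	record := func(key string, size int) {
		if stats[key] == nil {
			stats[key] = &keyStats{}
		}
		stats[key].count++
		stats[key].bytes += size
	}

	// In the real experiment record() would sit in the gossip
	// send/receive path; here we just feed it a few made-up samples.
	record("cluster-id", 59)
	record("cluster-id", 59)
	record("gossip-clients:1", 46)

	keys := make([]string, 0, len(stats))
	for k := range stats {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		return stats[keys[i]].bytes > stats[keys[j]].bytes
	})
	for _, k := range keys {
		fmt.Println(k, stats[k].count, stats[k].bytes)
	}
}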
Note that the sizes above are only measuring gossip.Info. In my testing, this only accounts for ~40% of the gossip traffic. The other 60% is gossip.{Request,Response} overhead. I suspect most of the extra data is in gossip.HighWaterStamps, though I haven't confirmed that.
We need to gossip the sentinel frequently, but there is no need to be gossiping the cluster-id and first-range so fast.
The numbers I posted above were the sent/received gossip infos from a single node. The mathematically inclined will notice that we sent/received 138 gossip-clients:1 keys. I looked into this a bit more and it turns out we sent 138 keys during that one-minute time frame. But something is off here because we are only supposed to send that key every 2 seconds, so we should have gossiped it 60 times. A bit more instrumentation shows that every time AddInfo is called for that key, we send it to each connected node twice and sometimes three times. I'm not sure why. Definitely a bug.
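If I'm counting right, that matches a resend-per-send theory: the 60 expected sends, each duplicated (120), plus ~18 sent a third time, gets you to 138.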
I'm pretty sure the problem is due to how we update highWaterStamps. For server connections, we only update highWaterStamps upon receipt of a new request from the remote node. But that means that if we have a new info to send, we'll keep sending it until the remote node sends another request. I'm pretty sure the same problem occurs for the client side of the connection. Rather than waiting for an update from the remote node, I think we need to update our view of the remote's highWaterStamps as soon as we send them something.
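To sketch the direction I mean (simplified stand-in types, not the actual gossip code), the per-connection bookkeeping would advance our view of the remote's high water stamps at send time rather than waiting for the remote's next request:

package main

import "fmt"

// Info is a stand-in for a gossiped value: an origin node and a
// monotonically increasing timestamp.
type Info struct {
	NodeID    int
	Timestamp int64
}

// peer tracks, per origin node, the highest timestamp we believe the
// remote side has already seen.
type peer struct {
	remoteHighWater map[int]int64
}

// deltaFor returns the infos the remote hasn't seen yet according to
// our view of its high water stamps.
func (p *peer) deltaFor(infos []Info) []Info {
	var delta []Info
	for _, i := range infos {
		if i.Timestamp > p.remoteHighWater[i.NodeID] {
			delta = append(delta, i)
		}
	}
	return delta
}

// send transmits the delta and, crucially, advances our view of the
// remote's high water stamps immediately instead of waiting for the
// remote to echo them back in its next request.
func (p *peer) send(infos []Info) {
	for _, i := range p.deltaFor(infos) {
		fmt.Printf("sending %+v\n", i)
		p.remoteHighWater[i.NodeID] = i.Timestamp
	}
}

func main() {
	p := &peer{remoteHighWater: map[int]int64{}}
	infos := []Info{{NodeID: 1, Timestamp: 10}}
	p.send(infos) // sent once
	p.send(infos) // no resend: we already recorded that the remote has it
}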
See https://github.com/cockroachdb/cockroach/issues/30126. I think that is an explanation for the gigabytes of gossip traffic.
@a-robinson Do you think there is anything else to address here for 2.1? My understanding is that gossip stabilized on your 100+ node cluster and this spam just affected startup of the cluster.
Unfortunately, that doesn't seem to be the case. I saved the latest log file from the cluster before taking it down, and all the log files have a lot of the connectivity logs even though no nodes were being added/removed around that time. Two nodes were down due to the bad_alloc issue, but they had been down for well over an hour before the time period these logs cover. Let me know if you'd like to see them for yourself.
I imagine the cull logic in gossip that closes the least useful connection is causing a lot of this, and at a high level I'm not sure that connectivity logging on changes can happily coexist with the culling of an outgoing connection from every node every minute, or at least not on large clusters.
Also, orthogonally from the thrashing, the logging during startup is excessive. We log every time we get new connectivity info that isn't identical to the last thing we logged, but during startup a node can get gossip info piece by piece, leading to log output like the below that keeps going on and on until all the nodes' info is received:
I180918 20:02:51.829369 55 gossip/client.go:129 [n?] started gossip client to 35.229.39.61:26257
I180918 20:02:51.836827 168 gossip/gossip.go:942 [n?] gossip connectivity
n47 [sentinel];
I180918 20:02:51.837244 168 gossip/gossip.go:942 [n?] gossip connectivity
n47 [sentinel];
n32 -> n95; n32 -> n96; n32 -> n122; n32 -> n132; n32 -> n135;
I180918 20:02:51.837609 168 gossip/gossip.go:942 [n?] gossip connectivity
n47 [sentinel];
n32 -> n95; n32 -> n96; n32 -> n122; n32 -> n132; n32 -> n135; n66 -> n27; n66 -> n55; n66 -> n56; n66 -> n88; n66 -> n102; n81 -> n9; n81 -> n12; n81 -> n66; n81 -> n76; n81 -> n87;
I180918 20:02:51.837920 168 gossip/gossip.go:942 [n?] gossip connectivity
n47 [sentinel];
n32 -> n95; n32 -> n96; n32 -> n122; n32 -> n132; n32 -> n135; n37 -> n8; n37 -> n64; n37 -> n65; n37 -> n97; n37 -> n101; n52 -> n5; n52 -> n19; n52 -> n68; n52 -> n79; n52 -> n106; n55 -> n12; n55 -> n58; n55 -> n68; n55 -> n75; n55 -> n96; n66 -> n27; n66 -> n55; n66 -> n56; n66 -> n88; n66 -> n102; n71 -> n43; n71 -> n75; n71 -> n87; n71 -> n88; n71 -> n128; n81 -> n9; n81 -> n12; n81 -> n66; n81 -> n76; n81 -> n87; n92 -> n10; n92 -> n26; n92 -> n75; n92 -> n94; n92 -> n132; n106 -> n2; n106 -> n10; n106 -> n61; n106 -> n70; n106 -> n93;
I180918 20:02:51.838131 168 gossip/gossip.go:942 [n?] gossip connectivity
n47 [sentinel];
n32 -> n95; n32 -> n96; n32 -> n122; n32 -> n132; n32 -> n135; n37 -> n8; n37 -> n64; n37 -> n65; n37 -> n97; n37 -> n101; n52 -> n5; n52 -> n19; n52 -> n68; n52 -> n79; n52 -> n106; n55 -> n12; n55 -> n58; n55 -> n68; n55 -> n75; n55 -> n96; n66 -> n27; n66 -> n55; n66 -> n56; n66 -> n88; n66 -> n102; n71 -> n43; n71 -> n75; n71 -> n87; n71 -> n88; n71 -> n128; n81 -> n9; n81 -> n12; n81 -> n66; n81 -> n76; n81 -> n87; n92 -> n10; n92 -> n26; n92 -> n75; n92 -> n94; n92 -> n132; n104 -> n9; n104 -> n98; n104 -> n110; n104 -> n115; n104 -> n128; n106 -> n2; n106 -> n10; n106 -> n61; n106 -> n70; n106 -> n93;
I180918 20:02:51.838544 168 gossip/gossip.go:942 [n?] gossip connectivity
n47 [sentinel];
n7 -> n37; n7 -> n40; n7 -> n42; n7 -> n71; n7 -> n78; n26 -> n29; n26 -> n58; n26 -> n118; n26 -> n120; n26 -> n127; n32 -> n95; n32 -> n96; n32 -> n122; n32 -> n132; n32 -> n135; n37 -> n8; n37 -> n64; n37 -> n65; n37 -> n97; n37 -> n101; n47 -> n11; n47 -> n13; n47 -> n23; n47 -> n76; n47 -> n78; n52 -> n5; n52 -> n19; n52 -> n68; n52 -> n79; n52 -> n106; n55 -> n12; n55 -> n58; n55 -> n68; n55 -> n75; n55 -> n96; n66 -> n27; n66 -> n55; n66 -> n56; n66 -> n88; n66 -> n102; n71 -> n43; n71 -> n75; n71 -> n87; n71 -> n88; n71 -> n128; n81 -> n9; n81 -> n12; n81 -> n66; n81 -> n76; n81 -> n87; n92 -> n10; n92 -> n26; n92 -> n75; n92 -> n94; n92 -> n132; n104 -> n9; n104 -> n98; n104 -> n110; n104 -> n115; n104 -> n128; n106 -> n2; n106 -> n10; n106 -> n61; n106 -> n70; n106 -> n93; n116 -> n12; n116 -> n18; n116 -> n69; n116 -> n73; n116 -> n76;
I180918 20:02:51.838867 168 gossip/gossip.go:942 [n?] gossip connectivity
n47 [sentinel];
n7 -> n37; n7 -> n40; n7 -> n42; n7 -> n71; n7 -> n78; n26 -> n29; n26 -> n58; n26 -> n118; n26 -> n120; n26 -> n127; n30 -> n4; n30 -> n31; n30 -> n34; n30 -> n63; n30 -> n79; n32 -> n95; n32 -> n96; n32 -> n122; n32 -> n132; n32 -> n135; n37 -> n8; n37 -> n64; n37 -> n65; n37 -> n97; n37 -> n101; n45 -> n37; n45 -> n46; n45 -> n58; n45 -> n77; n45 -> n85; n47 -> n11; n47 -> n13; n47 -> n23; n47 -> n76; n47 -> n78; n52 -> n5; n52 -> n19; n52 -> n68; n52 -> n79; n52 -> n106; n55 -> n12; n55 -> n58; n55 -> n68; n55 -> n75; n55 -> n96; n66 -> n27; n66 -> n55; n66 -> n56; n66 -> n88; n66 -> n102; n71 -> n43; n71 -> n75; n71 -> n87; n71 -> n88; n71 -> n128; n72 -> n17; n72 -> n18; n72 -> n77; n72 -> n90; n72 -> n130; n81 -> n9; n81 -> n12; n81 -> n66; n81 -> n76; n81 -> n87; n92 -> n10; n92 -> n26; n92 -> n75; n92 -> n94; n92 -> n132; n94 -> n42; n94 -> n43; n94 -> n115; n94 -> n129; n94 -> n134; n104 -> n9; n104 -> n98; n104 -> n110; n104 -> n115; n104 -> n128; n106 -> n2; n106 -> n10; n106 -> n61; n106 -> n70; n106 -> n93; n116 -> n12; n116 -> n18; n116 -> n69; n116 -> n73; n116 -> n76; n130 -> n4; n130 -> n11; n130 -> n19; n130 -> n20; n130 -> n65;
I180918 20:02:51.839272 168 gossip/gossip.go:942 [n?] gossip connectivity
n47 [sentinel];
n7 -> n37; n7 -> n40; n7 -> n42; n7 -> n71; n7 -> n78; n26 -> n29; n26 -> n58; n26 -> n118; n26 -> n120; n26 -> n127; n30 -> n4; n30 -> n31; n30 -> n34; n30 -> n63; n30 -> n79; n32 -> n95; n32 -> n96; n32 -> n122; n32 -> n132; n32 -> n135; n37 -> n8; n37 -> n64; n37 -> n65; n37 -> n97; n37 -> n101; n45 -> n37; n45 -> n46; n45 -> n58; n45 -> n77; n45 -> n85; n47 -> n11; n47 -> n13; n47 -> n23; n47 -> n76; n47 -> n78; n52 -> n5; n52 -> n19; n52 -> n68; n52 -> n79; n52 -> n106; n55 -> n12; n55 -> n58; n55 -> n68; n55 -> n75; n55 -> n96; n63 -> n17; n63 -> n36; n63 -> n39; n63 -> n101; n63 -> n119; n66 -> n27; n66 -> n55; n66 -> n56; n66 -> n88; n66 -> n102; n71 -> n43; n71 -> n75; n71 -> n87; n71 -> n88; n71 -> n128; n72 -> n17; n72 -> n18; n72 -> n77; n72 -> n90; n72 -> n130; n81 -> n9; n81 -> n12; n81 -> n66; n81 -> n76; n81 -> n87; n92 -> n10; n92 -> n26; n92 -> n75; n92 -> n94; n92 -> n132; n94 -> n42; n94 -> n43; n94 -> n115; n94 -> n129; n94 -> n134; n104 -> n9; n104 -> n98; n104 -> n110; n104 -> n115; n104 -> n128; n106 -> n2; n106 -> n10; n106 -> n61; n106 -> n70; n106 -> n93; n116 -> n12; n116 -> n18; n116 -> n69; n116 -> n73; n116 -> n76; n130 -> n4; n130 -> n11; n130 -> n19; n130 -> n20; n130 -> n65;
...
I've been tossing around the idea of rate limiting this logging to happen no more than once a minute, which would alleviate most of the pain. Would that still accomplish most of the original goal? I realize we've discovered some gossip problems lately, but I'm not really convinced that this information belongs in non-verbose logs.
It's true that the gossip connectivity logs might be fighting yesterday's war. Perhaps we should re-add them to the once-per-minute logging, but only output the message if it has changed since the last log line.
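Roughly, as an illustrative sketch (not the actual change to gossip.go), the periodic status loop could own the logging and skip it when nothing changed:

package main

import (
	"log"
	"time"
)

// connectivityLogger emits the connectivity string only when it differs
// from the last line it logged; it is driven by the once-per-minute
// status loop, so it can never log more often than that.
type connectivityLogger struct {
	lastLogged string
}

func (c *connectivityLogger) maybeLog(connectivity string) {
	if connectivity == c.lastLogged {
		return // unchanged since the previous line; stay quiet
	}
	c.lastLogged = connectivity
	log.Printf("gossip connectivity\n  %s", connectivity)
}

// currentConnectivity is a placeholder for however the connectivity
// string is computed today.
func currentConnectivity() string {
	return "n47 [sentinel]; n32 -> n95; n32 -> n96; ..."
}

func main() {
	c := &connectivityLogger{}
	// Stand-in for the periodic gossip status loop.
	for range time.Tick(time.Minute) {
		c.maybeLog(currentConnectivity())
	}
}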
I spun up a 128 node cluster for tpc-c 50k testing (it was meant to be / will soon be 120, but my brain accidentally rounded up when creating it), and the gossip connectivity logs are overwhelming. First of all, they're huge, taking up almost my entire screen with just one log line:
Secondly, they're being logged a few times per second, making it impossible to read anything else out of the logs.
The second point is probably indicative of some amount of instability in the gossip network that should be investigated in its own right. However, even if it was only being logged once a minute as part of the gossip status, this would be quite annoying. The entire "gossip status" message literally doesn't fit on the screen of my 15" MacBook:
Also, wow that's a lot of data being transmitted over gossip (the server indicates 25GB sent / 12GB received). I suspect a lot of it is the connectivity map being resent in its entirety over and over.
It'd be reasonable to say we should first understand/fix the fact that there are frequent changes happening in the gossip network. I'm still worried about how things will look after that, though.
@petermattis