apache / trafficcontrol

Apache Traffic Control is an Open Source implementation of a Content Delivery Network
https://trafficcontrol.apache.org/
Apache License 2.0
1.07k stars 344 forks source link

Riak cluster cleanup on cluster rebuid #3030

Closed traeak closed 3 years ago

traeak commented 5 years ago

In traffic_ops_golang, whenever a riak cluster is torn down it isn't being stopped in a totally correct way according to the basho riak client package.

When a cluster stops executing we see in the error log per cluster server:

riak error: 2018/10/03 07:41:43.841775 [ERROR] [connectionManager] error when closing connection in stop() write tcp localhost:53188->cluster_server:8087: i/o timeout

The riak pooled cluster greatly mitigates this error but doesn't remove it. Previously a defer function was being called to stop cluster execution for each and every query, spamming the logs with the above message(s).

The riak pooled cluster attaches a similar stop cluster function to the go garbage collector via (riak_services.go):

runtime.SetFinalizer(sharedCluster, sharedCluster.Stop())

Now this "error" only prints when the riak cluster machines are changed in TODB.

This message in error.log can be recreated by using traffic portal to offline (or online) a riak cluster server and then perform an api query against anything ssl key related.

rob05c commented 5 years ago

I missed the SetFinalizer call being added. That needs removed. Go "finalizers" should be used very, very carefully, and very, very sparingly. They don't behave predictably, and have all kinds of subtle nuances.

For example, they're all run in a single goroutine; does sharedCluster.Stop() return very quickly? I doubt it. Then it blocks every other finalizer in the entire application (potentially in libraries we use, and aren't even aware of).

That's just one example. The fact is, the Go garbage collector simply wasn't designed for finalizers. Go doesn't do RAII. And runtime.SetFinalizer is subtle and dangerous, and should be avoided if at all possible. Code that needs to "destruct" should be refactored to do so reliably and predictably.

In particular, is there any reason not to change that to

go func() {
  if err := sharedCluster.Stop(); err != nil {
    log.Errorln("Stopping Riak cluster: " + err.Error())
  }
}()

?

traeak commented 5 years ago

I don't mind removing it/changing it but the above statement from the old code kicks out the same errors that the finalizer function does, all from the internal basho riak client.

rawlinp commented 3 years ago

Closing this as the Riak Traffic Vault implementation is going to be deleted as a whole, so minor improvements to it are not really prudent at this time.