cloudfoundry-community / stackdriver-tools

Stackdriver Nozzle for Cloud Foundry Loggregator, Host Monitoring Agents BOSH Release
Apache License 2.0

nozzle: crash on MaxRetriesReached #151

Closed by johnsonj 6 years ago

johnsonj commented 6 years ago

If we reach the maximum number of connection retries, we need to do something besides spin. Log these errors as fatal, which will cause a crash (and let the process be restarted with a fresh retry counter).

Fixes #149

/cc @knyar @fluffle



fluffle commented 6 years ago

If it fixes a bug then I guess this is fine, but why is there a connection retry limit? All you're doing when committing suicide with Fatal() is resetting that limit to zero. This seems like a band-aid to work around poor client behaviour.

The default retry limit is 1000:

https://github.com/cloudfoundry/noaa/blob/master/consumer/async.go#L23

But it can be changed:

https://github.com/cloudfoundry/noaa/blob/master/consumer/async.go#L47

Unfortunately we can't set it to e.g. 0 to have infinite retries:

https://github.com/cloudfoundry/noaa/blob/master/consumer/async.go#L328

It looks like the retry counter is not being reset by a successful reconnection. A successful connection will also not reset the sleep time, which means that after 8 disconnects + reconnects -- for the other 992 reconnects -- the nozzle is waiting 1 minute between reconnects and effectively dropping firehose data during that time.

https://github.com/cloudfoundry/noaa/blob/master/consumer/async.go#L239

This seems like a bug in the noaa consumer. I can see an argument for limiting the number of contiguous unsuccessful retries, but if the client library successfully reconnects, both the counter and retry delay should be reset.
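The policy argued for here can be sketched as a counter that only tracks contiguous failures (a toy model for illustration, not noaa's actual code):

```go
package main

import "fmt"

// maxRetriesHit replays a sequence of connection attempts (true =
// success, false = failure) against a reset-on-success policy:
// every successful reconnect zeroes the failure counter, so only
// contiguous failures count toward the limit.
func maxRetriesHit(outcomes []bool, limit int) bool {
	failures := 0
	for _, ok := range outcomes {
		if ok {
			failures = 0 // successful reconnect: reset the budget
			continue
		}
		if failures++; failures >= limit {
			return true
		}
	}
	return false
}

func main() {
	// Failures interleaved with successes never trip the limit...
	fmt.Println(maxRetriesHit([]bool{false, true, false, true, false, true}, 3)) // false
	// ...but the same number of contiguous failures does.
	fmt.Println(maxRetriesHit([]bool{false, false, false}, 3)) // true
}
```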

Taking a step back, why is the nozzle getting disconnected from the firehose so regularly? I seem to recall seeing some form of websocket timeout errors in the nozzle's logs. There may be another bug here too...


Review status: 0 of 3 files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

fluffle commented 6 years ago

:LGTM: but I filed https://github.com/cloudfoundry/noaa/issues/37 because we shouldn't have to do this.



fluffle commented 6 years ago

Reviewed 3 of 3 files at r1. Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.



johnsonj commented 6 years ago

Thanks @fluffle ! Agreed on the noaa bug and need to investigate why this is happening.

fluffle commented 6 years ago

The NOAA folks correctly pointed out that I'd missed the counters being reset in the callback run on connect. So something else is wrong, which is frustrating because the symptoms match counters not being reset.

Having said that, with the latest nozzle builds we have on our PCF instance I can't see any websocket disconnections, so I wonder if the disconnects were an artifact of reducing the batch size to 1, and thus taking ~2-3 minutes to send all timeseries to Stackdriver.

fluffle commented 6 years ago

A test with the batch size reduced to 1 did not result in the nozzle getting behind, but this may be because it is no longer able to post all the metrics. We have a work-around for now, so I guess I'll wait and see if we have any suicides.