Closed: johnsonj closed this 6 years ago
If it fixes a bug then I guess this is fine, but why is there a connection retry limit? All you're doing when committing suicide with Fatal() is resetting that limit to zero. This seems like a band-aid to work around poor client behaviour.
The default retry limit is 1000:
https://github.com/cloudfoundry/noaa/blob/master/consumer/async.go#L23
But it can be changed:
https://github.com/cloudfoundry/noaa/blob/master/consumer/async.go#L47
Unfortunately we can't set it to e.g. 0 to have infinite retries:
https://github.com/cloudfoundry/noaa/blob/master/consumer/async.go#L328
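A minimal sketch of why 0 can't mean "infinite" here: if the attempt number is compared directly against the limit (an assumption about the linked logic, not a copy of it), then a limit of 0 means the very first failure is terminal.

```go
package main

import "fmt"

// shouldRetry mirrors, as an assumption for this sketch, a loop that
// compares the attempt number against a fixed limit. With maxRetries
// set to 0 the first failure is already fatal, so 0 cannot be used as
// a sentinel for "retry forever".
func shouldRetry(attempt, maxRetries int) bool {
	return attempt < maxRetries
}

func main() {
	fmt.Println(shouldRetry(0, 0))       // false: a limit of 0 never retries
	fmt.Println(shouldRetry(0, 1000))    // true: the default allows retries
	fmt.Println(shouldRetry(1000, 1000)) // false: budget exhausted
}
```

Supporting infinite retries would require an explicit sentinel check (e.g. `maxRetries < 0`) rather than a plain comparison.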
It looks like the retry counter is not reset by a successful reconnection. A successful connection also does not reset the sleep time, so after 8 disconnect/reconnect cycles the delay hits the 1-minute cap; for the remaining 992 reconnects the nozzle then waits a full minute each time, effectively dropping firehose data during those gaps.
https://github.com/cloudfoundry/noaa/blob/master/consumer/async.go#L239
This seems like a bug in the noaa consumer. I can see an argument for limiting the number of contiguous unsuccessful retries, but if the client library successfully reconnects, both the counter and retry delay should be reset.
Taking a step back, why is the nozzle getting disconnected from the firehose so regularly? I seem to recall seeing some form of websocket timeout errors in the nozzle's logs. There may be another bug here too...
Review status: 0 of 3 files reviewed at latest revision, all discussions resolved, some commit checks failed.
Comments from Reviewable
I filed https://github.com/cloudfoundry/noaa/issues/37 because we shouldn't have to do this.
Reviewed 3 of 3 files at r1. Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.
Thanks @fluffle ! Agreed on the noaa bug and need to investigate why this is happening.
The NOAA folks correctly pointed out that I'd missed the counters being reset in the callback run on connect. So something else is wrong, which is frustrating because the symptoms match counters not being reset.
Having said that, with the latest nozzle builds we have on our PCF instance I can't see any websocket disconnections, so I wonder if the disconnects were an artifact of reducing the batch size to 1, and thus taking ~2-3 minutes to send all timeseries to Stackdriver.
A test with the batch size reduced to 1 did not result in the nozzle getting behind, but this may be because it is no longer able to post all the metrics. We have a work-around for now, so I guess I'll wait and see if we have any suicides.
If we reach the max number of connection retries we need to do something besides spin. Log them as fatal errors, which will cause a crash.
Fixes #149
/cc @knyar @fluffle