cloudfoundry / noaa

NOAA is a client library to consume metric and log messages from Doppler.
Apache License 2.0
23 stars 23 forks source link

Connection retry counter and backoff delay should be reset on successful connection. #37

Closed fluffle closed 7 years ago

fluffle commented 7 years ago

Hi,

I come here from https://github.com/cloudfoundry-community/stackdriver-tools/pull/151, where code is being added to the stackdriver nozzle to force it to commit suicide via lager.Fatal when the connection retry limit is reached.

For reasons as yet unknown (guess what I'm doing after I have submit here), the nozzle in our test installation of PCF gets disconnected from the firehose pretty regularly, but is always able to reconnect. It uses the consumer's inbuilt retry logic to do this. Because the counter and backoff are not reset, after 8 such disconnections (and with 992 to go until suicide) the nozzle is waiting 1 minute between reconnects, and effectively dropping firehose data.

I can see good reasons for having a limit on the number of consecutive unsuccessful retry attempts, but users of noaa/consumer should not have to Fatal to reset these counters. We could manage our own retries, but then why bother having the functionality in noaa/consumer at all?

Simply setting the retry limit to 2^63-1 is not enough because then consecutive retries are not limited and the backoff quickly plateaus at 1 minute, and setting the max backoff to 500ms breaks exponential backoff when retrying unsuccessful connections.

/cc @johnsonj @knyar

cf-gitbot commented 7 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/152718148

The labels on this github issue will be updated when the story is started.

bradylove commented 7 years ago

@fluffle we have investigated your issue and found that you are using an old version of NOAA. We added retry counter resets back in February. See https://github.com/cloudfoundry/noaa/commit/c38eb201

If updating does not solve your issue, feel free to re-open the issue and we can discuss further steps.

fluffle commented 7 years ago

Hi @bradylove,

No, I think I missed the resets being done in the connect callback, sorry! The strange thing is, we have c38eb20 in our vendor tree but our jobs seem to be behaving as if the counters weren't reset. I think I have more debugging to do :-(