knyar closed this issue 6 years ago
We've been moving away from panicking on error in the nozzle, to avoid lossy periods of dropped buffers and reconnect time, so I'd like to see a solution that fixes this by refreshing what's stale.
We should evaluate whether it's easier to fix in our code or in go-cfclient. This isn't a problem specific to our code, but we could work around it if it's messy upstream.
This should be resolved. Please re-open if you see this again.
We've seen a case of a refresh token used by the nozzle expiring, which resulted in the nozzle process never being able to reconnect to Firehose when it disconnects. Relevant log messages (human-readable timestamp in UTC prepended to each log message):
The refresh token (which I redacted) in this case had an issue time of 1514893023 (Jan 2 11:37:03 UTC), so it was the same refresh token that was issued when the nozzle process started. I don't yet have a good understanding of how refresh tokens are supposed to be refreshed, but it clearly did not happen here.
The nasty part is that the nozzle remains in this broken state indefinitely and needs to be restarted manually.
Two possible workarounds come to mind:
1. Recreate the cfClient when cfClient.GetToken() fails. This will probably require moving cfclient creation closer to firehose.go (which might be tricky, since the same client is also used in AppInfoRepository).
2. Panic in cfClientTokenRefresh.RefreshAuthToken() if a token cannot be refreshed several times in a row, making sure the process is restarted and all tokens are refreshed.

@johnsonj, any thoughts?
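To make the two workarounds more concrete, here is a rough sketch that combines them: count consecutive refresh failures and, once a threshold is hit, rebuild the client from scratch instead of panicking. Everything in it is an assumption for illustration. `tokenSource`, `refreshGuard`, and `fakeSource` are hypothetical names, not the go-cfclient or nozzle API, and the factory stands in for whatever performs a full login with the original credentials.

```go
package main

import (
	"errors"
	"fmt"
)

// tokenSource is a hypothetical stand-in for the part of the CF client
// that hands out access tokens. It is not the go-cfclient interface.
type tokenSource interface {
	Token() (string, error)
}

// refreshGuard wraps a tokenSource and replaces it with a freshly created
// one after maxFailures consecutive errors. Not safe for concurrent use;
// a real version shared with other callers would need a mutex.
type refreshGuard struct {
	newSource   func() (tokenSource, error) // assumed to perform a full re-login
	current     tokenSource
	failures    int
	maxFailures int
}

func newRefreshGuard(factory func() (tokenSource, error), maxFailures int) (*refreshGuard, error) {
	src, err := factory()
	if err != nil {
		return nil, fmt.Errorf("initial login: %w", err)
	}
	return &refreshGuard{newSource: factory, current: src, maxFailures: maxFailures}, nil
}

// Token returns a token from the current source. After maxFailures
// consecutive errors it assumes the refresh token itself is stale,
// recreates the source, and retries once.
func (g *refreshGuard) Token() (string, error) {
	tok, err := g.current.Token()
	if err == nil {
		g.failures = 0
		return tok, nil
	}
	g.failures++
	if g.failures < g.maxFailures {
		return "", fmt.Errorf("token refresh failed (%d/%d): %w", g.failures, g.maxFailures, err)
	}
	src, ferr := g.newSource()
	if ferr != nil {
		return "", fmt.Errorf("recreating client: %w", ferr)
	}
	g.current = src
	g.failures = 0
	return g.current.Token()
}

// fakeSource simulates a client whose refresh token expires after
// `good` successful calls. It exists only to exercise the sketch.
type fakeSource struct {
	id   int
	good int
}

func (f *fakeSource) Token() (string, error) {
	if f.good <= 0 {
		return "", errors.New("invalid_token: refresh token expired")
	}
	f.good--
	return fmt.Sprintf("token-from-client-%d", f.id), nil
}
```

With a threshold of 3 and a fake source that survives one call, the first `Token()` succeeds, the next two return errors, and the fourth triggers a rebuild and returns a token from the second client. Whether the rebuild belongs in the nozzle or upstream is still the open question; the sketch only shows that it can be done without restarting the process.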