Nozzle can't recover from an expired refresh token

knyar commented 6 years ago

We've seen a case of a refresh token used by the nozzle expiring, which resulted in the nozzle process never being able to reconnect to Firehose when it disconnects. Relevant log messages (human-readable timestamp in UTC prepended to each log message):

2018-01-02T11:37:03.646067 {"timestamp":"1514893023.646067142","source":"stackdriver-nozzle","message":"stackdriver-nozzle.arguments","log_level":1,"data":{...}}
...nozzle started, working fine for a while. Then disconnect happens...
2018-01-03T20:13:53.360623 {"timestamp":"1515010433.360623360","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"read tcp [redacted]:51612-\u003e[redacted]:443: i/o timeout"}}
2018-01-03T20:13:53.886122 {"timestamp":"1515010433.886122227","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"Error getting bearer token: oauth2: cannot fetch token: 401 Unauthorized\nResponse: {\"error\":\"invalid_token\",\"error_description\":\"Invalid refresh token (expired): [redacted] expired at Tue Jan 02 20:37:03 UTC 2018\"}"}}
2018-01-03T20:13:54.922373 {"timestamp":"1515010434.922372818","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"Error getting bearer token: oauth2: cannot fetch token: 401 Unauthorized\nResponse: {\"error\":\"invalid_token\",\"error_description\":\"Invalid refresh token (expired): [redacted] expired at Tue Jan 02 20:37:03 UTC 2018\"}"}}
2018-01-03T20:13:56.946103 {"timestamp":"1515010436.946102858","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"Error getting bearer token: oauth2: cannot fetch token: 401 Unauthorized\nResponse: {\"error\":\"invalid_token\",\"error_description\":\"Invalid refresh token (expired): [redacted] expired at Tue Jan 02 20:37:03 UTC 2018\"}"}}

The refresh token (which I redacted) in this case had issue time of 1514893023 (Jan 2 11:37:03 UTC), so it was the same refresh token which got issued when the nozzle process started. I don't yet have a good understanding of how refresh tokens are supposed to be refreshed, but it clearly did not happen here.

The nasty part is that the nozzle remains in such (broken) state indefinitely and needs to be restarted manually.

Two possible workarounds come to mind:

Like suggested in https://github.com/cloudfoundry-community/go-cfclient/issues/34, recreate the cfclient from scratch when cfClient.GetToken() fails. This will probably require moving cfclient creation closer to firehose.go (which might be tricky, since the same client is also used in AppInfoRepository).
Just panic in cfClientTokenRefresh.RefreshAuthToken() if a token cannot be refreshed several times in a row, making sure the process is restarted and all tokens are refreshed.

@johnsonj, any thoughts?

johnsonj commented 6 years ago

We've been moving away from panic on error in the nozzle to prevent lossy periods of dropped buffers/reconnect time so I'd like to see a solution that solves it by refreshing whats stale.

We should evaluate if it's easier to fix in our code or in go-cfclient. This isn't a problem specific to our code but we could workaround it if it's messy upstream.

johnsonj commented 6 years ago

This should be resolved. Please re-open if you see this again.

cloudfoundry-community / stackdriver-tools

Nozzle can't recover from an expired refresh token #177