cloudfoundry-community / stackdriver-tools

Stackdriver Nozzle for Cloud Foundry Loggregator, Host Monitoring Agents BOSH Release
Apache License 2.0
21 stars 13 forks source link

Nozzle can't recover from an expired refresh token #177

Closed knyar closed 6 years ago

knyar commented 6 years ago

We've seen a case of a refresh token used by the nozzle expiring, which resulted in the nozzle process never being able to reconnect to Firehose when it disconnects. Relevant log messages (human-readable timestamp in UTC prepended to each log message):

2018-01-02T11:37:03.646067 {"timestamp":"1514893023.646067142","source":"stackdriver-nozzle","message":"stackdriver-nozzle.arguments","log_level":1,"data":{...}}
...nozzle started, working fine for a while. Then disconnect happens...
2018-01-03T20:13:53.360623 {"timestamp":"1515010433.360623360","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"read tcp [redacted]:51612-\u003e[redacted]:443: i/o timeout"}}
2018-01-03T20:13:53.886122 {"timestamp":"1515010433.886122227","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"Error getting bearer token: oauth2: cannot fetch token: 401 Unauthorized\nResponse: {\"error\":\"invalid_token\",\"error_description\":\"Invalid refresh token (expired): [redacted] expired at Tue Jan 02 20:37:03 UTC 2018\"}"}}
2018-01-03T20:13:54.922373 {"timestamp":"1515010434.922372818","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"Error getting bearer token: oauth2: cannot fetch token: 401 Unauthorized\nResponse: {\"error\":\"invalid_token\",\"error_description\":\"Invalid refresh token (expired): [redacted] expired at Tue Jan 02 20:37:03 UTC 2018\"}"}}
2018-01-03T20:13:56.946103 {"timestamp":"1515010436.946102858","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"Error getting bearer token: oauth2: cannot fetch token: 401 Unauthorized\nResponse: {\"error\":\"invalid_token\",\"error_description\":\"Invalid refresh token (expired): [redacted] expired at Tue Jan 02 20:37:03 UTC 2018\"}"}}

The refresh token (which I redacted) in this case had issue time of 1514893023 (Jan 2 11:37:03 UTC), so it was the same refresh token which got issued when the nozzle process started. I don't yet have a good understanding of how refresh tokens are supposed to be refreshed, but it clearly did not happen here.

The nasty part is that the nozzle remains in such (broken) state indefinitely and needs to be restarted manually.

Two possible workarounds come to mind:

@johnsonj, any thoughts?

johnsonj commented 6 years ago

We've been moving away from panic on error in the nozzle to prevent lossy periods of dropped buffers/reconnect time so I'd like to see a solution that solves it by refreshing whats stale.

We should evaluate if it's easier to fix in our code or in go-cfclient. This isn't a problem specific to our code but we could workaround it if it's messy upstream.

johnsonj commented 6 years ago

This should be resolved. Please re-open if you see this again.