Stackdriver-nozzle repeat panic (every 30 seconds)

anthonysroberts commented 6 years ago

Attachment: Panic.txt

Good morning,

We are running CF-Deployment 1.12.0, Stackdriver-tools 1.0.2 on GCP.

We continually see the following panic message, which causes the stackdriver process to fail; monit then restarts it every time. We still see metrics and logs flowing into Stackdriver itself, so we are not 100% sure whether we are experiencing any data loss at the moment. Note that this has been occurring for some time and is not linked to any specific version of CF-Deployment.

fluffle commented 6 years ago

Can you also provide the text of the Fatal error?

https://github.com/cloudfoundry-community/stackdriver-tools/blob/v1.0.2/src/stackdriver-nozzle/main.go#L59

The backtrace starts here, but because the error is being pulled from a channel the trace is basically useless for figuring out what went wrong :-(
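To illustrate the point about channels: the sketch below is not the nozzle's code, only a minimal stand-alone example. The `consume` function and the wording of the wrapped message are hypothetical. It shows that when an error is received from a channel, any stack trace taken at the receiver describes the receiver, so the error has to carry its own origin, for example by wrapping it with `%w` where it is produced.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// consume is a hypothetical producer. It runs in its own goroutine and
// reports failures on errs, annotated with where they came from.
func consume(errs chan<- error) {
	// io.ErrUnexpectedEOF stands in for the websocket error in this thread.
	// Wrapping with %w records the origin in the message while keeping the
	// underlying error visible to errors.Is.
	errs <- fmt.Errorf("firehose consumer: read failed: %w", io.ErrUnexpectedEOF)
}

func main() {
	errs := make(chan error, 1)
	go consume(errs)

	err := <-errs
	// A panic or fatal log here would produce a trace for main, not for
	// consume, so the message is the only record of where the error began.
	fmt.Println("fatal:", err)
	fmt.Println("is unexpected EOF:", errors.Is(err, io.ErrUnexpectedEOF))
}
```

With the origin recorded in the message, the text of the fatal error is enough to locate the failing component even though the backtrace is not.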

anthonysroberts commented 6 years ago

Attachment: issue 194 Panic.txt

Is this what's needed? There is a "websocket close, unexpected EOF" message.

fluffle commented 6 years ago

TL;DR: Nozzle error handling in 1.0.2 is a little rudimentary: it commits suicide in lieu of attempting to manage its own disconnection from and reconnection to the firehose internally.

This is fixed in develop as of (I think) #129, #143 and #151. You are almost certainly losing a small amount of firehose data to these restarts, but the loss should be very minor since the Nozzle restarts and reconnects to the firehose quickly.
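For readers unfamiliar with the pattern, here is a rough sketch of what "managing reconnection internally" can look like. This is not the code from the PRs mentioned above; `connectFunc`, `runWithReconnect`, `errDropped`, and all the parameters are hypothetical names chosen for illustration. The idea is to retry the connection with exponential backoff and only give up after a bounded number of failures, rather than exiting on the first error.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDropped is a placeholder for whatever error ends a firehose session.
var errDropped = errors.New("connection dropped")

// connectFunc is a hypothetical stand-in for "connect and consume until the
// connection ends". It returns nil on clean shutdown, or the error that
// ended the session.
type connectFunc func() error

// runWithReconnect calls connect until it returns nil or has failed
// maxAttempts times, sleeping between attempts with a delay that doubles
// each time up to maxDelay.
func runWithReconnect(connect connectFunc, maxAttempts int, base, maxDelay time.Duration) error {
	delay := base
	for attempt := 1; ; attempt++ {
		err := connect()
		if err == nil {
			return nil
		}
		if attempt >= maxAttempts {
			return fmt.Errorf("giving up after %d attempts: %w", attempt, err)
		}
		fmt.Printf("attempt %d failed: %v; retrying in %v\n", attempt, err, delay)
		time.Sleep(delay)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}

func main() {
	// Simulate a connection that drops twice and then shuts down cleanly.
	calls := 0
	connect := func() error {
		calls++
		if calls < 3 {
			return errDropped
		}
		return nil
	}
	err := runWithReconnect(connect, 5, 10*time.Millisecond, 100*time.Millisecond)
	fmt.Println("calls:", calls, "err:", err)
}
```

A process built this way only needs monit as a last resort, and less data is lost per disconnect because there is no full process restart in between.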

If you're using 1.0.2 instead of the latest 1.0.6 release, I assume running bleeding-edge code isn't something you're interested in? If you do want to live a little dangerously, we are in the process of stabilizing the development branch for a new release, and beta testers with real PCF deployments would be very useful!

fluffle commented 6 years ago

Though -- if you are getting an EOF from the firehose every minute, something else might be wrong, that doesn't seem like normal behaviour. Can you tail the firehose with e.g. cf nozzle -n for a reasonable length of time, or does that see similar disconnects? (You may need to install https://github.com/cloudfoundry-community/firehose-plugin).

anthonysroberts commented 6 years ago

On the first point: we were using v1.0.6 until a month ago, when, after a CF-Deployment upgrade, the nozzle stopped sending any data to Stackdriver. So we reverted to 1.0.2.

We are not averse to running a beta, although we are running open-source CF rather than PCF. We would be more than happy to help with testing.

I will tail the nozzle and come back. Thanks for the help to this point.

fluffle commented 6 years ago

Re the Nozzle not sending data to Stackdriver: I think we have encountered that too. If it's the same issue, the fix is #178, which is actually https://github.com/cloudfoundry-community/go-cfclient/pull/163 and https://github.com/cloudfoundry-community/go-cfclient/pull/164.

There are some caveats to beta testing: we have made significant changes to the Nozzle's behaviour which would likely mean losing continuity for timeseries data. At the moment we believe that it will be necessary to delete the old metric descriptors from Stackdriver when switching over, because otherwise it is possible that you will run into Stackdriver's limit of 500 custom metric descriptors per project. You may be able to get around this by having the beta nozzle create its timeseries in a separate stackdriver project, or by using the new filtering functionality to restrict the set of timeseries that are derived from firehose data.

If you're still interested I can reach out early next week with something that ought to be at least "release candidate" quality. Is the AOL address attached to your Github account a good place to contact you?

anthonysroberts commented 6 years ago

Yes, please contact me via the GitHub email. We are maturing our monitoring capability, and Stackdriver (and its nozzle) is at the heart of it all, so we are still interested in taking a new release (we don't have any historic data that we care about at the moment).

anthonysroberts commented 6 years ago

Good morning. Deploying the beta release candidate has fully resolved the issue logged here; we no longer see the nozzle panics. Please close, and thank you.

fluffle commented 6 years ago

That's good news, thanks :-)
