Closed: anthonysroberts closed this issue 6 years ago
Can you also provide the text of the Fatal error?
The backtrace starts here, but because the error is being pulled from a channel the trace is basically useless for figuring out what went wrong :-(
Is this what's needed? There is a "websocket close - unexpected EOF" message.
TL;DR: Nozzle error handling in 1.0.2 is a little rudimentary: it commits suicide instead of attempting to manage its own disconnection/reconnection to the firehose internally.
This is fixed in develop as of (I think) #129, #143 and #151. You are almost certainly losing a small amount of firehose data to these restarts, but the loss should be very minor since the Nozzle restarts and reconnects to the firehose quickly.
If you're using 1.0.2 instead of the latest 1.0.6 release, I assume running bleeding-edge code isn't something you're interested in? If you do want to live a little dangerously, we are in the process of stabilizing the development branch for a new release, and beta testers with real PCF deployments would be very useful!
Though, if you are getting an EOF from the firehose every minute, something else might be wrong; that doesn't seem like normal behaviour. Can you tail the firehose with e.g. cf nozzle -n for a reasonable length of time, and does it see similar disconnects? (You may need to install https://github.com/cloudfoundry-community/firehose-plugin.)
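For reference, the tail looks roughly like this (the plugin repo and plugin name below are from memory, so check the firehose-plugin README if they differ):

```
# Register the community plugin repo and install the firehose plugin
cf add-plugin-repo CF-Community https://plugins.cloudfoundry.org
cf install-plugin -r CF-Community "Firehose Plugin"

# Tail the raw firehose without filtering and watch for EOF / disconnect messages
cf nozzle -n
```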
On the first point: we were using v1.0.6 until a month ago, when, after a cf-deployment upgrade, the nozzle stopped sending any data to Stackdriver, so we reverted to 1.0.2.
We are not averse to running a beta, although we are running open-source CF rather than PCF. We would be more than happy to help with testing.
I will tail the nozzle and come back. Thanks for the help to this point.
Re the Nozzle not sending data to Stackdriver: I think we have encountered that too. If it's the same issue, the fix is #178, which is actually https://github.com/cloudfoundry-community/go-cfclient/pull/163 and https://github.com/cloudfoundry-community/go-cfclient/pull/164.
There are some caveats to beta testing: we have made significant changes to the Nozzle's behaviour, which would likely mean losing continuity for timeseries data. At the moment we believe it will be necessary to delete the old metric descriptors from Stackdriver when switching over, because otherwise you may run into Stackdriver's limit of 500 custom metric descriptors per project. You may be able to get around this by having the beta nozzle create its timeseries in a separate Stackdriver project, or by using the new filtering functionality to restrict the set of timeseries that are derived from firehose data.
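If you do end up deleting the old descriptors, a rough sketch of the cleanup using the Monitoring API directly is below; the project ID and the metric type in the DELETE call are placeholders, and the nozzle's actual descriptors will show up in the list call under the custom.googleapis.com/ prefix:

```
# Placeholders: substitute your own Stackdriver (GCP) project ID
PROJECT="your-gcp-project-id"
TOKEN="$(gcloud auth print-access-token)"

# List the custom metric descriptors currently registered in the project
curl -s -G -H "Authorization: Bearer ${TOKEN}" \
  "https://monitoring.googleapis.com/v3/projects/${PROJECT}/metricDescriptors" \
  --data-urlencode 'filter=metric.type = starts_with("custom.googleapis.com/")'

# Delete a single descriptor by its URL-encoded metric type (repeat per descriptor)
curl -s -X DELETE -H "Authorization: Bearer ${TOKEN}" \
  "https://monitoring.googleapis.com/v3/projects/${PROJECT}/metricDescriptors/custom.googleapis.com%2Fexample%2Fmetric"
```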
If you're still interested I can reach out early next week with something that ought to be at least "release candidate" quality. Is the AOL address attached to your Github account a good place to contact you?
Yes, please contact me via the GitHub email. We are maturing our monitoring capability and Stackdriver (and its nozzle) is at the heart of it all, so we are still interested in taking a new release (we don't have any historic data that we care about at the moment).
Good morning, the beta release candidate has fully resolved the issue logged here; we no longer see the nozzle panics. Please close, and thank you.
That's good news, thanks :-)
Panic.txt

Good morning,
We are running CF-Deployment 1.12.0, Stackdriver-tools 1.0.2 on GCP.
We continually see the following panic message, which results in the stackdriver process failing and monit continually restarting it. We still see metrics and logs flowing into Stackdriver itself (so we are not 100% sure whether we are experiencing any data loss at the moment). Note this has been occurring for some time (it is not linked to any specific version of CF-Deployment).
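For reference, we watch the restart behaviour on the VM itself with something like the following (the deployment and instance names here are our own and purely illustrative):

```
# SSH onto the nozzle VM (deployment/instance names are examples only)
bosh -d stackdriver-nozzle ssh stackdriver-nozzle/0

# On the VM, check monit's view of the process and whether it keeps cycling
sudo /var/vcap/bosh/bin/monit summary
```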