Closed knyar closed 6 years ago
This is an excellent batch to combine. Something along the lines of `processStats` and using the current metric name (e.g. `memoryStats.numMallocs`) as a label. We will need to document this because we are straying from reporting the metrics as they are named in Loggregator. A quick sketch in `src/stackdriver-nozzle/docs/metric-naming.md` as part of the change?
We don't want to combine the different names into one metric, we want to drop the origin from the prefix so that e.g. we only have one "numCPUS" metric instead of 26 "foo.numCPUS" metrics.
There's one problem with this: we would need to keep "origin" as a label, because "job" is not unique enough. We can see this by sampling 20k (... this took a while, maybe don't sample 20k if you try this!) numCPUS metrics from the firehose and sorting them:
$ cf nozzle -n | grep numCPUS | head -20000 | cut -f 1,5 -d\ | sort | uniq -c
270 origin:"auctioneer" job:"diego_brain"
270 origin:"bbs" job:"diego_database"
270 origin:"cc_uploader" job:"diego_brain"
270 origin:"DopplerServer" job:"doppler"
270 origin:"etcd" job:"etcd_tls_server"
270 origin:"file_server" job:"diego_brain"
270 origin:"garden-linux" job:"diego_cell"
270 origin:"gorouter" job:"router"
270 origin:"locket" job:"diego_database"
270 origin:"LoggregatorTrafficController" job:"loggregator_trafficcontroller"
60 origin:"MetronAgent" job:"clock_global"
120 origin:"MetronAgent" job:"cloud_controller"
120 origin:"MetronAgent" job:"cloud_controller_worker"
180 origin:"MetronAgent" job:"consul_server"
180 origin:"MetronAgent" job:"diego_brain"
180 origin:"MetronAgent" job:"diego_cell"
180 origin:"MetronAgent" job:"diego_database"
180 origin:"MetronAgent" job:"doppler"
180 origin:"MetronAgent" job:"etcd_tls_server"
180 origin:"MetronAgent" job:"loggregator_trafficcontroller"
120 origin:"MetronAgent" job:"nats"
60 origin:"MetronAgent" job:"nfs_server"
180 origin:"MetronAgent" job:"router"
180 origin:"MetronAgent" job:"syslog_adapter"
60 origin:"MetronAgent" job:"syslog_scheduler"
60 origin:"MetronAgent" job:"tcp_router"
120 origin:"MetronAgent" job:"uaa"
270 origin:"netmon" job:"diego_cell"
270 origin:"nsync_bulker" job:"diego_brain"
270 origin:"nsync_listener" job:"diego_brain"
270 origin:"policy-server" job:"diego_database"
270 origin:"rep" job:"diego_cell"
270 origin:"route_emitter" job:"diego_cell"
180 origin:"routing_api" job:"cloud_controller"
270 origin:"silk-daemon" job:"diego_cell"
270 origin:"ssh-proxy" job:"diego_brain"
270 origin:"stager" job:"diego_brain"
270 origin:"tcp_emitter" job:"diego_brain"
90 origin:"tcp-router" job:"tcp_router"
270 origin:"tps_listener" job:"diego_brain"
270 origin:"tps_watcher" job:"diego_brain"
270 origin:"vxlan-policy-agent" job:"diego_cell"
This is pretty unfortunate: neither origin nor job is uniquely identifying, so we have to keep both.
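For anyone who wants to double-check the uniqueness claim without eyeballing the listing, here is a minimal sketch. It uses a handful of (origin, job) pairs copied from the output above; the variable names are made up for illustration and nothing here comes from the nozzle code.

```python
from collections import defaultdict

# A few (origin, job) pairs copied from the listing above.
pairs = [
    ("MetronAgent", "diego_cell"),
    ("MetronAgent", "router"),
    ("garden-linux", "diego_cell"),
    ("rep", "diego_cell"),
    ("gorouter", "router"),
]

jobs_by_origin = defaultdict(set)
origins_by_job = defaultdict(set)
for origin, job in pairs:
    jobs_by_origin[origin].add(job)
    origins_by_job[job].add(origin)

# An origin seen under several jobs, or a job hosting several origins,
# cannot identify a time series on its own.
ambiguous_origins = sorted(o for o, jobs in jobs_by_origin.items() if len(jobs) > 1)
ambiguous_jobs = sorted(j for j, origins in origins_by_job.items() if len(origins) > 1)

print("origins seen under more than one job:", ambiguous_origins)
print("jobs hosting more than one origin:", ambiguous_jobs)
```

Both lists come back non-empty even for this small sample, which is why neither field can be dropped.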
Thanks! For mental mapping: origin is the process, job is the name of the VM it runs on. So a diego_cell VM runs a MetronAgent and a garden-linux.
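A rough sketch of how the naming could work under that mapping, assuming the shared runtime metrics are kept in a fixed set. `SHARED_METRICS` and `name_and_labels` are hypothetical names used only for illustration, not the nozzle's actual code, and whether non-shared metrics should also carry an `origin` label is left open.

```python
# Hypothetical: metrics assumed to mean the same thing for every process.
SHARED_METRICS = {"numCPUS", "memoryStats.numMallocs"}


def name_and_labels(origin, job, name):
    """Return (metric name, labels) for one firehose value metric."""
    labels = {"job": job}
    if name in SHARED_METRICS:
        # Deduplicated metric: origin moves out of the name and into a label.
        labels["origin"] = origin
        return name, labels
    # Every other metric keeps its origin-prefixed name.
    return f"{origin}.{name}", labels


print(name_and_labels("garden-linux", "diego_cell", "numCPUS"))
print(name_and_labels("MetronAgent", "diego_cell", "numCPUS"))
```

With this shape, a MetronAgent and a garden-linux on the same diego_cell VM report the same `numCPUS` metric and are told apart by the `origin` label.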
It looks good to me. Be wary of running over the label size. Perhaps consider a combination of job/origin?
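If a combined job/origin label is the way to go, it might look roughly like the sketch below. The separator and the length limit are assumptions made up for illustration; the real Stackdriver limits would need to be checked before relying on any number here.

```python
# Assumed limit, for illustration only; not taken from the Stackdriver docs.
MAX_LABEL_VALUE_LEN = 1024


def combined_label(job, origin, max_len=MAX_LABEL_VALUE_LEN):
    """Join job and origin into one label value, refusing oversized results."""
    value = f"{job}/{origin}"
    if len(value) > max_len:
        raise ValueError(f"label value is {len(value)} chars, limit is {max_len}")
    return value


print(combined_label("diego_cell", "garden-linux"))
```

Failing loudly on an oversized value is one option; truncating is another, but truncation could make two different job/origin pairs collide.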
thank you @knyar !
Stackdriver has a limit of 500 custom metrics per project, and the latest build from the `develop` branch already attempts to create more. As a result, SD API requests fail with the following error message:

Note, #139 increased the number of metrics by prepending origin to the metric name. While it's the right thing to do in general, there are several metrics that seem to be created for multiple processes and seem to mean the same thing for all of them:
In our test PCF instance the 8 metrics listed above repeat 26 times each, so deduplicating them (by not prepending `origin` to the metric name) will decrease the total number of metrics by 182. This seems like a quick, easy win, but I suspect in the future we might also want to add a metric blacklist/whitelist to give users better control of the number of metrics created by the nozzle.

@johnsonj, what do you think?
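As a starting point for the blacklist/whitelist idea, here is a minimal sketch. It assumes both lists hold regular expressions matched against the metric name, that the blacklist wins over the whitelist, and that an empty whitelist allows everything. `should_emit` and those semantics are hypothetical, not an existing nozzle option.

```python
import re


def should_emit(name, whitelist=(), blacklist=()):
    """Decide whether a metric name passes the (hypothetical) filters."""
    # A blacklist match always drops the metric.
    if any(re.search(pattern, name) for pattern in blacklist):
        return False
    # With no whitelist configured, everything else is allowed.
    if not whitelist:
        return True
    return any(re.search(pattern, name) for pattern in whitelist)


print(should_emit("gorouter.numCPUS", blacklist=[r"\.numCPUS$"]))
print(should_emit("numCPUS", whitelist=[r"^numCPUS$"]))
```

Filtering by name before a metric descriptor is created would keep unwanted metrics from counting against the per-project limit.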