Deduplicate process-level metrics

knyar commented 6 years ago

Stackdriver has a limit of 500 custom metrics per project, and the latest build from the develop branch already attempts to create more. As a result, SD API requests fail with the following error message:

rpc error: code = ResourceExhausted desc = Your metric descriptor quota has been exhausted

Note, #139 increased the number of metrics by prepending origin to the metric name. While it's the right thing to do in general, there are several metrics that seem to be created for multiple processes and seem to mean the same thing for all of them:

memoryStats.lastGCPauseTimeNS
memoryStats.numBytesAllocated
memoryStats.numBytesAllocatedHeap
memoryStats.numBytesAllocatedStack
memoryStats.numFrees
memoryStats.numMallocs
numCPUS
numGoRoutines

In our test PCF instance, the 8 metrics listed above repeat 26 times each (208 metric descriptors in total), so deduplicating them (by not prepending origin to the metric name) leaves 8 and decreases the total number of metrics by 200. This seems like a quick, easy win, but I suspect that in the future we might also want to add a metric blacklist/whitelist to give users better control over the number of metrics created by the nozzle.
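The naming rule proposed above could look roughly like the sketch below. It is not the nozzle's actual code: `metric_name`, `descriptor_names`, the `.` separator and the `total_requests` metric are all assumptions made for illustration.

```python
# Process-level metric names taken from the list in this issue.
PROCESS_LEVEL_METRICS = frozenset({
    "memoryStats.lastGCPauseTimeNS",
    "memoryStats.numBytesAllocated",
    "memoryStats.numBytesAllocatedHeap",
    "memoryStats.numBytesAllocatedStack",
    "memoryStats.numFrees",
    "memoryStats.numMallocs",
    "numCPUS",
    "numGoRoutines",
})


def metric_name(origin: str, name: str) -> str:
    """Hypothetical helper: skip the origin prefix for process-level metrics."""
    if name in PROCESS_LEVEL_METRICS:
        return name
    # Assumed separator; other metrics keep the origin prefix.
    return f"{origin}.{name}"


def descriptor_names(events):
    """Return the set of distinct descriptor names for (origin, name) pairs."""
    return {metric_name(origin, name) for origin, name in events}


# Three origins report numCPUS; "total_requests" is a made-up metric name.
events = [
    ("gorouter", "numCPUS"),
    ("rep", "numCPUS"),
    ("bbs", "numCPUS"),
    ("gorouter", "total_requests"),
]
names = descriptor_names(events)
```

With this rule the three `numCPUS` events share one descriptor name, while metrics outside the list are still named per origin.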

@johnsonj, what do you think?

johnsonj commented 6 years ago

This is an excellent batch to combine.

I'm thinking of something along the lines of a single processStats metric that uses the current metric name (e.g. memoryStats.numMallocs) as a label. We will need to document this because we are straying from reporting the metrics as they are named in Loggregator. A quick sketch in src/stackdriver-nozzle/docs/metric-naming.md as part of the change?

fluffle commented 6 years ago

We don't want to combine the different names into one metric, we want to drop the origin from the prefix so that e.g. we only have one "numCPUS" metric instead of 26 "foo.numCPUS" metrics.

There's one problem with this: we would need to keep "origin" as a label, because "job" is not unique enough. We can see this by sampling 20k numCPUS metrics from the firehose and counting the distinct origin/job pairs (this took a while, so use a smaller sample if you try it):

$ cf nozzle -n | grep numCPUS | head -20000 | cut -f 1,5 -d\  | sort | uniq -c
    270 origin:"auctioneer" job:"diego_brain"
    270 origin:"bbs" job:"diego_database"
    270 origin:"cc_uploader" job:"diego_brain"
    270 origin:"DopplerServer" job:"doppler"
    270 origin:"etcd" job:"etcd_tls_server"
    270 origin:"file_server" job:"diego_brain"
    270 origin:"garden-linux" job:"diego_cell"
    270 origin:"gorouter" job:"router"
    270 origin:"locket" job:"diego_database"
    270 origin:"LoggregatorTrafficController" job:"loggregator_trafficcontroller"
     60 origin:"MetronAgent" job:"clock_global"
    120 origin:"MetronAgent" job:"cloud_controller"
    120 origin:"MetronAgent" job:"cloud_controller_worker"
    180 origin:"MetronAgent" job:"consul_server"
    180 origin:"MetronAgent" job:"diego_brain"
    180 origin:"MetronAgent" job:"diego_cell"
    180 origin:"MetronAgent" job:"diego_database"
    180 origin:"MetronAgent" job:"doppler"
    180 origin:"MetronAgent" job:"etcd_tls_server"
    180 origin:"MetronAgent" job:"loggregator_trafficcontroller"
    120 origin:"MetronAgent" job:"nats"
     60 origin:"MetronAgent" job:"nfs_server"
    180 origin:"MetronAgent" job:"router"
    180 origin:"MetronAgent" job:"syslog_adapter"
     60 origin:"MetronAgent" job:"syslog_scheduler"
     60 origin:"MetronAgent" job:"tcp_router"
    120 origin:"MetronAgent" job:"uaa"
    270 origin:"netmon" job:"diego_cell"
    270 origin:"nsync_bulker" job:"diego_brain"
    270 origin:"nsync_listener" job:"diego_brain"
    270 origin:"policy-server" job:"diego_database"
    270 origin:"rep" job:"diego_cell"
    270 origin:"route_emitter" job:"diego_cell"
    180 origin:"routing_api" job:"cloud_controller"
    270 origin:"silk-daemon" job:"diego_cell"
    270 origin:"ssh-proxy" job:"diego_brain"
    270 origin:"stager" job:"diego_brain"
    270 origin:"tcp_emitter" job:"diego_brain"
     90 origin:"tcp-router" job:"tcp_router"
    270 origin:"tps_listener" job:"diego_brain"
    270 origin:"tps_watcher" job:"diego_brain"
    270 origin:"vxlan-policy-agent" job:"diego_cell"

This is pretty unfortunate: neither origin nor job is uniquely identifying, so we have to keep both.
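The same conclusion can be checked mechanically. The sketch below assumes the `uniq -c` output format shown above and uses five lines copied from that output; the helper names `parse` and `ambiguous` are made up for this example.

```python
import re
from collections import defaultdict

# A few lines copied from the uniq -c output above.
SAMPLE = """\
    270 origin:"garden-linux" job:"diego_cell"
    270 origin:"rep" job:"diego_cell"
    180 origin:"MetronAgent" job:"diego_cell"
    180 origin:"MetronAgent" job:"router"
    270 origin:"gorouter" job:"router"
"""

LINE_RE = re.compile(r'^\s*(\d+)\s+origin:"([^"]+)"\s+job:"([^"]+)"\s*$')


def parse(sample: str):
    """Extract (origin, job) pairs from uniq -c style lines."""
    pairs = []
    for line in sample.splitlines():
        match = LINE_RE.match(line)
        if match:
            pairs.append((match.group(2), match.group(3)))
    return pairs


def ambiguous(pairs):
    """Return origins seen under several jobs, and jobs hosting several origins."""
    jobs_by_origin = defaultdict(set)
    origins_by_job = defaultdict(set)
    for origin, job in pairs:
        jobs_by_origin[origin].add(job)
        origins_by_job[job].add(origin)
    multi_job_origins = {o: sorted(j) for o, j in jobs_by_origin.items() if len(j) > 1}
    multi_origin_jobs = {j: sorted(o) for j, o in origins_by_job.items() if len(o) > 1}
    return multi_job_origins, multi_origin_jobs


multi_job_origins, multi_origin_jobs = ambiguous(parse(SAMPLE))
```

Both result dictionaries are non-empty for this sample: MetronAgent appears under more than one job, and diego_cell hosts more than one origin, so dropping either label would merge distinct time series.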

johnsonj commented 6 years ago

Thanks! For mental mapping: origin is the process, job is the name of the VM it runs on. So a diego_cell VM runs a MetronAgent and a garden-linux.

It looks good to me. Be wary of running over the label size. Perhaps consider a combination of job/origin?
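One way to read the job/origin suggestion is a single combined label that takes up one label slot instead of two. The sketch below assumes that "label size" refers to the number of labels per metric. The `job_origin` key, the `/` separator, the `build_labels` helper and the default limit of 10 are all placeholders, not values taken from the nozzle or confirmed against Stackdriver quotas.

```python
def build_labels(base_labels, job, origin, max_labels=10):
    """Hypothetical helper: add a combined job/origin label and check the count.

    max_labels is an assumed placeholder; check the real Stackdriver quota.
    """
    labels = dict(base_labels)  # copy so the caller's dict is not modified
    labels["job_origin"] = f"{job}/{origin}"
    if len(labels) > max_labels:
        raise ValueError(
            f"{len(labels)} labels exceed the assumed limit of {max_labels}"
        )
    return labels


# "index" stands in for whatever labels the nozzle already attaches.
labels = build_labels({"index": "0"}, job="diego_cell", origin="rep")
```

The combined value still distinguishes every origin/job pair seen in the sample above, at the cost of making it harder to filter by job or origin alone.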

johnsonj commented 6 years ago

thank you @knyar !

cloudfoundry-community / stackdriver-tools

Deduplicate process-level metrics #157