@avasseur-pivotal It'd be interesting to see your BOSH VM configuration (VM CPU, memory, disk), because we never ran into this issue, even when using the default OM director configuration. Bear in mind also that the bosh_exporter FAQ recommends increasing the default scrape interval and the scrape timeout, but again, something is really wrong with that BOSH director when it takes so much time to fetch the VM states. It might happen that some VM agents are unresponsive, but that should not take longer than 45 seconds. Why so many tasks are queued is also a mystery; I'll need to dig into the bosh logs to see what might be happening.
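For reference, a minimal sketch of the relaxed scrape settings the FAQ suggests, in prometheus.yml form (the job name, target address, and exact timings here are illustrative assumptions; 9190 is the bosh_exporter's default listen port):

```yaml
# prometheus.yml fragment -- relaxed timings for the bosh_exporter job.
# The values are illustrative; tune them to how long your director takes.
scrape_configs:
  - job_name: bosh_exporter
    scrape_interval: 2m       # well above the 1m global default
    scrape_timeout: 1m45s     # must not exceed scrape_interval
    static_configs:
      - targets: ['bosh-exporter.example.internal:9190']  # assumed address
```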
This release doesn't set any exporter by default; it's up to you to decide which exporter you want to use. Although the boshhmforwarder is an option (outlined in the bosh_exporter FAQ), I'm reluctant to use it on my deployments because you then have a dependency on the CF Firehose. What we have used in some deployments is the graphite_exporter together with the BOSH Graphite Health Monitor plugin, but I understand that you cannot use this configuration when using OM (because there is no option to enable that plugin).
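For illustration, enabling that plugin in a hand-rolled director manifest looks roughly like this (a sketch: the property names follow the director job spec, the address is an assumption, and 9109 is the graphite_exporter's default Graphite ingest port):

```yaml
# Director manifest fragment -- forward Health Monitor metrics to a
# graphite_exporter over the Graphite plaintext protocol.
properties:
  hm:
    graphite_enabled: true
    graphite:
      address: 10.0.0.11   # graphite_exporter VM IP (assumed)
      port: 9109           # graphite_exporter default Graphite port
```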
Also, I want to point out that I've never been happy with the bosh_exporter querying BOSH and creating so many tasks, but it was the only way to fetch the VM IP addresses in order to use the service discovery. Some recent additions (https://github.com/cloudfoundry/bosh/commit/6a70432f66440bb34015b14639a117a24557fe0b) to bosh have made this task more feasible, so I plan to refactor the bosh_exporter to use a different mechanism to gather both the VM IP addresses and the metrics.
Hi @frodenas,
The problem that @avasseur-pivotal describes is the same one I explained to you some time ago by email.
When I checked, I saw that one of the BOSH worker processes was stuck and had stopped working on the queue. Tasks were just piling up and the CPU was pegged. Even after Prometheus was stopped, I had to restart all the BOSH processes to recover.
I increased the scrape interval to 5m, even 10m, to work around the issue.
I am working on the TSDB collector, which I still need to test out, but I guess it should be possible to do the same as with the graphite one.
@avasseur-pivotal @shinji62 Have you guys opened an issue at the bosh repo? I'd be interested to know why the worker processes consume so much CPU and eventually get stuck.
Just a quick update. The plan here is to use:
1) the bosh_tsdb_exporter to gather VM metrics. This exporter will receive metrics from the BOSH OpenTSDB Health Monitor plugin and will be compatible with OpsManager (which only allows you to configure this plugin). This exporter doesn't hit the BOSH API (and will not generate a task), so the problem stated in this issue will be mitigated (see the plugin sketch after this list).
2) the bosh_exporter, which will be responsible only for gathering administrative info (like the releases, stemcells, ... being used) and the VMs' IPs (but this will require a director version >= 261). To get this info, the exporter will not need to generate a task, so this will also help to mitigate the problem stated in this issue.
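For anyone who wants to try 1) once it lands, a hedged sketch of the OpenTSDB Health Monitor plugin configuration in the director manifest (the property names follow the director job spec; the exporter address and port are assumptions, so check the bosh_tsdb_exporter docs for its actual ingest port):

```yaml
# Director manifest fragment -- push Health Monitor metrics over the
# OpenTSDB protocol to the bosh_tsdb_exporter instead of polling the API.
properties:
  hm:
    tsdb_enabled: true
    tsdb:
      address: 10.0.0.12   # bosh_tsdb_exporter VM IP (assumed)
      port: 4242           # standard OpenTSDB ingest port (assumed here)
```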
I'm still finishing 2) and doing some tests. After that, I'll update this release and the associated dashboards and alerts.
@avasseur-pivotal just checking in to see if this is still happening; thanks
I'll throw it out there that I have a customer experiencing this issue. I can gather more information if needed. We've gone with the workaround of deploying the TSDB exporter, but they would like to collect the administrative info if the bosh_exporter is refactored.
Refactoring the exporter is not currently possible. Gathering info about releases and stemcells is easy, but gathering processes is still a challenge.
If you're experiencing high CPU load on your BOSH director, can you please open an issue at the bosh repo? It does not make any sense that the director consumes so much CPU just to gather info from the VMs.
Running against a small PCF 1.9, the bosh_exporter scrape interval at 30s really causes BOSH task queueing, as expected, but that impacts the BOSH user experience.
Moving to a scrape interval of 10min changes this fully, but it is likely to impact alerting on the BOSH health metrics. I am planning to change the default to use the BoshHMforwarder. On PCF, the ECS team has made that easy with a tile: http://www.ecsteam.com/deploying-bosh-health-metrics-forwarder-pivotal-cloud-foundry-tile I would think defaulting this bosh release to using the boshhmforwarder (even without a tile, bringing its own as part of this release) would be a wiser choice.
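To make the alerting trade-off concrete, here is a hedged sketch of how an alert window has to be widened at a 10min scrape interval (Prometheus 2.x rule-file format; the alert name and threshold are illustrative, while bosh_job_healthy is a metric the bosh_exporter does expose):

```yaml
# Alerting rule fragment -- with a 10m scrape interval, the 'for' window
# must span at least two scrapes so one missed scrape doesn't flap.
groups:
  - name: bosh
    rules:
      - alert: BoshJobUnhealthy       # illustrative name
        expr: bosh_job_healthy == 0
        for: 25m                      # > 2 x the 10m scrape interval
```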