@avasseur-pivotal It'd be interesting to see your BOSH VM configuration (VM CPU, memory, disk), because we never ran into this issue, even when using the default OM director configuration. Bear in mind also that the bosh_exporter FAQ recommends increasing the default scrape interval and the scrape timeout, but again, something is really wrong with that BOSH director when it takes so much time to fetch the VM states. It might happen that some VM agents are unresponsive, but that should not take longer than 45 seconds. Why so many tasks are queued is also a mystery; I'll need to dig into the bosh logs to see what might be happening.
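For reference, a minimal sketch of the relaxed scrape settings the FAQ suggests, in prometheus.yml form (the job name, target address, and exact timings here are illustrative assumptions; 9190 is the bosh_exporter's default listen port):

```yaml
# prometheus.yml fragment -- relaxed timings for the bosh_exporter job.
# The values are illustrative; tune them to how long your director takes.
scrape_configs:
  - job_name: bosh_exporter
    scrape_interval: 2m       # well above the 1m global default
    scrape_timeout: 1m45s     # must not exceed scrape_interval
    static_configs:
      - targets: ['bosh-exporter.example.internal:9190']  # assumed address
```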
This release doesn't set any exporter by default; it's up to you to decide which exporter you want to use. Although the boshhmforwarder is an option (outlined in the bosh_exporter FAQ), I'm reluctant to use it on my deployments because you then have a dependency on the CF Firehose. What we have used in some deployments is the graphite_exporter together with the BOSH Graphite Health Monitor plugin, but I understand that you cannot use this configuration when using OM (because there is no option to enable that plugin).
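For illustration, enabling that plugin in a hand-rolled director manifest looks roughly like this (a sketch: the property names follow the director job spec, the address is an assumption, and 9109 is the graphite_exporter's default Graphite ingest port):

```yaml
# Director manifest fragment -- forward Health Monitor metrics to a
# graphite_exporter over the Graphite plaintext protocol.
properties:
  hm:
    graphite_enabled: true
    graphite:
      address: 10.0.0.11   # graphite_exporter VM IP (assumed)
      port: 9109           # graphite_exporter default Graphite port
```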
Also, I want to point out that I've never been happy with the bosh_exporter querying BOSH and creating so many tasks, but it was the only way to fetch the VM IP addresses in order to use the service discovery. Some recent additions (https://github.com/cloudfoundry/bosh/commit/6a70432f66440bb34015b14639a117a24557fe0b) to bosh have made this task more feasible, so I plan to refactor the bosh_exporter to use a different mechanism to gather both the VM IP addresses and the metrics.
Hi @frodenas,
The problem that @avasseur-pivotal describes is the same one I explained to you some time ago by email.
When I checked, I saw that one of the BOSH worker processes was stuck and had stopped working on the queue. Tasks were just piling up and the CPU was pegged. Even after Prometheus was stopped, I had to restart all the BOSH processes to recover.
I increased the scrape interval to 5m, even 10m, to work around the issue.
I am working on the TSDB collector, which I still need to test out, but I guess it should be possible to do the same as with the graphite one.
@avasseur-pivotal @shinji62 Have you guys opened an issue at the bosh repo? I'd be interested to know why the worker processes consume so much CPU and eventually get stuck.
Just a quick update. The plan here is to use:
1) the bosh_tsdb_exporter to gather VM metrics. This exporter will receive metrics from the BOSH OpenTSDB Health Monitor plugin and will be compatible with OpsManager (which only allows you to configure this plugin). This exporter doesn't hit the BOSH API (and will not generate a task), so the problem stated in this issue will be mitigated (see the plugin sketch after this list).
2) the bosh_exporter, which will be responsible only for gathering administrative info (like the releases, stemcells, ... being used) and the VMs' IPs (but this will require a director version >= 261). To get this info, the exporter will not need to generate a task, so this will also help to mitigate the problem stated in this issue.
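For anyone who wants to try 1) once it lands, a hedged sketch of the OpenTSDB Health Monitor plugin configuration in the director manifest (the property names follow the director job spec; the exporter address and port are assumptions, so check the bosh_tsdb_exporter docs for its actual ingest port):

```yaml
# Director manifest fragment -- push Health Monitor metrics over the
# OpenTSDB protocol to the bosh_tsdb_exporter instead of polling the API.
properties:
  hm:
    tsdb_enabled: true
    tsdb:
      address: 10.0.0.12   # bosh_tsdb_exporter VM IP (assumed)
      port: 4242           # standard OpenTSDB ingest port (assumed here)
```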
I'm still finishing 2) and doing some tests. After that, I'll update this release and the associated dashboards and alerts.
@avasseur-pivotal just checking in to see if this is still happening; thanks
I'll throw it out there that I have a customer experiencing this issue. I can gather more information if needed. We've gone with the workaround of deploying the TSDB exporter, but they would like to collect the administrative info if the bosh_exporter is refactored.
Refactoring the exporter is not currently possible. Gathering info about releases and stemcells is easy, but gathering processes is still a challenge.
If you're experiencing high CPU load on your BOSH director, can you please open an issue at the bosh repo? It does not make any sense that the director consumes so much CPU just to gather info from the VMs.
Running against a small PCF 1.9, the bosh_exporter scrape interval at 30s really causes BOSH task queueing, as expected, but that impacts the BOSH user experience.
Moving to a scrape interval of 10min changes this fully, but it is likely to impact alerting on the BOSH health metrics. I am planning to change the default to use the BoshHMforwarder. On PCF, the ECS team has made that easy with a tile: http://www.ecsteam.com/deploying-bosh-health-metrics-forwarder-pivotal-cloud-foundry-tile I would think defaulting this bosh release to using the boshhmforwarder (even without a tile, bringing its own as part of this release) would be a wiser choice.
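To make the alerting trade-off concrete, here is a hedged sketch of how an alert window has to be widened at a 10min scrape interval (Prometheus 2.x rule-file format; the alert name and threshold are illustrative, while bosh_job_healthy is a metric the bosh_exporter does expose):

```yaml
# Alerting rule fragment -- with a 10m scrape interval, the 'for' window
# must span at least two scrapes so one missed scrape doesn't flap.
groups:
  - name: bosh
    rules:
      - alert: BoshJobUnhealthy       # illustrative name
        expr: bosh_job_healthy == 0
        for: 25m                      # > 2 x the 10m scrape interval
```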