cloudfoundry / prometheus-boshrelease

bosh release for prometheus ecosystem
Apache License 2.0

App Dashboards not working #110

Closed: benjaminguttmann-avtq closed this issue 7 years ago

benjaminguttmann-avtq commented 7 years ago

Hi there,

I think this is more of a general question. I am currently facing an issue where the app dashboards are not working for me. I am testing Prometheus at the moment with a PCF 1.9 installation. Maybe a certain version of the CF release is needed to gather the information the dashboards use?

When I search Prometheus for the metrics used by the app dashboards, it either says 'no data' or I cannot find the metric at all.

Thanks for any clarification.

mkuratczyk commented 7 years ago

Hi,

1.9 should work fine, I think. Check the logs of all the exporters, and of cf_exporter in particular (/var/vcap/sys/log/_exporter/ on the Prometheus VM).

benjaminguttmann-avtq commented 7 years ago

Hi @mkuratczyk ,

I changed the log_level to debug, and now I can see the following error messages:

time="2017-08-17T08:37:59Z" level=error msg="Error while listing services: Error requesting services: Unable to decode body: invalid character '<' looking for beginning of value" source="services_collector.go:142"

time="2017-08-17T13:23:59Z" level=error msg="Error while listing security groups: Error requesting sec groups: Unable to decode body: json: cannot unmarshal number into Go value of type cfclient.CloudFoundryError" source="security_groups_collector.go:142"
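Both messages come from Go's encoding/json package and describe the shape of the response body the collectors tried to decode: a leading '<' means the body was not JSON at all (most likely an HTML or XML page), and "cannot unmarshal number into Go value of type ..." means a bare JSON number arrived where an object was expected. A minimal Go sketch that reproduces both messages; the CloudFoundryError struct below is a hypothetical stand-in, not go-cfclient's actual type, and the sample bodies are made up:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CloudFoundryError is a hypothetical stand-in for the error object a client
// might decode Cloud Controller error bodies into. It is NOT go-cfclient's
// actual definition.
type CloudFoundryError struct {
	Code        int    `json:"code"`
	ErrorCode   string `json:"error_code"`
	Description string `json:"description"`
}

// decodeErr returns the message produced when body is decoded into a
// CloudFoundryError, or "" if decoding succeeds.
func decodeErr(body string) string {
	var cfErr CloudFoundryError
	if err := json.Unmarshal([]byte(body), &cfErr); err != nil {
		return err.Error()
	}
	return ""
}

func main() {
	// Made-up HTML page (e.g. from a proxy) where JSON was expected.
	fmt.Println(decodeErr("<html><body>502 Bad Gateway</body></html>"))
	// Made-up bare JSON number where an object was expected.
	fmt.Println(decodeErr("404"))
	// A well-formed error object decodes without error, so this prints true.
	fmt.Println(decodeErr(`{"code":10000,"error_code":"CF-NotFound","description":"Unknown request"}`) == "")
}
```

With the struct declared in package main, the second line reads "json: cannot unmarshal number into Go value of type main.CloudFoundryError", which has the same form as the security groups log entry.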

Any idea what is wrong? I appreciate your help.

mkuratczyk commented 7 years ago

This looks like a bug in the cf_exporter's Security Groups Collector (https://github.com/cloudfoundry-community/cf_exporter/blob/master/collectors/security_groups_collector.go).

@frodenas have you seen something like this before?

frodenas commented 7 years ago

@benjaminguttmann-avtq What bosh release version are you using? cf_exporter had a bug when parsing security groups, but it was fixed in v17.6.0.

benjaminguttmann-avtq commented 7 years ago

@frodenas According to the Pivotal release notes, I am currently using BOSH Director version 260.6 and Prometheus version v18.0.0.

frodenas commented 7 years ago

@benjaminguttmann-avtq thanks. There is probably a security group with some special configuration that the go-cfclient library is unable to parse. Can you please send me the rules of your security groups so I can reproduce it in my environment? If you don't want to paste them here, you can send them via email to my gmail address at frodenas. Thanks.

benjaminguttmann-avtq commented 7 years ago

This is the output of cf curl /v2/security_groups?inline-relations-depth=1:

    {
        "total_results": 1,
        "total_pages": 1,
        "prev_url": null,
        "next_url": null,
        "resources": [
            {
                "metadata": {
                    "guid": "c7e2d7f1-234b-44bb-9e83-fd2372ea0379",
                    "url": "/v2/security_groups/c7e2d7f1-234b-44bb-9e83-fd2372ea0379",
                    "created_at": "2017-05-18T17:01:54Z",
                    "updated_at": "2017-05-18T17:01:54Z"
                },
                "entity": {
                    "name": "default_security_group",
                    "rules": [
                        {
                            "protocol": "all",
                            "destination": "0.0.0.0-169.253.255.255"
                        },
                        {
                            "protocol": "all",
                            "destination": "169.255.0.0-255.255.255.255"
                        }
                    ],
                    "running_default": true,
                    "staging_default": true,
                    "spaces_url": "/v2/security_groups/c7e2d7f1-234b-44bb-9e83-fd2372ea0379/spaces",
                    "spaces": []
                }
            }
        ]
    }

These are the security group's rules:

    [
        {
            "destination": "0.0.0.0-169.253.255.255",
            "protocol": "all"
        },
        {
            "destination": "169.255.0.0-255.255.255.255",
            "protocol": "all"
        }
    ]
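The default rules above are plain JSON and should decode without trouble. A minimal Go sketch that checks this, assuming hypothetical struct types that declare only the fields used here (they are not go-cfclient's types; encoding/json simply ignores the undeclared fields) and an abridged copy of the cf curl output posted in this thread:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rule is a hypothetical, minimal type for one security group rule.
type rule struct {
	Protocol    string `json:"protocol"`
	Destination string `json:"destination"`
}

// securityGroupsPage is a hypothetical, minimal type for a
// /v2/security_groups response page; other fields are ignored on decode.
type securityGroupsPage struct {
	TotalResults int `json:"total_results"`
	Resources    []struct {
		Entity struct {
			Name  string `json:"name"`
			Rules []rule `json:"rules"`
		} `json:"entity"`
	} `json:"resources"`
}

// response is an abridged copy of the cf curl output from this thread.
const response = `{
  "total_results": 1,
  "resources": [{
    "entity": {
      "name": "default_security_group",
      "rules": [
        {"protocol": "all", "destination": "0.0.0.0-169.253.255.255"},
        {"protocol": "all", "destination": "169.255.0.0-255.255.255.255"}
      ]
    }
  }]
}`

// parseGroups decodes a response page into the minimal types above.
func parseGroups(body string) (securityGroupsPage, error) {
	var page securityGroupsPage
	err := json.Unmarshal([]byte(body), &page)
	return page, err
}

func main() {
	page, err := parseGroups(response)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, r := range page.Resources {
		fmt.Printf("%s: %d rules\n", r.Entity.Name, len(r.Entity.Rules))
	}
}
```

If this decodes cleanly, the rules themselves are an unlikely cause of the "cannot unmarshal number" message; that is an inference, not something confirmed in this thread.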

It is just the default security group. When I run go-cfclient or cf_exporter on its own from my local machine, there aren't any errors, or rather I can't see any. Even with log_level debug, cf_exporter only shows these messages:

    INFO[0000] Starting cf_exporter (version=, branch=, revision=)  source="cf_exporter.go:189"
    INFO[0000] Build context (go=go1.8.3, user=, date=)      source="cf_exporter.go:190"
    INFO[0000] Listening on :9193                            source="cf_exporter.go:278"

I'm not even sure if it is doing anything :-/

frodenas commented 7 years ago

cf_exporter only queries the Cloud Controller (CC) when you ask for metrics, so in another terminal window you should run curl localhost:9193/metrics. Please run that command and see if there's any error message.

benjaminguttmann-avtq commented 7 years ago

@frodenas: Thanks for that explanation, I was not aware of this. I ran the command but there was no error message :-/ Any other suggestions on what to check?

frodenas commented 7 years ago

@benjaminguttmann-avtq I'm a bit confused now, so let's start over. If you go to the Prometheus query console and query for the cf_exporter_application_info metric, does it show anything?

benjaminguttmann-avtq commented 7 years ago

@frodenas: Okay. I am not quite sure why we got this error in cf_exporter, but it seems to work, more or less. For example, I can see the information in the cf: summary dashboard.

But when I check the Apps: summary dashboard, I can see that Grafana uses the following query:

    avg(firehose_container_metric_cpu_percentage{environment=~"$environment",bosh_deployment=~"$bosh_deployment",application_id=~"$cf_application_id"})

I entered the query

    avg(firehose_container_metric_cpu_percentage{environment="PCF19",bosh_deployment="cf",application_id="72397fa7-abeb-41df-83a6-2be7a495108b"})

into the Prometheus query console and got the following result:

    {} | 0.41178078827797354

However, the value is not shown in the Grafana dashboard; there it says "no data points".

benjaminguttmann-avtq commented 7 years ago

Okay, we figured it out. There seems to be a problem when the cf_exporter deployment name is not 'cf'. For example, the 'Apps: System' dashboard uses bosh_deployment as a template variable. In our case we had set the cf_exporter deployment_name to "pcf", which breaks this, because the query always expects "cf". After hard-coding bosh_deployment to cf in the query it worked, but the dashboard dropdown stopped working as soon as we changed the Grafana templating variable 'bosh_deployment' to use the job instead of the deployment. So we assume that we either need both of these variables here, or deployment_name always has to be "cf".
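A sketch of why the panel shows "no data points": Grafana substitutes the selected value of $bosh_deployment into the =~ matcher, and Prometheus anchors regex matchers so the pattern must match the whole label value. If the variable holds "pcf" (presumably taken from the cf_exporter deployment_name) while the firehose series carry bosh_deployment="cf", nothing matches. The Go code below only mimics that anchoring with the standard regexp package; matches is a hypothetical helper, not Prometheus's own matcher code:

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether value satisfies a PromQL-style =~ matcher.
// Prometheus anchors these regexes at both ends, so this sketch wraps the
// pattern in ^(?:...)$ before matching.
func matches(pattern, value string) bool {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	return re.MatchString(value)
}

func main() {
	// Label value on the firehose series (PCF hardcodes the metron deployment to cf).
	const seriesDeployment = "cf"
	// Values the dashboard's $bosh_deployment variable might hold.
	for _, selected := range []string{"pcf", "cf"} {
		fmt.Printf("bosh_deployment=~%q on series with %q: match=%v\n",
			selected, seriesDeployment, matches(selected, seriesDeployment))
	}
}
```

Because of the anchoring, "cf" also does not match a series labelled "pcf"; the selected value has to equal (or regex-match in full) what metron puts on the firehose metrics.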

I think there are two possible solutions:

1. Fix the deployment name in cf_exporter to 'cf'.
2. Add a new variable to the templating section of the CF dashboards and edit the dashboards' firehose queries.

What do you think is the best way to go here? Maybe it is also possible to make the firehose deployment name editable. If it is okay with you, I will provide a PR for the preferred solution.

frodenas commented 7 years ago

OK, glad to hear you found the issue.

The deployment name should NOT be hardcoded to cf, because it's already configurable in the metron agent. In fact, some users tune this name in their deployments, and if you look at cf-deployment you'll see that it uses the system domain. PCF, on the other hand, hardcodes it to cf.

The cf_exporter.cf.deployment_name property at the prometheus deployment MUST match the metron_agent.deployment property of your CF deployment. I'll update the monitor-cf.yml operator file to use the system_domain var instead of hardcoding it to cf.

frodenas commented 7 years ago

Fixed this at https://github.com/cloudfoundry-community/prometheus-boshrelease/commit/0bd45120b02d46f02e2fe93825423eb23ca5d3ef.

Please reopen if you still have any questions.