concourse / hush-house

Concourse k8s-based environment
https://hush-house.pivotal.io

ci: remove datadog; oc-agent scrapes prometheus #134

Closed jamieklassen closed 3 years ago

jamieklassen commented 3 years ago

With datadog removed and the oc-agent scraping prometheus, we should see metrics appearing in wavefront.

jamieklassen commented 3 years ago

There might be a more graceful way to do this: sending metrics to both datadog and wavefront simultaneously for a period of time. Converting to a draft until I can write a better description.

EDIT: before I forget - the general principle was to install an extra opentelemetry contrib collector sidecar into the web pods, built with a small patch to enable the datadog exporter, then have that collector also scrape the prometheus endpoint and forward to datadog, using a config file somewhat like:

receivers:
  prometheus:
    config:
      global:
        scrape_interval: '5s'
        evaluation_interval: '5s'
      scrape_configs:
        - job_name: 'concourse'
          static_configs:
            - targets:
              - 'localhost:9391'
exporters:
  datadog:
    api:
      key: '<datadog api key from a secret... so this file should be generated by an init container>'
    metrics:
      namespace: 'concourse.ci'
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [datadog]

I lightly tested this and the behaviour is not identical to concourse's native datadog integration -- a key difference is that the extra attributes from the CONCOURSE_METRICS_ATTRIBUTE parameter do not appear in the prometheus emitter's output, which ultimately means that the data reaching datadog lacks the environment: label. It may be possible to resolve this using the metrics transform processor available in the contrib collector.
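
For reference, a rough sketch of what that metrics transform processor config might look like, added to the collector config above. This is untested here. It assumes the contrib collector build includes the metricstransform processor and that the datadog exporter turns the added label into a tag; the label value 'hush-house' is only a placeholder for whatever the environment attribute was set to.

```yaml
processors:
  metricstransform:
    transforms:
      # match every metric scraped from the concourse prometheus endpoint
      - include: ^concourse_.*$
        match_type: regexp
        action: update
        operations:
          # re-add the label that the prometheus emitter drops
          - action: add_label
            new_label: environment
            new_value: 'hush-house'  # placeholder value (assumption)
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [metricstransform]
      exporters: [datadog]
```

The processor has to be listed in the pipeline as well as defined, otherwise the collector will not apply it.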

EDIT 2: also @vito pointed out that there may be some non-local effects of having the datadog helm chart installed here -- in particular, it may be responsible for the existence of datadog agents on all the generic-1 nodes, and so hush-house might rely on it in order to export its metrics. So maybe we shouldn't be too hasty about removing the chart.

jamieklassen commented 3 years ago

Interestingly, telegraf has output plugins for both datadog and wavefront, so I think I will experiment with it now.

EDIT: I'm feeling pretty good about using prometheus + telegraf + the datadog output plugin to get data into datadog that is pretty much on par with concourse's native datadog integration. There are two differences: prometheus prefixes all the metric names with concourse_ (since it also exposes non-application-specific meters), and the prometheus input plugin adds suffixes to some metrics (e.g. concourse's internal name is builds started, but prometheus exports it as builds_started_total). So I think we'll need to make some (hopefully not too difficult) changes to https://github.com/concourse/greenpeace/blob/master/terraform/dashboard/main.tf (sorry for the private repo!) in order for our existing datadog dashboards to work.

jamieklassen commented 3 years ago

OK, I took a new approach that I like much better. Not totally sure if it will work, but @xtreme-sameer-vohra, you can sanity check it and then we could try deploying and debugging/fixing things up together.

xtreme-sameer-vohra commented 3 years ago

I haven't tested it, however since reverting is quite straightforward, I am leaning towards being expedient and trying it out on CI as per your suggestion.

xtremerui commented 3 years ago

Closing this, as hush-house metrics are now being forwarded to wavefront: https://vmware.wavefront.com/dashboards/concourse-team-temp