BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (This includes work for: 1. Platform Experience, 2. Developer Experience 3. Platform Operations/OCP 3)
Apache License 2.0

Document how to add Prometheus instrumentation to an app #2925

Closed StevenBarre closed 2 years ago

StevenBarre commented 2 years ago

Describe the issue: Convert my slide deck from the recent Community Meetup into a docs page in beta-docs.

Additional context: See my slide deck.

How does this benefit the users of our platform? Demonstrate how to add instrumentation to apps for custom metrics.

Definition of done: Page published on https://beta-docs.developer.gov.bc.ca/

tmorik commented 2 years ago

Start drafting doc here: https://github.com/bcgov/platform-developer-docs/pull/145

tmorik commented 2 years ago

Checking Get metrics in Sysdig.

tmorik commented 2 years ago

Re: Get metrics in Sysdig: I will summarize the "Sending StatsD Metrics" doc, which has been used for the statsd_MCS_XXXX custom metrics in Sysdig. Those metrics are collected by MCS's Nagios monitoring and sent to Sysdig.

image.png

StevenBarre commented 2 years ago

@ShellyXueHan can you help with getting sysdig to scrape custom prometheus endpoints?

tmorik commented 2 years ago

Thanks to Steven, it looks like we just need to add the annotation below.

prometheus.io/scrape=true

Doc: https://docs.sysdig.com/en/docs/sysdig-monitor/monitoring-integrations/custom-integrations/collect-prometheus-metrics/#agent-compatibility
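To make the opt-in idea concrete, here is a minimal Python sketch of annotation-based discovery. It is not the Sysdig agent's actual logic. The `scrape_url` helper is hypothetical, the `prometheus.io/port`, `prometheus.io/path` and `prometheus.io/scheme` annotations are the conventional companions to `prometheus.io/scrape`, and the default port of 8000 is only an assumption for this example.

```python
from typing import Optional


def scrape_url(annotations: dict[str, str], pod_ip: str) -> Optional[str]:
    """Return the URL a scraper would poll, or None if the pod has not opted in.

    Hypothetical helper for illustration only; the real agent does its own
    service discovery.
    """
    # The pod opts in by carrying prometheus.io/scrape=true.
    if annotations.get("prometheus.io/scrape", "").lower() != "true":
        return None
    # Conventional optional annotations; the defaults here are assumptions.
    scheme = annotations.get("prometheus.io/scheme", "http")
    port = annotations.get("prometheus.io/port", "8000")
    path = annotations.get("prometheus.io/path", "/metrics")
    return f"{scheme}://{pod_ip}:{port}{path}"


# A pod with only the scrape annotation set falls back to the assumed defaults.
print(scrape_url({"prometheus.io/scrape": "true"}, "10.97.13.65"))
# A pod without the annotation is ignored.
print(scrape_url({}, "10.97.13.65"))
```
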

tmorik commented 2 years ago

... and the Sysdig agent already has the Prometheus settings in its configmap:

From CCM's template cm-sysdig-agent.yaml.j2

<...>
    ### Prometheus
    # map the scraped metrics to the application container instead of the sysdig agent container
    promscrape_fastproto: true
    prometheus:
      enabled: true
      prom_service_discovery: true
      interval: 30
      log_errors: true
      # max_metrics: 3000 (default set to 8000)
      histograms: false

ShellyXueHan commented 2 years ago

@tmorik were you able to get the metrics to show up in Sysdig? Anything I can still help with?

tmorik commented 2 years ago

@ShellyXueHan, yes, I got the metrics in Sysdig, as shown below.

Using our openshift-bcgov-perfmon namespace, I added the prometheus.io/scrape: true annotation to the pod. Sysdig then started scraping the metrics that the pod is collecting.

KLAB/openshift-bcgov-perfmon ~ $ oc get pods -o wide
NAME                       READY   STATUS    RESTARTS   AGE   IP            NODE                  NOMINATED NODE   READINESS GATES
perfmon-5576c95b44-6nd5j   1/1     Running   0          97m   10.97.13.65   mcs-klab-app-03.dmz   <none>           <none>

KLAB/openshift-bcgov-perfmon ~ $ oc rsh perfmon-5576c95b44-6nd5j
(app-root) sh-4.4$ curl http://localhost:8000/metrics
# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 66.0
<...>
response_size_bytes{metric="REQUEST_SIZE",url="https://nginx-openshift-bcgov-nagios.apps.clab.devops.gov.bc.ca/test.txt"} 270.0
response_size_bytes{metric="SIZE_DOWNLOAD_T",url="https://nginx-openshift-bcgov-nagios.apps.clab.devops.gov.bc.ca/test.txt"} 5.24288e+06
# HELP response_count_total Response by code
# TYPE response_count_total counter
response_count_total{code="200",url="http://nginx-openshift-bcgov-nagios.apps.klab.devops.gov.bc.ca/"} 192.0
response_count_total{code="200",url="http://nginx-openshift-bcgov-nagios.apps.clab.devops.gov.bc.ca/"} 192.0
response_count_total{code="200",url="https://nginx-openshift-bcgov-nagios.apps.klab.devops.gov.bc.ca/"} 192.0
response_count_total{code="200",url="https://nginx-openshift-bcgov-nagios.apps.clab.devops.gov.bc.ca/"} 192.0
response_count_total{code="200",url="https://status.developer.gov.bc.ca/"} 192.0
response_count_total{code="200",url="http://nginx-openshift-bcgov-nagios.apps.klab.devops.gov.bc.ca/test.txt"} 192.0
response_count_total{code="200",url="http://nginx-openshift-bcgov-nagios.apps.clab.devops.gov.bc.ca/test.txt"} 192.0
response_count_total{code="200",url="https://nginx-openshift-bcgov-nagios.apps.klab.devops.gov.bc.ca/test.txt"} 192.0
response_count_total{code="200",url="https://nginx-openshift-bcgov-nagios.apps.clab.devops.gov.bc.ca/test.txt"} 192.0
# HELP response_count_created Response by code
# TYPE response_count_created gauge
response_count_created{code="200",url="http://nginx-openshift-bcgov-nagios.apps.klab.devops.gov.bc.ca/"} 1.6678475583501863e+09
response_count_created{code="200",url="http://nginx-openshift-bcgov-nagios.apps.clab.devops.gov.bc.ca/"} 1.6678475583546753e+09
<...>

In the Sysdig web console:

image.png

So, I think that annotation is working as described in our clusters!
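For readers who want to see what sits behind a `/metrics` endpoint like the one above, here is a minimal, standard-library-only Python sketch that serves one counter in the Prometheus text exposition format. It is not the perfmon app's code. A real app would normally use a Prometheus client library, and the `observe`, `render` and `serve` names are hypothetical. The metric name and labels mirror the output above, and port 8000 matches the `curl` call.

```python
from collections import defaultdict
from http.server import BaseHTTPRequestHandler, HTTPServer

# (code, url) -> count; stands in for a client library's labelled Counter.
response_count = defaultdict(float)


def observe(code: str, url: str) -> None:
    """Record one response for the given status code and URL."""
    response_count[(code, url)] += 1


def render() -> str:
    """Render the counter in the Prometheus text exposition format.

    Label values are not escaped here; a real implementation must escape
    backslashes, double quotes and newlines.
    """
    lines = [
        "# HELP response_count_total Response by code",
        "# TYPE response_count_total counter",
    ]
    for (code, url), value in sorted(response_count.items()):
        lines.append(f'response_count_total{{code="{code}",url="{url}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, serving /metrics on the given port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `observe("200", "https://status.developer.gov.bc.ca/")` from the app's request path and then `serve()` would expose the counter so that a scraper honouring the annotation can pick it up.
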

tmorik commented 2 years ago

I would like to know if it's possible to set up alerts based on these metrics using Sysdig, for example: if response_count_created goes above X, send a warning alert to hoge@blah.com.
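As a rough illustration of the kind of threshold rule meant here, the sketch below parses scraped text like the `curl` output above and lists the samples of one metric that exceed a limit. It is not how Sysdig evaluates alerts. The `parse_samples` and `breaches` helpers are hypothetical, the parser is deliberately simplified (it skips lines with a trailing timestamp and does not handle escaped quotes), and the sample values are made up.

```python
import re

# metric_name{label="value",...} value   (labels are optional)
SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line; skip comments and junk."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE.match(line)
        if not match:
            continue
        try:
            value = float(match["value"])
        except ValueError:
            continue
        yield match["name"], dict(LABEL.findall(match["labels"] or "")), value


def breaches(text, metric, threshold):
    """Return (labels, value) pairs of `metric` whose value exceeds `threshold`."""
    return [
        (labels, value)
        for name, labels, value in parse_samples(text)
        if name == metric and value > threshold
    ]


scraped = """\
# TYPE response_count_total counter
response_count_total{code="200",url="https://status.developer.gov.bc.ca/"} 192.0
response_count_total{code="500",url="https://status.developer.gov.bc.ca/"} 3.0
"""
for labels, value in breaches(scraped, "response_count_total", 100):
    print(f"WARNING: {labels} is at {value}")
```
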

ShellyXueHan commented 2 years ago

Totally doable! You'll need to create a dashboard with the metrics, then you can set up an alert on it. Since this is for our team, I'd recommend using the Platform Experience Sysdig team to create the dashboard.

Here are more details on how-to:

tmorik commented 2 years ago

Great! Thank you! I will try those and add some notes about that.

tmorik commented 2 years ago

A Sysdig notification has been set up and is being tested at https://app.sysdigcloud.com/#/alerts/rules?alertId=12883565&direction=asc&sortBy=name

Next, I will look into "Granting users permission to monitor user-defined projects".

tmorik commented 2 years ago

In OCP 4.10, alert routing for user-defined projects is still a Technology Preview (TP).

In OCP 4.11, it is no longer a TP.

It is possible to set up Alertmanager rules for user-defined projects, so that users granted the monitoring-rules-edit role can create, modify, and delete PrometheusRule custom resources for their project. They can then see alerts in the OpenShift web console, as we (cluster-admins) do.

However, it's still a TP in OCP 4.10, and users can already set up Sysdig alerts for their pods easily, so this is probably not necessary.
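For reference, a PrometheusRule of the kind mentioned above might look roughly like the manifest built by this Python sketch. The resource name, alert name, threshold, `for` duration and severity label are all hypothetical; only the `monitoring.coreos.com/v1` / `PrometheusRule` shape and the namespace from earlier in this thread are taken as given.

```python
import json


def prometheus_rule(namespace: str, metric: str, threshold: float) -> dict:
    """Build a PrometheusRule manifest that fires when `metric` exceeds `threshold`.

    All names and values below are placeholders for illustration.
    """
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": "example-response-alert", "namespace": namespace},
        "spec": {
            "groups": [
                {
                    "name": "example",
                    "rules": [
                        {
                            "alert": "ResponseCountHigh",
                            "expr": f"{metric} > {threshold}",
                            "for": "5m",
                            "labels": {"severity": "warning"},
                            "annotations": {
                                "summary": f"{metric} is above {threshold}"
                            },
                        }
                    ],
                }
            ]
        },
    }


rule = prometheus_rule("openshift-bcgov-perfmon", "response_count_total", 1000)
print(json.dumps(rule, indent=2))
```

The JSON output could be saved to a file and applied by a user holding the monitoring-rules-edit role in that namespace.
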

StevenBarre commented 2 years ago

Agreed, not necessary while in TP. I think it would be good to have options once we get to 4.11, but we can revisit documenting alert routing in Feb.

tmorik commented 2 years ago

The doc PR (https://github.com/bcgov/platform-developer-docs/pull/145#pullrequestreview-1181226609) has been merged. I will close this ticket.