canonical / prometheus-k8s-operator

https://charmhub.io/prometheus-k8s
Apache License 2.0

Fix label injection in prometheus_remote_write #322

Closed: rbarry82 closed this 2 years ago

rbarry82 commented 2 years ago

Issue

In moving to observability_libs.juju_topology in #297, topology.render was also removed. The code was stubbed out with CosTool.inject_label_matchers, but since Provider and Consumer are inverted in prometheus_remote_write, the label injection never actually happened, and the existing integration/unit tests did not catch it.

Solution

Use the new JujuTopology for the identifier, and add label injection to the "Provider" (Prometheus) side of the library where we know it will be present. Add an integration test since this cannot be reliably detected in unit tests.
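
For readers unfamiliar with what "label injection" means here: the library delegates it to cos-tool (CosTool.inject_label_matchers), which parses the PromQL expression properly. The helper below is only a naive, hypothetical sketch of the idea; the function name and the simple string handling are illustrative, not the library's actual implementation.

```python
# Hypothetical sketch: add Juju topology label matchers to an alert rule
# expression so the rule only matches metrics coming from that charm.
# Only handles a bare metric selector or a single existing {...} selector.
def inject_topology_matchers(expr: str, topology_labels: dict) -> str:
    matchers = ",".join(f'{k}="{v}"' for k, v in sorted(topology_labels.items()))
    if "{" in expr:
        # Prepend matchers inside the first existing label selector.
        return expr.replace("{", "{" + matchers + ",", 1)
    # No selector yet: attach one right after the metric name.
    metric, _, rest = expr.partition(" ")
    return f"{metric}{{{matchers}}} {rest}".rstrip()


labels = {
    "juju_model": "alerts",
    "juju_model_uuid": "937d05a2-9412-46af-829b-c99c43fee856",
    "juju_application": "tester",
}
print(inject_topology_matchers("process_cpu_seconds_total > 0.12", labels))
# process_cpu_seconds_total{juju_application="tester",juju_model="alerts",...} > 0.12
```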

Release Notes

Fix label injection in prometheus_remote_write

github-actions[bot] commented 2 years ago

Libraries are not up to date with their remote counterparts. If this was not intentional, run charmcraft fetch-libs and commit the updated libs to your PR branch.

stdout
```
Library charms.alertmanager_k8s.v0.alertmanager_dispatch was already up to date in version 0.4.
Library charms.grafana_k8s.v0.grafana_dashboard was already up to date in version 0.12.
Library charms.grafana_k8s.v0.grafana_source was already up to date in version 0.10.
Library charms.observability_libs.v0.juju_topology was already up to date in version 0.2.
Library charms.observability_libs.v0.kubernetes_service_patch was already up to date in version 0.6.
Library charms.prometheus_k8s.v0.prometheus_remote_write has local changes, cannot be updated.
Library charms.prometheus_k8s.v0.prometheus_scrape has local changes, cannot be updated.
Library charms.traefik_k8s.v0.ingress_per_unit has local changes, cannot be updated.
```
stderr
```
```
rbarry82 commented 2 years ago

Bumped LIBPATCH to prep for Grafana agent, but... this also handles deduplication as a bonus and that's scary.

It "handles" it because with a shortened topology identifier so remote write matches up with direct relations, it just clobbers the file here. It's technically correct, which is the worst sort of correct, and it smells like a bug waiting to happen:

```python
local = {
    'alerts_937d05a2_tester': {
        'groups': [{
            'name': 'alerts_937d05a2_tester_cpu_overuse_alerts',
            'rules': [{
                'alert': 'CPUOverUse',
                'annotations': {
                    'description': '{{$labels.instance}} of job{ $labels.job }} has used too much CPU.',
                    'summary': 'Instance {{ $labels.instance }} CPU overuse',
                },
                'expr': 'process_cpu_seconds_total{juju_application="tester",juju_charm="prometheus-tester",juju_model="alerts",juju_model_uuid="937d05a2-9412-46af-829b-c99c43fee856"} > 0.12',
                'for': '0m',
                'labels': {
                    'juju_application': 'tester',
                    'juju_charm': 'prometheus-tester',
                    'juju_model': 'alerts',
                    'juju_model_uuid': '937d05a2-9412-46af-829b-c99c43fee856',
                    'severity': 'Low',
                },
            }],
        }],
    },
}

remote = {
    'alerts_937d05a2_tester': {
        'groups': [{
            'name': 'alerts_937d05a2-9412-46af-829b-c99c43fee856_grafana-agent_grafana-agent-k8s_alerts_937d05a2_tester_cpu_overuse_alerts_alerts',
            'rules': [{
                'alert': 'CPUOverUse',
                'annotations': {
                    'description': '{{$labels.instance}} of job{ $labels.job }} has used too much CPU.',
                    'summary': 'Instance {{ $labels.instance }} CPU overuse',
                },
                'expr': 'process_cpu_seconds_total{juju_application="tester",juju_charm="prometheus-tester",juju_model="alerts",juju_model_uuid="937d05a2-9412-46af-829b-c99c43fee856"} > 0.12',
                'for': '0m',
                'labels': {
                    'juju_application': 'tester',
                    'juju_charm': 'prometheus-tester',
                    'juju_model': 'alerts',
                    'juju_model_uuid': '937d05a2-9412-46af-829b-c99c43fee856',
                    'severity': 'Low',
                },
            }],
        }],
    },
}
```

Are these the same? No, they are not, and they won't be until grafana-agent picks up the new remote write library, which uses the "short" topology identifier: {identifier}["groups"][0]["name"] differs. Should the filename/identifier be the same then? Probably not.
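
Spelled out against the two dicts pasted above, just to make the divergence explicit:

```python
# Same top-level identifier, and therefore the same file name on disk...
assert set(local) == set(remote) == {"alerts_937d05a2_tester"}

# ...but the group names differ, so the payloads are not equal and one
# write would silently clobber the other.
assert (local["alerts_937d05a2_tester"]["groups"][0]["name"]
        != remote["alerts_937d05a2_tester"]["groups"][0]["name"])
assert local != remote
```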

We should dedupe, but we should also think through a pattern for not accidentally clobbering files (one possible shape is sketched below). I can't think of a set of relations right now which would lead to {juju_model}_{juju_model_uuid_short}_{juju_application} as a filename with different results, but I'm sure someone smarter than me can contrive a topology where a "direct" relation and an indirect one (cos-config -> prometheus-scrape-* -> grafana-agent -> prometheus, maybe) yield different results under the same identifier.
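
For the record, one purely hypothetical shape such a guard could take. Nothing here is the library's current API; write_rules and the file layout are made up for illustration: merge rule groups under the same identifier instead of overwriting, and let identical groups collapse into one entry.

```python
import json
from pathlib import Path


def write_rules(rules_dir: Path, identifier: str, new_groups: list) -> None:
    """Merge rule groups into <identifier>.rules instead of overwriting it."""
    path = rules_dir / f"juju_{identifier}.rules"
    existing = json.loads(path.read_text())["groups"] if path.exists() else []
    # Key each group by its canonical JSON so byte-for-byte identical groups
    # arriving via a "direct" relation and an indirect one dedupe to a single
    # entry, while differing groups with the same identifier are kept side by
    # side rather than clobbered.
    merged = {json.dumps(g, sort_keys=True): g for g in existing + new_groups}
    path.write_text(json.dumps({"groups": list(merged.values())}, indent=2))
```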

I want to open a new issue to fix that along with the dedupe, but I cannot think of a way to make it happen, so maybe it's just paranoia and this is actually ok. @balbirthomas @sed-i @dstathis @simskij can one of you contrive a way to bust it?