canonical / prometheus-k8s-operator

https://charmhub.io/prometheus-k8s
Apache License 2.0

MetricsEndpointProvider units are not always reachable via fqdn #329

Closed sed-i closed 2 years ago

sed-i commented 2 years ago

Bug Description

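#293 replaced network.bind_address() with socket.getfqdn(). This works for in-model relations but breaks cross-cluster relations, because the unit's hostname cannot be resolved from the other cluster (see the "no such host" errors in the log output).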

cc: @mateoflorido @stonepreston

To Reproduce

Form a cross-cluster prometheus_scrape relation between a machine charm and the prometheus-k8s charm.

Environment

We have been trying to integrate the prometheus-k8s charm with the kube-ovn charm, but it seems that the library gathers the instance ID of our charm instead of the IP address of the unit, so metrics are not being collected.

They are in different models. We have a CK cluster that has the kube-ovn charm as its CNI charm.

Kube-OVN is a subordinate machine charm. -- @mateoflorido

Relevant log output

~$ curl 10.152.183.72:9090/api/v1/targets

{
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "discoveredLabels": {
                    "__address__": "juju-865c1a-5:10665",
                    "__metrics_path__": "/metrics",
                    "__scheme__": "http",
                    "job": "juju_test-kovn-prom_e7559de7-acde-415f-84fa-eeff47865c1a_kube-ovn_kube-ovn_prometheus_scrape",
                    "juju_application": "kube-ovn",
                    "juju_charm": "kube-ovn",
                    "juju_model": "test-kovn-prom",
                    "juju_model_uuid": "e7559de7-acde-415f-84fa-eeff47865c1a",
                    "juju_unit": "kube-ovn/4",
                },
                "labels": {
                    "instance": "test-kovn-prom_e7559de7-acde-415f-84fa-eeff47865c1a_kube-ovn_kube-ovn/4",
                    "job": "juju_test-kovn-prom_e7559de7-acde-415f-84fa-eeff47865c1a_kube-ovn_kube-ovn_prometheus_scrape",
                    "juju_application": "kube-ovn",
                    "juju_charm": "kube-ovn",
                    "juju_model": "test-kovn-prom",
                    "juju_model_uuid": "e7559de7-acde-415f-84fa-eeff47865c1a",
                    "juju_unit": "kube-ovn/4",
                },
                "scrapePool": "juju_test-kovn-prom_e7559de7-acde-415f-84fa-eeff47865c1a_kube-ovn_kube-ovn_prometheus_scrape",
                "scrapeUrl": "http://juju-865c1a-5:10665/metrics",
                "globalUrl": "http://juju-865c1a-5:10665/metrics",
                "lastError": 'Get "http://juju-865c1a-5:10665/metrics": dial tcp: lookup juju-865c1a-5 on 10.152.183.213:53: no such host',
                "lastScrape": "2022-07-15T15:27:59.615508445Z",
                "lastScrapeDuration": 0.005049553,
                "health": "down",
            },
            {
                "discoveredLabels": {
                    "__address__": "juju-865c1a-8:10665",
                    "__metrics_path__": "/metrics",
                    "__scheme__": "http",
                    "job": "juju_test-kovn-prom_e7559de7-acde-415f-84fa-eeff47865c1a_kube-ovn_kube-ovn_prometheus_scrape",
                    "juju_application": "kube-ovn",
                    "juju_charm": "kube-ovn",
                    "juju_model": "test-kovn-prom",
                    "juju_model_uuid": "e7559de7-acde-415f-84fa-eeff47865c1a",
                    "juju_unit": "kube-ovn/0",
                },
                "labels": {
                    "instance": "test-kovn-prom_e7559de7-acde-415f-84fa-eeff47865c1a_kube-ovn_kube-ovn/0",
                    "job": "juju_test-kovn-prom_e7559de7-acde-415f-84fa-eeff47865c1a_kube-ovn_kube-ovn_prometheus_scrape",
                    "juju_application": "kube-ovn",
                    "juju_charm": "kube-ovn",
                    "juju_model": "test-kovn-prom",
                    "juju_model_uuid": "e7559de7-acde-415f-84fa-eeff47865c1a",
                    "juju_unit": "kube-ovn/0",
                },
                "scrapePool": "juju_test-kovn-prom_e7559de7-acde-415f-84fa-eeff47865c1a_kube-ovn_kube-ovn_prometheus_scrape",
                "scrapeUrl": "http://juju-865c1a-8:10665/metrics",
                "globalUrl": "http://juju-865c1a-8:10665/metrics",
                "lastError": 'Get "http://juju-865c1a-8:10665/metrics": dial tcp: lookup juju-865c1a-8 on 10.152.183.213:53: no such host',
                "lastScrape": "2022-07-15T15:28:30.262574207Z",
                "lastScrapeDuration": 0.00596523,
                "health": "down",
            },
            {
                "discoveredLabels": {
                    "__address__": "localhost:9090",
                    "__metrics_path__": "/metrics",
                    "__scheme__": "http",
                    "job": "prometheus",
                },
                "labels": {"instance": "localhost:9090", "job": "prometheus"},
                "scrapePool": "prometheus",
                "scrapeUrl": "http://localhost:9090/metrics",
                "globalUrl": "http://192.168.0.17:9090/metrics",
                "lastError": "",
                "lastScrape": "2022-07-15T15:28:46.369824673Z",
                "lastScrapeDuration": 0.006802522,
                "health": "up",
            },
        ],
        "droppedTargets": [],
    },
}

Additional context

No response

sed-i commented 2 years ago

Add support for cross-controller/cross-cloud prometheus_scrape relation?

Background

The scraping relation was originally intended for in-model use only, with the idea that from a topology standpoint the correct approach is remote-write with grafana agent.

However, it should still be possible to use the prometheus_scrape relation cross-controller/cross-cloud.

bind_address

One option is to go back to using bind_address. However, there is still a juju bug that occasionally causes ops to return None for bind_address (on launchpad it is marked as "fix released", but the bug is still there). In some cases this delays charm startup by as long as the update-status hook interval.

external_url

Another option could be to add an optional argument, e.g. external_url to the MetricsEndpointProvider constructor, so users could pass it bind_address or their charm's self.ingress.url themselves.

We would need to rethink the *-notation, because we would need the external_url of every unit. One option is to have every unit update peer relation data.
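To make the external_url idea concrete, here is a minimal sketch of turning a per-unit URL into the pieces a scrape target needs. Everything in it is an assumption for illustration: the function name `scrape_target_from_url`, the returned dictionary keys, and the choice to append the metrics path to any path prefix in the URL. None of this is the actual MetricsEndpointProvider API.

```python
from urllib.parse import urlparse


def scrape_target_from_url(external_url: str, metrics_path: str = "/metrics") -> dict:
    """Split a unit's externally reachable URL into scrape-target parts.

    Hypothetical helper: the name and the returned keys are illustrative only.
    """
    parsed = urlparse(external_url)
    if not parsed.hostname:
        raise ValueError(f"external_url has no host: {external_url!r}")

    # Fall back to the scheme's default port when the URL does not carry one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)

    return {
        "scheme": parsed.scheme,
        "address": f"{parsed.hostname}:{port}",
        # Keep any path prefix (e.g. one added by an ingress) before the metrics path.
        "metrics_path": parsed.path.rstrip("/") + metrics_path,
    }


# One entry per unit, as each unit would have to publish its own URL.
urls_by_unit = {
    "kube-ovn/0": "http://10.1.2.3:10665",
    "kube-ovn/4": "http://10.1.2.4:10665",
}
targets = {unit: scrape_target_from_url(url) for unit, url in urls_by_unit.items()}
```

The per-unit dictionary mirrors the point above: a single wildcard entry is no longer enough once every unit has its own address.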

simskij commented 2 years ago

> One option is to have every unit update peer relation data.

We can't really demand that the remote charm implements this though.

> However, it should still be possible to use the prometheus_scrape relation cross-controller/cross-cloud.

I'm not sure about this. As I expressed at the end of our last conversation, I'm somewhat opposed to the whole idea of cross-controller scrapes. I think the "right" solution here also is the only actually viable solution.

sed-i commented 2 years ago

> We can't really demand that the remote charm implements this though.

Agreed, that is something prometheus_scrape would have to do on behalf of the charm.

> I'm somewhat opposed to the whole idea of cross-controller scrapes. I think the "right" solution here also is the only actually viable solution.

I wonder if we can detect a cross model relation and block with e.g. "Not supported; use remote-write instead".
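One way such a check might look, purely as a sketch: compare the local model UUID with the model UUID the remote side advertises (the `juju_model_uuid` label in the log output above suggests this value is available). The function names and the `(status, message)` return shape are hypothetical, and whether model UUIDs are a reliable signal for every cross-model setup is an assumption.

```python
from typing import Tuple

BLOCKED_MESSAGE = "Not supported; use remote-write instead"


def is_cross_model(local_model_uuid: str, remote_model_uuid: str) -> bool:
    """Treat a relation as cross-model when the two model UUIDs differ."""
    return local_model_uuid != remote_model_uuid


def status_for_scrape_relation(local_model_uuid: str, remote_model_uuid: str) -> Tuple[str, str]:
    """Return a (status, message) pair the charm could set for this relation.

    Hypothetical helper: a real charm would set a status object rather than
    return a plain tuple.
    """
    if is_cross_model(local_model_uuid, remote_model_uuid):
        return ("blocked", BLOCKED_MESSAGE)
    return ("active", "")
```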

balbirthomas commented 2 years ago

A PR that reverts to using bind_address by default, falling back to fqdn(), is available here.
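The fallback described above can be sketched roughly as follows. This is not the code from the PR: the helper name `resolve_scrape_host` and its parameters are assumptions for illustration. In a charm, the bind address would typically come from something like `self.model.get_binding(relation).network.bind_address`, which may be None while Juju has not yet reported it.

```python
import socket
from typing import Callable, Optional


def resolve_scrape_host(
    bind_address: Optional[str],
    fqdn_lookup: Callable[[], str] = socket.getfqdn,
) -> str:
    """Prefer the unit's bind address; fall back to the FQDN when it is missing.

    Hypothetical helper: `fqdn_lookup` is injectable only to make the
    fallback easy to exercise without depending on the local hostname.
    """
    # bind_address can be None (or empty) if the address is not available yet.
    if bind_address:
        return str(bind_address)
    return fqdn_lookup()
```

With this ordering, cross-cluster scrapes get an IP address whenever one is known, and the FQDN is only used when no bind address is available.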