canonical / cos-proxy-operator

https://charmhub.io/cos-proxy
Apache License 2.0

COS-proxy report `hook failed: "prometheus-target-relation-joined"` #140

Closed jeffreychang911 closed 5 months ago

jeffreychang911 commented 5 months ago

Bug Description

SolQA deploys cos-proxy within Charmed Kubernetes and relates it to a COS deployment on top of MicroK8s. The cos-proxy unit reports `hook failed: "prometheus-target-relation-joined"`.

To Reproduce

Deploy cos-proxy with Charmed Kubernetes 1.29 and relate it to COS Lite on top of MicroK8s.

Environment

This is cos-proxy rev 81 from the edge channel, on Charmed Kubernetes 1.29, with Juju 3.5.1.

Relevant log output

2024-06-11 07:24:27 DEBUG juju.worker.uniter.operation executor.go:135 executing operation "run relation-joined (21; unit: ceph-mon/0) hook" for cos-proxy/0
2024-06-11 07:24:27 DEBUG juju.worker.uniter agent.go:22 [AGENT-STATUS] executing: running prometheus-target-relation-joined hook for ceph-mon/0
2024-06-11 07:24:27 DEBUG juju.worker.uniter.runner runner.go:719 starting jujuc server  {unix @/var/lib/juju/agents/unit-cos-proxy-0/agent.socket <nil>}
2024-06-11 07:24:28 DEBUG unit.cos-proxy/0.juju-log server.go:325 prometheus-target:21: ops 2.15.0.dev0 up and running.
2024-06-11 07:24:28 ERROR unit.cos-proxy/0.juju-log server.go:325 prometheus-target:21: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/./src/charm.py", line 643, in <module>
    main(COSProxyCharm)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 546, in main
    manager = _Manager(charm_class, use_juju_for_storage=use_juju_for_storage)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 429, in __init__
    self.charm = self._make_charm(self.framework, self.dispatcher)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/ops/main.py", line 432, in _make_charm
    charm = self._charm_class(framework)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/./src/charm.py", line 133, in __init__
    scrape_configs=self._get_scrape_configs(),
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/./src/charm.py", line 264, in _get_scrape_configs
    target_job = ScrapeJobModel(**target_job_data)
  File "/var/lib/juju/agents/unit-cos-proxy-0/charm/venv/pydantic/main.py", line 176, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for ScrapeJobModel
metrics_path
  Field required [type=missing, input_value={'job_name': 'juju_kubern...nce', 'regex': '(.*)'}]}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/missing
2024-06-11 07:24:28 ERROR juju.worker.uniter.operation runhook.go:180 hook "prometheus-target-relation-joined" (via hook dispatching script: dispatch) failed: exit status 1

Additional context

Test run - https://solutions.qa.canonical.com/testruns/36c8abcc-10c7-420c-a622-5af284a8353e Artifacts - https://oil-jenkins.canonical.com/artifacts/36c8abcc-10c7-420c-a622-5af284a8353e/index.html

ca-scribner commented 5 months ago

@jeffreychang911 ty, I'm looking into this.

Is there a way for me to see the previous successful run for this? Sorry, I'm poking around a bit but I don't see it. I'm trying to figure out what the versions of cos-proxy and ceph-mon were the last time things were working.

ca-scribner commented 5 months ago

Clarification to the original post: IIUC, the CI step that failed was deploying cos-proxy rev 81 from latest/edge, based on the versions here:

cos-proxy charm 81 latest/edge jammy

rev 81 is not yet in latest/candidate.

jeffreychang911 commented 5 months ago

https://solutions.qa.canonical.com/testruns/8e866b56-c6f3-4d81-bb78-2e8fe20c96db passed yesterday: K8s 1.29 with cos-proxy rev 74. And sorry, rev 81 is in the edge channel.

ca-scribner commented 5 months ago

ty @jeffreychang911 !

ca-scribner commented 5 months ago

My guess is this bug is at least partly related to how pydantic >=2 changed the way optional fields are defined. The schema here is:

class ScrapeJobModel(BaseModel):
    class Config:
        extra = "allow"

    job_name: Optional[str] = Field(
        description="Name of the Prometheus scrape job, each job must be given a unique name &  should be a fixed string (e.g. hardcoded literal)",
        default=None,
    )
    metrics_path: Optional[str] = Field(description="Path for metrics scraping.")
    static_configs: List[ScrapeStaticConfigModel] = Field(
        description="List of static configurations for scrape targets."
    )

where we see: metrics_path: Optional[str] = Field(...)

In pydantic 1, that makes metrics_path optional (with an implicit default of None). But in pydantic 2, the Optional annotation is not enough; an explicit default must be set as well, for example: metrics_path: Optional[str] = Field(description="Path for metrics scraping.", default=None)
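
As a minimal sketch of that difference (the model names are illustrative, this assumes pydantic>=2 is installed, and it is not the charm's actual code):

from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class WithoutDefault(BaseModel):
    # Optional annotation only: pydantic 2 still treats this field as required
    metrics_path: Optional[str] = Field(description="Path for metrics scraping.")


class WithDefault(BaseModel):
    # Explicit default: the field is genuinely optional
    metrics_path: Optional[str] = Field(description="Path for metrics scraping.", default=None)


try:
    WithoutDefault()
except ValidationError as err:
    # Prints "Field required" for metrics_path, the same type=missing error as in the log above
    print(err)

print(WithDefault())  # metrics_path=None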

pydantic was bumped in #131 (rev 73), though, and @jeffreychang911 reports that rev 74 ran and worked previously in their tests. So it feels like I'm still missing something here.

At minimum, the ScrapeJobModel schema needs to be updated for pydantic>=2, and a unit test should probably exist somewhere (in the interface repo? here? not sure...) that would have caught this.
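
As a rough sketch of the kind of unit test that might catch this, using ops' Harness (the relation data keys below are assumptions based on the legacy prometheus-target interface, not something I've checked against the charm's existing tests):

from ops.testing import Harness

from charm import COSProxyCharm


def test_prometheus_target_relation_with_existing_data():
    harness = Harness(COSProxyCharm)
    # Populate a remote ceph-mon unit with target data before the charm starts,
    # mirroring the relation in the bug report
    rel_id = harness.add_relation("prometheus-target", "ceph-mon")
    harness.add_relation_unit(rel_id, "ceph-mon/0")
    harness.update_relation_data(rel_id, "ceph-mon/0", {"hostname": "10.0.0.1", "port": "9283"})
    # If the ScrapeJobModel schema is broken, the ValidationError raised during
    # charm __init__ propagates here and fails the test
    harness.begin_with_initial_hooks()
    harness.cleanup()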

ca-scribner commented 5 months ago

I reproduced this with the reduced bundle below (though more could probably be stripped out):

# https://solutions.qa.canonical.com/testruns/36c8abcc-10c7-420c-a622-5af284a8353e
# https://oil-jenkins.canonical.com/artifacts/36c8abcc-10c7-420c-a622-5af284a8353e/index.html
# > file: generated/generated/kubernetes-maas/bundle.yaml
applications:
  ceph-csi:
    charm: ceph-csi
  ceph-mon:
    channel: quincy/stable
    charm: ceph-mon
    num_units: 3
    options:
      customize-failure-domain: false
      expected-osd-count: 3
      nagios_additional_checks: '{"ceph_slow_requests": "slow requests"}'
      nagios_additional_checks_critical: true
      source: distro
    series: jammy
    to:
    - '0'
    - '1'
    - '2'
  ceph-osd:
    channel: quincy/stable
    charm: ceph-osd
    num_units: 3
    options:
      aa-profile-mode: complain
      autotune: false
      bluestore-compression-mode: none
      customize-failure-domain: false
      osd-devices: /dev/sda5 /dev/sdb2
      source: distro
    series: jammy
    to:
    - '3'
    - '4'
    - '5'
  cos-proxy:
    channel: latest/edge
    charm: cos-proxy
    num_units: 1
    series: jammy
    to:
    - '6'
  easyrsa:
    charm: easyrsa
    num_units: 1
    series: jammy
    to:
    - '7'
  grafana-agent:
    channel: latest/stable
    charm: grafana-agent
  logrotated:
    charm: logrotated
    options:
      logrotate-retention: 60
  nrpe:
    charm: nrpe
  ntp:
    charm: ntp
    options:
      source: ntp.ubuntu.com
      verify_ntp_servers: true
machines:
  '0':
    series: jammy
  '1':
  '2':
  '3':
  '4':
  '5':
    series: jammy
  '6':
    series: jammy
  '7':
    series: jammy
series: jammy
relations:
- ['ceph-csi:ceph-client', 'ceph-mon:client']
- ['ceph-mon:juju-info', 'logrotated:juju-info']
- ['ceph-mon:juju-info', 'nrpe:general-info']
- ['ceph-mon:nrpe-external-master', 'nrpe:nrpe-external-master']
- ['ceph-mon:osd', 'ceph-osd:mon']
- ['ceph-mon:prometheus', 'cos-proxy:prometheus-target']
- ['ceph-osd:juju-info', 'logrotated:juju-info']
- ['ceph-osd:juju-info', 'nrpe:general-info']
- ['ceph-osd:juju-info', 'ntp:juju-info']
- ['ceph-osd:nrpe-external-master', 'nrpe:nrpe-external-master']
- ['cos-proxy:juju-info', 'nrpe:general-info']
- ['cos-proxy:monitors', 'nrpe:monitors']
- ['easyrsa:juju-info', 'logrotated:juju-info']
- ['easyrsa:juju-info', 'nrpe:general-info']
- ['nrpe:nrpe-external-master', 'ntp:nrpe-external-master']
- ['grafana-agent', 'cos-proxy']
ca-scribner commented 5 months ago

Ah, I think this is it. #131 bumped pydantic, but there was a bug here where self._get_scrape_configs was not called (it should have been self._get_scrape_configs()). That was fixed in #137, but it appears cos-proxy's CI does not include a test that exercises this path.
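
For illustration, a hedged sketch of that class of bug (the names are hypothetical, not the actual charm code): passing the bound method instead of calling it means the scrape configs are never built, so the pydantic validation that fails in the traceback above was silently skipped until #137 added the call.

class ExampleCharm:
    def __init__(self):
        # Bug: the method object is passed and never invoked, so no validation runs
        self.scrape_configs_broken = self._get_scrape_configs
        # Fix (as in #137): call the method so the validated config list is produced
        self.scrape_configs_fixed = self._get_scrape_configs()

    def _get_scrape_configs(self):
        # In the real charm this is roughly where ScrapeJobModel(**target_job_data)
        # raises if a required field such as metrics_path is missing
        return [{"job_name": "example", "metrics_path": "/metrics", "static_configs": []}]


ExampleCharm()  # only the fixed path actually exercises the config-building code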

Looking briefly into it, I see a few unit tests for the prometheus-target relation that I'd have expected to cover this case, but I guess not?

ca-scribner commented 5 months ago

This is currently broken in edge, but the charm works in beta/candidate/stable, so we should not promote from edge to beta until we've addressed this issue

ca-scribner commented 5 months ago

For anyone following this: the issue is being worked on this pulse and looks pretty easy to solve, so we should have something in the next two weeks.

jeffreychang911 commented 4 months ago

@ca-scribner Somehow rev 81 was promoted to the stable channel yesterday, and this issue has started to pop up in SolQA testing.

odufourc commented 4 months ago

I'm currently facing this issue as well in deployments for customers. Would it be possible to update the charm in latest/stable with the fix?

ca-scribner commented 4 months ago

Sorry, it looks like rev 81 got improperly promoted anyway. I've fixed it now, and all latest/* risks have a working revision.