@jeffreychang911 ty, I'm looking into this.
Is there a way for me to see the previous successful run for this? Sorry, I'm poking around a bit but I don't see it. I'm trying to figure out what the versions of cos-proxy and ceph-mon were the last time things were working.
Clarification to the original post: iiuc, the step in the CI that failed was deploying cos-proxy rev81 from latest/edge, based on the versions here:
cos-proxy charm 81 latest/edge jammy
rev81 is not yet in latest/candidate
https://solutions.qa.canonical.com/testruns/8e866b56-c6f3-4d81-bb78-2e8fe20c96db passed yesterday: K8s 1.29 with cos-proxy rev 74. And sorry, rev 81 is in the edge channel.
ty @jeffreychang911 !
My guess is this bug is at least partly related to how pydantic >=2 changed how to define optional values. The schema here is:
class ScrapeJobModel(BaseModel):
    class Config:
        extra = "allow"

    job_name: Optional[str] = Field(
        description="Name of the Prometheus scrape job, each job must be given a unique name & should be a fixed string (e.g. hardcoded literal)",
        default=None,
    )
    metrics_path: Optional[str] = Field(description="Path for metrics scraping.")
    static_configs: List[ScrapeStaticConfigModel] = Field(
        description="List of static configurations for scrape targets."
    )
where we see:
metrics_path: Optional[str] = Field(...)
In pydantic 1, that means metrics_path is optional. But in pydantic 2, the type annotation is not enough - there must be a default value set as well. For example:
metrics_path: Optional[str] = Field(description="Path for metrics scraping.", default=None)
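To illustrate the difference, here's a standalone snippet (not the charm's code, just a demo model, assuming pydantic>=2 is installed):

from typing import Optional

from pydantic import BaseModel, Field, ValidationError

class Demo(BaseModel):
    # Same shape as the field above: the annotation is Optional, but no default is set.
    metrics_path: Optional[str] = Field(description="Path for metrics scraping.")

class DemoFixed(BaseModel):
    metrics_path: Optional[str] = Field(description="Path for metrics scraping.", default=None)

try:
    Demo()  # pydantic 1 would happily construct this with metrics_path=None
except ValidationError as err:
    print(err)  # pydantic 2: "Field required" for metrics_path

print(DemoFixed())  # metrics_path=None, matching the old pydantic 1 behavior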
pydantic was bumped in #131 (rev73) though, and @jeffreychang911 reports rev74 has run previously and worked in their tests. So it feels like I'm still missing something here.
At minimum, the ScrapeJobModel schema needs to be updated for pydantic>=2, and probably a unit test should exist somewhere (in the interface repo? here? not sure...) that would have caught this.
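Something along these lines would do it, as a rough sketch only: the models are copied inline so the example is self-contained (a real test would import them from the charm or interface code), and the static_configs payload is just a guess at the shape ceph-mon sends over prometheus-target. Against the current pre-fix schema, the first test fails under pydantic 2, which is exactly this regression:

from typing import List, Optional

import pytest
from pydantic import BaseModel, Field, ValidationError

class ScrapeStaticConfigModel(BaseModel):
    # Simplified stand-in; the real model likely carries more fields.
    targets: List[str]

class ScrapeJobModel(BaseModel):
    model_config = {"extra": "allow"}

    job_name: Optional[str] = Field(default=None)
    metrics_path: Optional[str] = Field(default=None)
    static_configs: List[ScrapeStaticConfigModel]

def test_job_without_optional_fields_validates():
    # A target that only supplies static_configs must still validate under pydantic>=2.
    job = ScrapeJobModel(static_configs=[{"targets": ["10.0.0.1:9283"]}])
    assert job.metrics_path is None

def test_malformed_static_configs_rejected():
    with pytest.raises(ValidationError):
        ScrapeJobModel(static_configs="not-a-list")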
I reproduced with this reduced bundle (but probably more can be stripped out):
# https://solutions.qa.canonical.com/testruns/36c8abcc-10c7-420c-a622-5af284a8353e
# https://oil-jenkins.canonical.com/artifacts/36c8abcc-10c7-420c-a622-5af284a8353e/index.html
# > file: generated/generated/kubernetes-maas/bundle.yaml
applications:
  ceph-csi:
    charm: ceph-csi
  ceph-mon:
    channel: quincy/stable
    charm: ceph-mon
    num_units: 3
    options:
      customize-failure-domain: false
      expected-osd-count: 3
      nagios_additional_checks: '{"ceph_slow_requests": "slow requests"}'
      nagios_additional_checks_critical: true
      source: distro
    series: jammy
    to:
    - '0'
    - '1'
    - '2'
  ceph-osd:
    channel: quincy/stable
    charm: ceph-osd
    num_units: 3
    options:
      aa-profile-mode: complain
      autotune: false
      bluestore-compression-mode: none
      customize-failure-domain: false
      osd-devices: /dev/sda5 /dev/sdb2
      source: distro
    series: jammy
    to:
    - '3'
    - '4'
    - '5'
  cos-proxy:
    channel: latest/edge
    charm: cos-proxy
    num_units: 1
    series: jammy
    to:
    - '6'
  easyrsa:
    charm: easyrsa
    num_units: 1
    series: jammy
    to:
    - '7'
  grafana-agent:
    channel: latest/stable
    charm: grafana-agent
  logrotated:
    charm: logrotated
    options:
      logrotate-retention: 60
  nrpe:
    charm: nrpe
  ntp:
    charm: ntp
    options:
      source: ntp.ubuntu.com
      verify_ntp_servers: true
machines:
  '0':
    series: jammy
  '1':
  '2':
  '3':
  '4':
  '5':
    series: jammy
  '6':
    series: jammy
  '7':
    series: jammy
series: jammy
relations:
- ['ceph-csi:ceph-client', 'ceph-mon:client']
- ['ceph-mon:juju-info', 'logrotated:juju-info']
- ['ceph-mon:juju-info', 'nrpe:general-info']
- ['ceph-mon:nrpe-external-master', 'nrpe:nrpe-external-master']
- ['ceph-mon:osd', 'ceph-osd:mon']
- ['ceph-mon:prometheus', 'cos-proxy:prometheus-target']
- ['ceph-osd:juju-info', 'logrotated:juju-info']
- ['ceph-osd:juju-info', 'nrpe:general-info']
- ['ceph-osd:juju-info', 'ntp:juju-info']
- ['ceph-osd:nrpe-external-master', 'nrpe:nrpe-external-master']
- ['cos-proxy:juju-info', 'nrpe:general-info']
- ['cos-proxy:monitors', 'nrpe:monitors']
- ['easyrsa:juju-info', 'logrotated:juju-info']
- ['easyrsa:juju-info', 'nrpe:general-info']
- ['nrpe:nrpe-external-master', 'ntp:nrpe-external-master']
- ['grafana-agent', 'cos-proxy']
Ah, I think this is it. #131 bumped pydantic, but there was a bug here where self._get_scrape_configs was not called (it should be self._get_scrape_configs()). This was fixed in #137 here, but it appears cos-proxy's CI does not include testing this.
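For reference, the failure mode looks roughly like this; the names below are illustrative stand-ins, not the charm's actual code:

from typing import List

from pydantic import BaseModel, ValidationError

class ScrapeConfig(BaseModel):
    job_name: str

class ScrapeJobs(BaseModel):
    jobs: List[ScrapeConfig]

class Provider:
    def _get_scrape_configs(self) -> list:
        return [{"job_name": "ceph-mon"}]

    def broken(self) -> ScrapeJobs:
        # Missing "()" hands pydantic the bound method object instead of the list.
        return ScrapeJobs(jobs=self._get_scrape_configs)

    def fixed(self) -> ScrapeJobs:
        return ScrapeJobs(jobs=self._get_scrape_configs())

provider = Provider()
print(provider.fixed())   # jobs=[ScrapeConfig(job_name='ceph-mon')]
try:
    provider.broken()
except ValidationError as err:
    print(err)            # "Input should be a valid list" -> the hook error at relation-joined

So a unit test that actually builds the relation data, rather than only checking that the hook completed, would have flagged this immediately.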
Looking briefly into it, I see a few unit tests for the prometheus-target relation that I'd have expected to cover this case, but I guess not?
This is currently broken in edge, but the charm works in beta/candidate/stable, so we should not promote from edge to beta until we've addressed this issue
For anyone following this, the issue is in work this pulse and looks pretty easy to solve, so we should have something in the next 2 weeks.
@ca-scribner Somehow rev 81 was promoted to the stable channel yesterday, and this issue started to pop up in SolQA testing.
I'm currently facing this issue as well in deployments for customers. Would it be possible to update the charm currently in latest/stable with the fix?
Sorry, it looks like 81 got improperly promoted anyway. I've fixed it now and all latest/* risks have a working rev.
Bug Description
SolQA deploys cos-proxy within Charmed Kubernetes and relates it to a COS on top of MicroK8s, and the cos-proxy unit reports
hook failed: "prometheus-target-relation-joined"
To Reproduce
Deploy cos-proxy with Charmed Kubernetes 1.29 and relate it to COS Lite on top of MicroK8s.
Environment
This is cos-proxy rev 81 on the edge channel, with Charmed Kubernetes 1.29 and Juju 3.5.1.
Relevant log output
Additional context
Test run - https://solutions.qa.canonical.com/testruns/36c8abcc-10c7-420c-a622-5af284a8353e
Artifacts - https://oil-jenkins.canonical.com/artifacts/36c8abcc-10c7-420c-a622-5af284a8353e/index.html