canonical / charmed-kubeflow-chisme

Shared Utilities used across Charmed Kubeflow
Apache License 2.0
1 stars 4 forks source link

O11y testing abstraction is broken #104

Closed rgildein closed 4 weeks ago

rgildein commented 1 month ago

Bug Description

The charm grafana-agent-k8s was moved back to revision 45 in stable channel. (e.g. this action shown that rev 68 was used before) This cause an issue, since status message is different from it was expected.

After some discussion in team, we think the abstraction she should not obey the workload status message after grafana-agent-k8s is deployed. It means that these lines should be removed.

Currently, the workaround is to deploy grafana-agent-k8s from latest/edge channel , e.g. deploy_and_assert_grafana_agent(..., channel="latest/edge", ...)

To Reproduce

tox -e integration

Environment

Juju 3.4+ Microk8s 1.24+

Relevant Log Output

File "/home/runner/work/kfp-operators/kfp-operators/.tox/integration/lib/python3.8/site-packages/charmed_kubeflow_chisme/testing/cos_integration.py", line 116, in deploy_and_assert_grafana_agent
    assert GRAFANA_AGENT_MESSAGE.match(msg), error_msg
AssertionError: grafana-agent-k8s did not reach expected state. 'grafana-cloud-config: off, logging-consumer: off' != 'Missing (?=.*\['grafana-cloud-config'\]|\['grafana-dashboards-provider'\] for grafana-dashboards-consumer)(?=.*\['grafana-cloud-config'\]|\['logging-consumer'\] for logging-provider)(?=.*\['grafana-cloud-config'\]|\['send-remote-write'\] for metrics-endpoint)'

Additional Context

No response

syncronize-issues-to-jira[bot] commented 1 month ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5896.

This message was autogenerated

DnPlas commented 1 month ago

As @rgildein mentioned in the description of the issue, this error will go away if we remove these lines from the code.

I have a few comments on the overall design of this method, though.

deploy_and_assert_grafana_agent will deploy the grafana-agent-k8s charm into the testing model, then depending on the charm, some integrations will be added (metrics, dashboard, logging), and finally we will check if the grafana-agent-k8s unit goes into BlockedStatus. We expect it to go to this status because we are not integrating it with any of the "consumers", that is, none of the COS charms like loki or grafana are deployed and integrated in our testing framework, so by design grafana-agent-k8s will just block (hence the logging-consumer: off, grafana-cloud-config: off message).

Across the cos_integration code, the grafana-agent-k8s charm is only used for curling the metrics endpoint of the charm under test.

It looks to me that:

  1. Except for curling from the unit, the grafana-agent-k8s charm is not used at all, we are just deploying and integrating it with the charm under test, but not actual functionality is used during the execution of the test.
  2. We cannot rely on the grafana-agent-k8scharm unit going into BlockedStatus based on the knowledge we have about the charm. It is a not the best practice to rely on a unit status other thanActive` as it can be affected by different scenarios and it is prone to change (as we saw with this issue).
  3. Maybe we do not even need the grafana agent at all if all we are doing is checking the metrics endpoint can be reached.
DnPlas commented 4 weeks ago

Re-opening as we still need to bump chisme versions in repos where this code is used.

DnPlas commented 4 weeks ago

Chisme 0.4.1 is out, and is now updated in all repos. We can close this task.