canonical / mysql-k8s-operator

A Charmed Operator for running MySQL on Kubernetes
https://charmhub.io/mysql-k8s
Apache License 2.0

DPE-3557 Workaround pvc transient error #465

Closed paulomach closed 4 months ago

paulomach commented 4 months ago

Issue

Integration tests often fail on a PVC failing to be created.

Solution

Work around it by not raising an error on wait_for_idle when run after mysql-k8s deploy/scale commands (i.e. when the PVC is created).
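As a sketch of what the workaround looks like in a pytest-operator style test (`wait_until_settled` is a hypothetical helper name, not from this repo; `raise_on_error` is a real parameter of python-libjuju's `Model.wait_for_idle`):

```python
import asyncio

async def wait_until_settled(model, app_name: str, timeout: int = 15 * 60):
    """Wait for the app after deploy/scale, tolerating transient error states.

    Hypothetical helper: the key change is raise_on_error=False, so a unit
    briefly entering error state over a failed PVC creation does not abort
    the test immediately.
    """
    await model.wait_for_idle(
        apps=[app_name],
        status="active",
        raise_on_error=False,  # workaround: don't fail fast on transient errors
        timeout=timeout,
    )

# Demo with a stand-in model object; a real test would use ops_test.model.
class FakeModel:
    def __init__(self):
        self.kwargs = None

    async def wait_for_idle(self, **kwargs):
        self.kwargs = kwargs  # record the call instead of talking to juju

model = FakeModel()
asyncio.run(wait_until_settled(model, "mysql-k8s"))
print(model.kwargs["raise_on_error"])  # False
```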

Credits to @lucasgameiroborges

Fixes #378

taurus-forever commented 4 months ago

If this really helps, why is the error state still failing this?

Was it a permanent error state rather than a temporary one?

lucasgameiroborges commented 4 months ago

Hi guys, here to provide some context for why this change was adopted over on PostgreSQL side:

If this really helps, why is the error state still failing this?

@taurus-forever It seems the change was missing in this particular wait_for_idle: https://github.com/canonical/mysql-k8s-operator/blob/ea997403bb0689886e7f5ac55888535d8304eb6c/tests/integration/test_charm.py#L65C1-L71C10

I am a little worried about the charms going into error state (for reasons other than PVC creation), and then update_status eventually coming along and setting the status to active via retries, and us not catching this case due to raise_on_error=False.

@shayancanonical Dragomir (who originally proposed this change on my PR; I had a different approach at first) and I discussed the implications of not raising errors on the wait_for_idle calls. The main point is: it's not that we'll completely ignore the errors, it's just that they won't get raised in the very first wait_for_idle after charm deploy. Sure, transient errors like the PVC one might get ignored completely, but persistent errors will either cause the wait_for_idle to time out because the charm never reaches active status, or break something further ahead in the test. About the status becoming active from retries, wouldn't that mean the charm actually recovered during said retries? Unless the hooks are setting active status when they shouldn't.
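The argument above, that transient errors get skipped while persistent errors still surface as a timeout, can be illustrated with a toy polling loop (a simplified sketch of the idea, not libjuju's actual implementation):

```python
def wait_for_idle_sim(statuses, raise_on_error=True, max_polls=10):
    """Toy model of a wait_for_idle polling loop (illustrative only).

    `statuses` is the unit status observed at each poll. Returns the number
    of polls it took to reach "active".
    """
    for poll, status in enumerate(statuses):
        if status == "error" and raise_on_error:
            raise RuntimeError("unit in error state")  # fail-fast behaviour
        if status == "active":
            return poll
        if poll + 1 >= max_polls:
            break
    raise TimeoutError("unit never reached active status")

# Transient error: with raise_on_error=False the test proceeds once the
# unit recovers.
assert wait_for_idle_sim(["error", "maintenance", "active"],
                         raise_on_error=False) == 2

# Persistent error: still caught, just later, via the timeout.
try:
    wait_for_idle_sim(["error"] * 100, raise_on_error=False)
except TimeoutError:
    print("timed out as expected")
```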

In any case, this is not a proper fix by any means, but rather a workaround to avoid annoying transient errors right after charm deploy in CI.

paulomach commented 4 months ago

I am a little worried about the charms going into error state (for reasons other than PVC creation), and then eventually ~update_status coming along and setting~ the status becoming active from retries and us not catching this case due to the raise_on_error=False

As Lucas said, we will still catch the errors, but only when the timeout is reached. IMO, if we can reach better stability, it's a valid workaround.