canonical/kubeflow-dashboard-operator

Operator for Kubeflow Dashboard
Apache License 2.0

tests: refactor build and deploy test case #179

Open DnPlas opened 4 months ago

DnPlas commented 4 months ago

Remove the assertion that the kubeflow-dashboard-operator goes to Blocked when the relation with kubeflow-profiles is missing. This assertion is already covered by unit tests, and it only adds noise at deployment time.

This commit also ensures that dependencies (charms and relations) are deployed up front, giving everything time to be set up before moving ahead with the other test cases.

ca-scribner commented 4 months ago

@DnPlas can you give more context? afaict #178 is failing for legitimate reasons. That charm is hitting an error during a relation-changed event, not going to Blocked.

In principle I think I'm with you that a unit test is sufficient to check that we go Blocked when the relation is missing. But now I'm wondering if, by removing this wait_for_idle, we end up hiding another error entirely.

DnPlas commented 4 months ago

> @DnPlas can you give more context? afaict #178 is failing for legitimate reasons. That charm is hitting an error during a relation-changed event, not going to Blocked.

Yes, the reason why #178 is failing (and potentially any other CI run) is that the kubeflow-profile-controller is in maintenance mode, reporting Waiting for pod startup to complete. That can cause kubeflow-dashboard to fail with hook failed: "kubeflow-profiles-relation-changed", because the dashboard charm depends on the profiles info (see here). By deploying both kubeflow-dashboard and kubeflow-profiles at the same time and waiting for idle on BOTH charms, we avoid this.

> In principle I think I'm with you that a unit test is sufficient to check that we go Blocked when the relation is missing. But now I'm wondering if, by removing this wait_for_idle, we end up hiding another error entirely.

It's still there; we are actually waiting for BOTH charms to be active and idle, with a timeout of 600s.

ca-scribner commented 4 months ago

The CI is terminating because kubeflow-dashboard is in error:

kubeflow-dashboard/0* error idle 10.1.34.139 hook failed: "kubeflow-profiles-relation-changed"

juju.errors.JujuUnitError: Unit in error: kubeflow-dashboard/0

It should not be possible for this sort of error to be caused by kubeflow-profiles in any way (whether it is in maintenance mode, etc.). So what I'm saying is that the CI is catching a legitimate error here, where kubeflow-dashboard fails to handle the relation-changed event, and changing the CI to suppress it is not what we should do.

DnPlas commented 4 months ago

Thanks @ca-scribner for the review; agreed on the model config setting. I'll update the PR shortly.