canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0
103 stars 50 forks source link

`full-bundle-tests.yaml` fail on self-hosted runners #788

Open orfeas-k opened 9 months ago

orfeas-k commented 9 months ago

Bug Description

At the moment, the behavior of full-bundle-tests.yaml on self-hosted runners is really flaky. We have seen them failing with the following errors:

  1. The following happens during Setup operator environment step.
    /usr/bin/sudo snap install core
    error: cannot install "core": Post "https://api.snapcraft.io/v2/snaps/refresh":
         dial tcp [18](https://github.com/canonical/bundle-kubeflow/actions/runs/7098729518/job/19405504840#step:8:19)5.125.188.58:443: connect: connection refused
  2. This happens during bundle deployment and while bundle has been deployed and it shows missing for all charms that it waits for. wait_for_idle acts like no charm has been deployed to the model it observes.
    INFO     juju.model:model.py:2947 Waiting for model:
    admission-webhook (missing)
    ... (missing)
  3. The action never starts. At the time of writing, this action has been in queue for 1 day and 10 minutes.

The above scenarios happen after we 've seen them running successfully (almost - one UAT failed), and with no change to the tests themselves whatsoever.

To Reproduce

Run the action https://github.com/canonical/bundle-kubeflow/actions/workflows/full-bundle-tests.yaml

Environment

self-hosted runners

Relevant Log Output

.

Additional Context

No response

syncronize-issues-to-jira[bot] commented 9 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5134.

This message was autogenerated

NohaIhab commented 9 months ago

thank you @orfeas-k also noting that the one UAT we observed to fail when the tests were able to run was training-integration and it's being tracked in https://github.com/canonical/charmed-kubeflow-uats/issues/50