balchua / microk8s-actions

Bootstrap MicroK8s with Github Actions

exec: error: timed out waiting for the condition #20

Open barrettj12 opened 1 year ago

barrettj12 commented 1 year ago

Our workflow:

        uses: balchua/microk8s-actions@1e8e626239c2befe7cd5d258c96ae152a7259c74
        with:
          channel: "1.25-strict/stable"
          addons: '["dns", "hostpath-storage"]'

Logs:

Waiting for hostpath-storage to be ready 
exec sudo microk8s kubectl rollout status deployment/hostpath-provisioner -n kube-system --timeout=90s { silent: true }
Error: exec: error: timed out waiting for the condition

See the full run here.

balchua commented 12 months ago

@barrettj12 thanks for logging the issue. I haven't tried hostpath-storage with strict confinement. Let me try that out and I'll get back to you.

balchua commented 12 months ago

@barrettj12 I couldn't reproduce the error you are getting; I tried it with several MicroK8s versions. See the jobs here.

But as a precaution, I increased the timeout to 120s, hoping that will alleviate this scenario. Do you mind trying the new release? https://github.com/balchua/microk8s-actions/releases/tag/v0.4.1

Thanks,

barrettj12 commented 12 months ago

@balchua I'll try that, thanks. How about making that timeout configurable, so users can find their own value that works?

balchua commented 12 months ago

Thanks. I was thinking of that while working on this issue. Do you have an idea on how to present this knob to the user? Appreciate your thoughts.

barrettj12 commented 12 months ago

Probably just an input to the action would be fine:

        uses: balchua/microk8s-actions@1e8e626239c2befe7cd5d258c96ae152a7259c74
        with:
          channel: "1.25-strict/stable"
          addons: '["dns", "hostpath-storage"]'
          timeout: 120s

balchua commented 12 months ago

I was wondering whether the action should even check for the readiness of the addon. Perhaps we can leave that to the user, as sketched below.
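
For example, something like this in the user's workflow would keep the check but put the timeout under the user's control (a hypothetical step; the step name and the 300s value are made up, but the rollout command is the same one the action runs today):

      - uses: balchua/microk8s-actions@v0.4.1
        with:
          channel: "1.25-strict/stable"
          addons: '["dns", "hostpath-storage"]'
      - name: Wait for hostpath-storage
        run: |
          # Same readiness check the action currently performs,
          # with whatever timeout suits the runner
          sudo microk8s kubectl rollout status deployment/hostpath-provisioner \
            -n kube-system --timeout=300s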

SimonRichardson commented 12 months ago

As @barrettj12 has pointed out, this is happening a lot in the Juju project. Is there a way to expose, just from the test output, why it's not passing in the given time? It seems there is an underlying issue that we're not surfacing.
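
For example, a follow-up step that dumps cluster state when the job fails would surface the reason (a sketch using standard kubectl commands; the step name is made up):

      - name: Dump cluster state on failure
        if: failure()
        run: |
          # Show pod status, deployment conditions, and recent events
          # so the logs explain why the rollout never became ready
          sudo microk8s kubectl get pods -A
          sudo microk8s kubectl describe deployment/hostpath-provisioner -n kube-system
          sudo microk8s kubectl get events -n kube-system --sort-by=.lastTimestamp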

balchua commented 12 months ago

The existing code waits for the hostpath provisioner to be ready and fails when it times out. I think it shouldn't fail the build when it's not ready; however, that may give the user the impression that everything is OK when it's not, hence I added the check.
I guess it is doing more harm than good. So what I'll do is keep waiting for a specified amount of time (it's 120s for now), but not fail the build when it's still not ready.
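
On the user side, the same "wait but don't fail" behavior can already be expressed with a separate step (a sketch, not part of the action):

      - name: Wait for hostpath-storage (non-fatal)
        # continue-on-error keeps the job green even if the rollout times out
        continue-on-error: true
        run: |
          sudo microk8s kubectl rollout status deployment/hostpath-provisioner \
            -n kube-system --timeout=120s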

barrettj12 commented 11 months ago

I think 120s will still be too short in some cases. We are now running self-hosted runners, which may have surfaced this issue.

If things are ready before the timeout, the command will still exit early, right? In which case, it should be safe to set the timeout to something large, like 10 minutes, since it should never take that long. See the example below.
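
With the proposed input, that would look something like this (hypothetical; the timeout input doesn't exist yet):

        uses: balchua/microk8s-actions@1e8e626239c2befe7cd5d258c96ae152a7259c74
        with:
          channel: "1.25-strict/stable"
          addons: '["dns", "hostpath-storage"]'
          # Generous ceiling: rollout status returns as soon as the
          # deployment is ready, so this only matters on failure
          timeout: 10m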

balchua commented 11 months ago

> If things are ready before the timeout, the command will still exit early, right? In which case, it should be safe to set the timeout to something large, like 10 minutes, since it should never take that long.

Yes, it will exit early. Thanks for the feedback! I will probably implement it this way.

barrettj12 commented 11 months ago

@balchua I just tested v0.4.1; unfortunately, it doesn't solve our issue. See some failed runs here:
https://github.com/juju/juju/actions/runs/6171575601/job/16750097840?pr=16242
https://github.com/juju/juju/actions/runs/6171575585/job/16750097890?pr=16242

The first one in particular is really strange: we're getting what looks like a stack trace.

balchua commented 11 months ago

You are right. The first one ran for 2h before it threw that exception; I've never come across such an error, nor one that took that long. The second one took 12m before it finally gave up. Both errors are strange. Could it be that there's something wrong with your runner host?