containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0
22.4k stars 2.31k forks source link

Fix hang boot podman #22985

Open lambinoo opened 3 weeks ago

lambinoo commented 3 weeks ago

This is an attempt at fixing #22984.

It seems that if some background tasks are queued in libpod's Runtime before the worker's channel is set up (eg. in the refresh phase), they are not executed later on, but the workerGroup's counter is still ticked up. This leads podman to hang when the imageEngine is shutdown, since it waits for the workerGroup to be done.

I'm not really sure why the function to queue a job works, as the queue object shouldn't exist, but I think that might be caused by my very sparse Go knowledge.

I would probably need some guidance to write a test for that one, as my test workflow has been:

  1. Make a change to podman
  2. reboot -f
  3. Login and run journalctl list-jobs to monitor the boot process

And I'm honestly a bit confused at how do the reboot part in the tests.

None
openshift-ci[bot] commented 3 weeks ago

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
openshift-ci[bot] commented 3 weeks ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lambinoo Once this PR has been reviewed and has the lgtm label, please assign mheon for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files: - **[OWNERS](https://github.com/containers/podman/blob/main/OWNERS)** Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
baude commented 3 weeks ago

@mheon ptal

mheon commented 2 weeks ago

I don't actually know why we're adding jobs to the channel in a goroutine, but it does make things convenient for this case. I imagine it ought to be all right - we're effectively queuing jobs for completion once the parallel executor comes up.

Though perhaps we should just make the parallel executor come up sooner in runtime init? I don't think it has any dependencies on things like c/storage - is there a reason not to initialize it as the very first thing a runtime does?

Luap99 commented 2 weeks ago

I don't actually know why we're adding jobs to the channel in a goroutine, but it does make things convenient for this case. I imagine it ought to be all right - we're effectively queuing jobs for completion once the parallel executor comes up.

I think his was caused by us adding removal code to the refresh logic: https://github.com/containers/podman/commit/9983e87440760883b2a8ecd83d93f1b0dd78d44a

My best guess is that we remove some container in a pod that then causes the container/pod stop/removal logic to add jobs to the queue.

Though perhaps we should just make the parallel executor come up sooner in runtime init? I don't think it has any dependencies on things like c/storage - is there a reason not to initialize it as the very first thing a runtime does?

Well it depends what jobs are submitted, I think it makes sense to wait at least for the runtime.valid = true step. But one could also say nobody should submit jobs that depend on the runtime unless they already have a valid runtime...

In general it is really not well defined how the queue is supposed to work.

lambinoo commented 2 weeks ago

I don't actually know why we're adding jobs to the channel in a goroutine, but it does make things convenient for this case. I imagine it ought to be all right - we're effectively queuing jobs for completion once the parallel executor comes up.

Though perhaps we should just make the parallel executor come up sooner in runtime init? I don't think it has any dependencies on things like c/storage - is there a reason not to initialize it as the very first thing a runtime does?

At first, I did just move the setup worker function up, but podman crashed in the jobs, hence why I moved the queue setup in a separate function.