docker / cli

The Docker CLI
Apache License 2.0
4.88k stars 1.92k forks source link

Fix hang when container fails to start #5060

Closed laurazard closed 5 months ago

laurazard commented 5 months ago

- What I did

fixes https://github.com/docker/cli/issues/5053

When running a container, if it fails to start, we immediately cancel the context used for waitExitOrRemoved and, if running the container with --rm, wait on the channel:

https://github.com/docker/cli/blob/870ad7f4b9910341f30dd2632db4375858455d8a/cli/command/container/run.go#L196-L205

However, in https://github.com/docker/cli/commit/840016ea0504fb1c616a3af2f8d48bd16a7400f4, we added the following line https://github.com/docker/cli/blob/647ccf34334f76b7cc668f485143af4e11403c20/cli/command/container/utils.go#L40-L41

but in this case, we return from the goroutine without notifying the caller, which causes the caller to hang.

- How I did it

We should not be cancelling the context used to wait on the container exit/removal in this case, and it's only happening by accident due to this line, which is there for other reasons https://github.com/docker/cli/blob/05cba3f7b97c73d49675a3e5adb1118fe0c79867/cli/command/container/run.go#L196-L199

So I used a new context for the waitExitOrRemoved call.

Also, close the returned channel to unblock any caller when we hit this branch, and added another close to a case down where we added another return without notifying the caller.

- How to verify it

Reproduce such as:

#!/bin/bash
while :; do
    docker run --rm --entrypoint invalidcommand alpine param
done

- Description for the changelog

Fix issue where the CLI process would sometimes hang when a container failed to start.

- A picture of a cute animal (not mandatory but encouraged)

image

laurazard commented 5 months ago

Should also add a test that would catch a regression such as this, but I'll leave that for tomorrow 😅

codecov-commenter commented 5 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 61.07%. Comparing base (647ccf3) to head (8d6e571).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #5060 +/- ## ======================================= Coverage 61.06% 61.07% ======================================= Files 295 295 Lines 20667 20670 +3 ======================================= + Hits 12621 12624 +3 Misses 7147 7147 Partials 899 899 ```
thaJeztah commented 5 months ago

Will this affect client-side handling of --rm (API 1.25 and older)? I noticed that's handled in this area of the code, and I recall there was some rather complex logic around this.

laurazard commented 5 months ago

I don't think it should, that check/path happens before the changes in this patch/the other changes right? https://github.com/docker/cli/blob/647ccf34334f76b7cc668f485143af4e11403c20/cli/command/container/utils.go#L26-L28

laurazard commented 5 months ago

Agree! I'll write a ticket somewhere for us to remember to do that, should look at all the tests and check that.

thaJeztah commented 5 months ago

Thanks! Yes, looks like a tracking ticket won't hurt. In most cases it probably shouldn't be problematic, but it could explain some flakiness (containers from other tests still being around, causing tests to be inconsistent).

In either case, not critical for this PR; I think this PR should be good to go