containerd / nerdctl

contaiNERD CTL - Docker-compatible CLI for containerd, with support for Compose, Rootless, eStargz, OCIcrypt, IPFS, ...
Apache License 2.0

Retrying failed/interrupted commit gives error #1421

Open 2000yeshu opened 1 year ago

2000yeshu commented 1 year ago

Description

When a container commit is interrupted and then retried, the retry fails with the error: failed to export layer: snapshot "sha256:5ca359d74c4d65ad7dc2fc1013b3cccfa921f9000395ac846fd06a37c9a1a67e-parent-view": already exists. I think this happens because there is no garbage collection of the bolt KVs left behind by the interrupted commit; the error is thrown in containerd's bolt module.

Steps to reproduce the issue

  1. ctr container create docker.io/library/ubuntu:20.04 my-ubuntu
  2. sudo ctr task start my-ubuntu
  3. sudo nerdctl container exec -it my-ubuntu bash
  4. fallocate -l 50000000K test.txt
  5. sudo nerdctl commit my-ubuntu my-ubuntu-commited
  6. Press Ctrl+C (SIGINT) while the commit is still running
  7. sudo nerdctl commit my-ubuntu my-ubuntu-commited
yakul@yeshu:~$ sudo nerdctl commit my-ubuntu my-ubuntu-commited
WARN[0000] Image lacks label "nerdctl/platform", assuming the platform to be "linux/amd64" 
^C
yakul@yeshu:~$ sudo nerdctl commit my-ubuntu my-ubuntu-commited
WARN[0000] Image lacks label "nerdctl/platform", assuming the platform to be "linux/amd64" 
FATA[0000] failed to export layer: snapshot "sha256:cdca8156a203b9719f985c3114336529115cdc392f89d45cfcd37c968ddd3645-parent-view": already exists

Describe the results you received and expected

Received: The container is stuck in the PAUSED state and cannot be committed on the second attempt.

Expected: Either the container is committed successfully on the second attempt or, if not, it falls back to the RUNNING state.

What version of nerdctl are you using?

Client: Version: v0.20.0 OS/Arch: linux/amd64 Git commit: e77e05b5fd252274e3727e0439e9a2d45622ccb9

Server: containerd: Version: 1.6.6 GitCommit: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1

What version of ctr are you using?

Client: Version: 1.6.6 Revision: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1 Go version: go1.17.11

Are you using a variant of nerdctl? (e.g., Rancher Desktop)

No response

Host information

No response

Zheaoli commented 1 year ago

I cannot reproduce the bug. Would you mind giving us more specific reproduction steps?

2000yeshu commented 1 year ago

I have updated the issue description with steps I followed to reproduce the bug.

2000yeshu commented 1 year ago

I think the commit context should have signal handlers, so that any garbage keys created by bolt are deleted when a commit is interrupted or fails before completion.
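To make the idea concrete, here is a minimal Go sketch of signal-aware cleanup. It is not nerdctl or containerd code: keyStore, commit, and retryAfterInterrupt are hypothetical names, and the in-memory map only stands in for the bolt-backed metadata. The one real API it relies on is signal.NotifyContext from the Go standard library (Go 1.16+), which cancels a context when a listed signal arrives.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// errAlreadyExists mimics the "already exists" failure from the report.
var errAlreadyExists = errors.New("already exists")

// keyStore is a hypothetical in-memory stand-in for the snapshot metadata.
type keyStore struct {
	mu   sync.Mutex
	keys map[string]bool
}

func newKeyStore() *keyStore { return &keyStore{keys: map[string]bool{}} }

// create fails if the key is still present, like the retried commit does.
func (s *keyStore) create(key string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.keys[key] {
		return fmt.Errorf("snapshot %q: %w", key, errAlreadyExists)
	}
	s.keys[key] = true
	return nil
}

func (s *keyStore) remove(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.keys, key)
}

// commit creates a temporary key, simulates the layer export, and removes
// the key on every exit path, including cancellation of ctx.
func commit(ctx context.Context, s *keyStore, key string, work time.Duration) error {
	if err := s.create(key); err != nil {
		return err
	}
	defer s.remove(key)

	select {
	case <-time.After(work): // simulated export finished
		return nil
	case <-ctx.Done(): // interrupted: the deferred remove still runs
		return ctx.Err()
	}
}

// retryAfterInterrupt simulates an interrupted commit followed by a retry.
func retryAfterInterrupt() (first, second error) {
	s := newKeyStore()
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // stands in for Ctrl+C arriving during the commit
	first = commit(ctx, s, "example-parent-view", time.Second)
	second = commit(context.Background(), s, "example-parent-view", time.Millisecond)
	return first, second
}

func main() {
	// Cancel ctx on SIGINT or SIGTERM instead of dying mid-commit.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	s := newKeyStore()
	if err := commit(ctx, s, "example-parent-view", 2*time.Second); err != nil {
		fmt.Fprintln(os.Stderr, "commit interrupted:", err)
		return
	}
	fmt.Println("commit finished")
}
```

In this sketch the retry succeeds because the deferred remove has already deleted the temporary key. Without that defer, the second create would fail with an "already exists" error, similar to the one reported above.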

2000yeshu commented 1 year ago

I opened a PR in containerd related to this.

lujinda commented 2 weeks ago

I also encountered the same problem. If nerdctl correctly cancels the context when it receives SIGTERM or SIGINT, most of these problems can be solved: in nerdctl's Commit function, a cancelled context gives the code a chance to release the Lease.

In addition, interrupting the commit command leaves a container that was originally RUNNING in the PAUSED state, which may also be a problem.
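One way the PAUSED problem could be avoided is to resume the container in a deferred function that uses its own context, since the commit context is already cancelled by the time cleanup runs. The Go sketch below illustrates that pattern only. The task type, commitPaused, and stateAfterInterruptedCommit are hypothetical and are not the containerd or nerdctl API; releasing a lease could presumably follow the same deferred pattern.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// task is a hypothetical stand-in for a container task.
type task struct{ state string }

func (t *task) pause(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	t.state = "PAUSED"
	return nil
}

func (t *task) resume(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	t.state = "RUNNING"
	return nil
}

// commitPaused pauses the task, runs export, and resumes the task on every
// exit path, even when ctx was cancelled by a signal.
func commitPaused(ctx context.Context, t *task, export func(context.Context) error) (err error) {
	if err := t.pause(ctx); err != nil {
		return err
	}
	defer func() {
		// ctx may already be cancelled, so cleanup gets a fresh,
		// time-limited context of its own.
		cleanupCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		if rerr := t.resume(cleanupCtx); rerr != nil && err == nil {
			err = rerr
		}
	}()
	return export(ctx)
}

// stateAfterInterruptedCommit reports the task state after a cancelled commit.
func stateAfterInterruptedCommit() (string, error) {
	t := &task{state: "RUNNING"}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	err := commitPaused(ctx, t, func(ctx context.Context) error {
		cancel() // stands in for SIGINT arriving mid-export
		<-ctx.Done()
		return ctx.Err()
	})
	return t.state, err
}

func main() {
	state, err := stateAfterInterruptedCommit()
	fmt.Println(state, err)
}
```

Here the export fails with a cancellation error, yet the task ends up RUNNING again. Reusing the cancelled ctx for the resume call would make resume fail immediately, which is why the cleanup uses a separate context.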