containerd / accelerated-container-image

A production-ready remote container image format (overlaybd) and snapshotter based on block-device.
Apache License 2.0

Overlaybd did not block when networking was unavailable for a long time #214

Open shuaichang opened 1 year ago

shuaichang commented 1 year ago

What happened in your environment?

We found a potential overlaybd bug: it returned incorrect data while networking was down. This can lead to application failures; in our case, Java failed to load a class.

What did you expect to happen?

When networking is down, class loading should block completely until the network recovers. Instead, we currently see "Exception: java.lang.NoClassDefFoundError" and "error reading zip file" after retrying for 3+ minutes.

We suspect a bug in overlaybd: it returned an unexpected result when it should have blocked until networking recovered. This is based on the following experiments:

  1. We ran systemctl stop overlaybd-tcmu, after which the jar command hung until overlaybd-tcmu recovered.
  2. With a normal jar stored on a device-mapper block device, if we suspended I/O on the DM device, the jar command hung until the I/O suspension was removed.
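For anyone who wants to repeat the second experiment, here is a minimal sketch of the suspend/resume step. The device name passed to the helpers and the `DRY_RUN` switch are assumptions added for illustration; they are not part of our original setup. Without `DRY_RUN=1` the commands need root and a real device-mapper device.

```shell
#!/bin/sh
# Sketch of experiment 2. The device name is whatever DM device holds the jar
# (hypothetical here). DRY_RUN=1 prints the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Suspend the device: further I/O to it is postponed until it is resumed,
# so readers wait instead of receiving an error.
suspend_io() { run dmsetup suspend "$1"; }

# Resume the device: postponed I/O is released and waiting readers continue.
resume_io() { run dmsetup resume "$1"; }
```

For example, `DRY_RUN=1 suspend_io demo-dev` prints `dmsetup suspend demo-dev`. Running `jar` against a file on the device between `suspend_io` and `resume_io` should hang for the whole interval, which is the behaviour we expected from overlaybd as well.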

How can we reproduce it?

/opt/overlaybd/snapshotter/ctr -n k8s.io rpull -u $USERNAME:$PASSWORD $IMAGE_REF

ctr -n k8s.io run --snapshotter=overlaybd --rm -t $IMAGE_REF test-jar bash

# Inside the shell, run the `jar` command to load the binary

After several minutes, we see the "error reading zip file" error:

root@ip-10-0-0-134:/# jar vft ./example.java.helloworld/Main.jar
/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar: error reading zip file
/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar: error reading zip file
Exception in thread "main"
Exception: java.lang.NoClassDefFoundError thrown from the UncaughtExceptionHandler in thread "main"
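When reproducing, it can help to tell "still blocked" apart from "returned an error". The helper below is only a sketch: the name `classify_read` and the time limit are assumptions for illustration, not something we used in the original report. It starts the reader in the background, waits for the limit, and never waits on the reader itself, because a process stuck in block I/O may not be killable.

```shell
#!/bin/sh
# classify_read LIMIT CMD [ARGS...]
# Prints one word after LIMIT seconds:
#   ok      - CMD finished successfully within LIMIT
#   failed  - CMD exited with an error within LIMIT (the behaviour reported here)
#   blocked - CMD was still running after LIMIT (the behaviour expected during an outage)
classify_read() {
    limit=$1
    shift
    status_file=$(mktemp) || return 1
    # Run the reader in the background and record its exit status when it ends.
    ( rc=0; "$@" || rc=$?; echo "$rc" > "$status_file" ) >/dev/null 2>&1 &
    sleep "$limit"
    if [ ! -s "$status_file" ]; then
        # No status recorded yet: the reader is still running. The status file
        # is left in place because the reader may still write to it later.
        echo blocked
    elif [ "$(cat "$status_file")" = "0" ]; then
        rm -f "$status_file"
        echo ok
    else
        rm -f "$status_file"
        echo failed
    fi
}
```

Inside the container this could be called as `classify_read 300 jar vft ./example.java.helloworld/Main.jar` while the network is down. With the behaviour described in this issue the result would be `failed`; with fully blocking I/O it would be `blocked`.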



What is the version of your Accelerated Container Image?

overlaybd 0.6.10

What is your OS environment?

Ubuntu

Are you willing to submit PRs to fix it?

- [ ] Yes, I am willing to fix it.
shuaichang commented 1 year ago

To add some more information: as suggested by @liulanzheng offline, the following diff plus an overlaybd rebuild fixed the issue.

image
shuaichang commented 1 year ago

Verified that 0.6.12 fixes the issue. Please feel free to close it. Thank you very much, @liulanzheng, for making the fix!