containerd / nydus-snapshotter

A containerd snapshotter with data deduplication and lazy loading in P2P fashion
https://nydus.dev/
Apache License 2.0

go.mod: github.com/containerd/containerd v1.7.18, switch to github.com/containerd/errdefs module #599

Open thaJeztah opened 1 week ago

thaJeztah commented 1 week ago

go.mod: github.com/containerd/containerd v1.7.18

full diff: https://github.com/containerd/containerd/compare/v1.7.7...v1.7.18

switch to github.com/containerd/errdefs module

containerd 1.7.18 and up alias the errdefs package to the new module and deprecate the old in-tree package.
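
For downstream consumers, the switch is mostly a one-line import change, since the standalone module keeps the same identifiers. A minimal sketch (assuming only the error sentinels and helpers are used):

```go
package main

import (
	"fmt"

	// Previously: "github.com/containerd/containerd/errdefs"
	// (deprecated and aliased to the standalone module in containerd 1.7.18+).
	"github.com/containerd/errdefs"
)

func main() {
	// Sentinel errors and helpers keep their names, so most call sites
	// only need the import path updated.
	err := fmt.Errorf("snapshot: %w", errdefs.ErrNotFound)
	fmt.Println(errdefs.IsNotFound(err)) // true
}
```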

thaJeztah commented 1 week ago

Hmm... looks like there are failures in CI related to OTEL; it looks like multiple metrics servers are being set up on port 9110:

time="2024-06-26T10:15:05.945761805Z" level=info msg="loading plugin \"io.containerd.internal.v1.tracing\"..." type=io.containerd.internal.v1

...

time="2024-06-26T10:15:05.946779174Z" level=error msg="failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"

...

umount globally shared mountpoint
umount: /var/lib/containerd-nydus/mnt: no mount point specified.
ls: cannot access '/run/containerd-nydus/containerd-nydus-grpc.sock': No such file or directory
Fail(1). Retrying...
time="2024-06-26T10:15:08.579292914Z" level=info msg="Start nydus-snapshotter. Version: fba89c3, PID: 143, FsDriver: fusedev, DaemonMode: multiple"
time="2024-06-26T10:15:08.581727603Z" level=info msg="Trying to translate bucket records..."
time="2024-06-26T10:15:08.581798015Z" level=info msg="Trying to update bucket records from v1.0 to v1.1 ..."
time="2024-06-26T10:15:08.582908629Z" level=info msg="parsed cgroup config: cgroup.Config{MemoryLimitInBytes:-1}"
time="2024-06-26T10:15:08.582980705Z" level=info msg="cgroup mode: legacy"
time="2024-06-26T10:15:08.583967256Z" level=info msg="create cgroup (v1) successful, state: thawed"
time="2024-06-26T10:15:08.584895731Z" level=info msg="Run daemons monitor..."
time="2024-06-26T10:15:08.585325775Z" level=fatal msg="failed to start nydus-snapshotter" error="failed to initialize snapshotter: start metrics HTTP server: metrics server listener, addr=:9110: listen tcp :9110: bind: address already in use"
ls: cannot access '/run/containerd-nydus/containerd-nydus-grpc.sock': No such file or directory
thaJeztah commented 1 week ago

Looks related to this code, but I'm wondering now if the snapshotter itself should set up a metrics server, or if that's something that only containerd should do: https://github.com/containerd/nydus-snapshotter/blob/71ab64ad42d30c9b1538418ba041df0d1191d4f1/snapshot/snapshot.go#L180-L200
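
For context, the linked lines appear to start a snapshotter-side metrics server when an address is configured. A minimal paraphrase of that pattern as I read it from the error message (the function name and promhttp wiring below are my own sketch, not the actual snapshot.go code):

```go
package main

import (
	"log"
	"net"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// startMetricsServer paraphrases the snapshotter-side metrics setup:
// if an address is configured, bind a listener and serve Prometheus
// metrics on it. A second instance binding the same address fails
// exactly like the CI log above:
//   "listen tcp :9110: bind: address already in use"
func startMetricsServer(addr string) error {
	if addr == "" {
		return nil // metrics disabled when no address is configured
	}
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return err // e.g. "bind: address already in use"
	}
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())
	go func() {
		if err := http.Serve(ln, mux); err != nil {
			log.Printf("metrics server stopped: %v", err)
		}
	}()
	return nil
}

func main() {
	if err := startMetricsServer(":9110"); err != nil {
		log.Fatalf("failed to start nydus-snapshotter: %v", err)
	}
	select {} // keep the process alive
}
```

Under this reading, disabling metrics in a test would amount to leaving the address empty so the listener is never bound.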

thaJeztah commented 1 week ago

Code above was added in https://github.com/containerd/nydus-snapshotter/commit/fe511b89ba033ab189bc2e4b08c7bcfb50bd132b, which is part of this PR:

thaJeztah commented 1 week ago

cc @cpuguy83 @dmcgowan perhaps one of you knows?

sctb512 commented 1 week ago

Code above was added in fe511b8, which is part of this PR:

I think the CI failure is not caused by this patch. It seems port :9110 is already in use.

Can we retry this CI?

sctb512 commented 1 week ago

Can we retry this CI?

Otherwise, we can disable the metrics server in the E2E test.

sctb512 commented 1 week ago

The reason the :9110 port is already in use is that the old containerd-nydus-grpc process is not killed successfully.

https://github.com/containerd/nydus-snapshotter/blob/main/integration/entrypoint.sh#L140

thaJeztah commented 1 week ago

Ah! I guess I misinterpreted cfg.MetricsConfig.Address to be at the "global" level (i.e., for containerd as a whole), but it's for this snapshotter.
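
To illustrate the scoping (the struct and field names below are assumptions inferred from cfg.MetricsConfig.Address in the linked code, not copied from the repo):

```go
package config

// Sketch of the configuration layering implied by cfg.MetricsConfig.Address:
// the metrics address is part of the nydus-snapshotter's own config, so each
// snapshotter instance binds its own port, independently of containerd's
// global metrics endpoint.
type MetricsConfig struct {
	Address string `toml:"address"` // e.g. ":9110"; empty disables the server
}

type SnapshotterConfig struct {
	MetricsConfig MetricsConfig `toml:"metrics"`
}
```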

The reason the :9110 port is already in use is that the old containerd-nydus-grpc process is not killed successfully.

Good one; maybe? I don't have permissions to restart CI on this repo; I can do a quick close & reopen to try, or perhaps you're able to kick off only the failing ones.

This failure didn't happen on my other PR, though, and I've seen it fail twice (both before and after rebasing), so it's still possible there's a regression somewhere (or a change in behavior).

thaJeztah commented 1 week ago

Let me just do a close/reopen to get a fresh run