go.mod: github.com/containerd/containerd v1.7.18, switch to github.com/containerd/errdefs module

thaJeztah commented 5 months ago

follow-up to / stacked on https://github.com/containerd/nydus-snapshotter/pull/598

go.mod: github.com/containerd/containerd v1.7.18

full diff: https://github.com/containerd/containerd/compare/v1.7.7...v1.7.18

switch to github.com/containerd/errdefs module

containerd 1.7.18 and up alias the errdefs package to the new module, and deprecate the package.

thaJeztah commented 5 months ago

Hmm... looks like there's failures in CI related to OTEL; looks like multiple metrics servers are tried to be set up on port 9110;

time="2024-06-26T10:15:05.945761805Z" level=info msg="loading plugin \"io.containerd.internal.v1.tracing\"..." type=io.containerd.internal.v1

...

time="2024-06-26T10:15:05.946779174Z" level=error msg="failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"

...

umount globally shared mountpoint
umount: /var/lib/containerd-nydus/mnt: no mount point specified.
ls: cannot access '/run/containerd-nydus/containerd-nydus-grpc.sock': No such file or directory
Fail(1). Retrying...
time="2024-06-26T10:15:08.579292914Z" level=info msg="Start nydus-snapshotter. Version: fba89c3, PID: 143, FsDriver: fusedev, DaemonMode: multiple"
time="2024-06-26T10:15:08.581727603Z" level=info msg="Trying to translate bucket records..."
time="2024-06-26T10:15:08.581798015Z" level=info msg="Trying to update bucket records from v1.0 to v1.1 ..."
time="2024-06-26T10:15:08.582908629Z" level=info msg="parsed cgroup config: cgroup.Config{MemoryLimitInBytes:-1}"
time="2024-06-26T10:15:08.582980705Z" level=info msg="cgroup mode: legacy"
time="2024-06-26T10:15:08.583967256Z" level=info msg="create cgroup (v1) successful, state: thawed"
time="2024-06-26T10:15:08.584895731Z" level=info msg="Run daemons monitor..."
time="2024-06-26T10:15:08.585325775Z" level=fatal msg="failed to start nydus-snapshotter" error="failed to initialize snapshotter: start metrics HTTP server: metrics server listener, addr=:9110: listen tcp :9110: bind: address already in use"
ls: cannot access '/run/containerd-nydus/containerd-nydus-grpc.sock': No such file or directory

thaJeztah commented 5 months ago

Looks related to this code, but I'm wondering now if the snapshotted itself should setup a metrics server, or if that's something that only containerd should do https://github.com/containerd/nydus-snapshotter/blob/71ab64ad42d30c9b1538418ba041df0d1191d4f1/snapshot/snapshot.go#L180-L200

thaJeztah commented 5 months ago

Code above was added in https://github.com/containerd/nydus-snapshotter/commit/fe511b89ba033ab189bc2e4b08c7bcfb50bd132b, which is part of this PR:

https://github.com/containerd/nydus-snapshotter/pull/263

thaJeztah commented 5 months ago

cc @cpuguy83 @dmcgowan perhaps one of you knows?

sctb512 commented 5 months ago

Code above was added in fe511b8, which is part of this PR:

metrics: collect the metrics of nydusd events #263

I think the CI failure is not caused by this patch. It seems the :9110 is already in use.

Can we retry this CI?

sctb512 commented 5 months ago

Can we retry this CI?

Otherwise, we can disable metrics server in E2E test.

sctb512 commented 5 months ago

The reason why the :9110 port is already in use is the old "containerd-nydus-grpc process is not killed successfully.

https://github.com/containerd/nydus-snapshotter/blob/main/integration/entrypoint.sh#L140

thaJeztah commented 5 months ago

Ah! I guess I misinterpreted the cfg.MetricsConfig.Address to be at the "global" level (so for containerd as a whole), but it's for this snapshotter.

The reason why the :9110 port is already in use is the old "containerd-nydus-grpc process is not killed successfully.

Good one; maybe? I don't have permissions to restart Ci on this repo; I can do a quick close & reopen to try, or perhaps you're able to kick only the failing ones.

This failure didn't happen on my other PR though, and I've seen it fail twice (both before and after rebasing), so .. it's still possible there's a regression somewhere (or change in behavior)

thaJeztah commented 5 months ago

Let me just do a close/reopen to get a fresh run

apostasie commented 4 months ago

The reason why the :9110 port is already in use is the old "containerd-nydus-grpc process is not killed successfully.

https://github.com/containerd/nydus-snapshotter/blob/main/integration/entrypoint.sh#L140

Facing the same issue. The process gets killed, but it is slow to exit: https://github.com/containerd/nydus-snapshotter/blob/main/integration/entrypoint.sh#L145

0.5s is not enough on my machine.

2s does work better. YMMV

Anyhow, this is inherently racy - maybe we can replace by an explicit check (curl the endpoint maybe?).

imeoer commented 4 months ago

0.5s is not enough on my machine.

2s does work better. YMMV

Anyhow, this is inherently racy - maybe we can replace by an explicit check (curl the endpoint maybe?).

Thanks for the tip, I will work on this.

apostasie commented 4 months ago

@thaJeztah Sebastiaan, #606 fixes most of the CI issues (except the kubeconf regression) - if you can rebase on it as soon as it merges, that would be really nice.

Would love to see your PR go through ASAP - as it will definitely help with work going on in #604.

thaJeztah commented 4 months ago

Let me already to a quick rebase to get a fresh run ahead of time. @apostasie are there plans to do a release before switching to containerd v2? If so, I can do one more follow-up to get it to containerd v1.7.20 (using the separate API module).

I'm trying to get rid of the "github.com/containerd/containerd/errdefs" package in our dependency tree, and the nydus-snapshotter is one of the last remaining ones using it 😅

apostasie commented 4 months ago

Let me already to a quick rebase to get a fresh run ahead of time. @apostasie are there plans to do a release before switching to containerd v2? If so, I can do one more follow-up to get it to containerd v1.7.20 (using the separate API module).

I'm trying to get rid of the "github.com/containerd/containerd/errdefs" package in our dependency tree, and the nydus-snapshotter is one of the last remaining ones using it 😅

@thaJeztah I am not a maintainer here, so, do not know about timeline for releases.

My interest is in getting nerdctl to containerd v2 ASAP (we have a working build but it is using my forks: https://github.com/containerd/nerdctl/pull/3173 ), so, I am onto getting dependencies to ctd v2 (on main).

I got stargz-snapshotter, data-accelerator/zdfs, and accelerated-image merged. soci accepted the PR but want to wait for ctd v2 official release (https://github.com/awslabs/soci-snapshotter/pull/1305).

nydus and hcsshim are the last on my list.

apostasie commented 3 months ago

@thaJeztah CI fixes got merged. Would you rebase this and get it in? Then I would rebase #604 on top of this.

thaJeztah commented 3 months ago

Rebased 👍 CI looks to be happy now 🎉

Also prepared a follow-up for containerd v1.7.20, which deprecates another package (to help transition to v2); https://github.com/containerd/nydus-snapshotter/pull/610

thaJeztah commented 3 months ago

@imeoer ptal 🤗

thaJeztah commented 3 months ago

Thx! Rebasing https://github.com/containerd/nydus-snapshotter/pull/610 👍

containerd / nydus-snapshotter

go.mod: github.com/containerd/containerd v1.7.18, switch to github.com/containerd/errdefs module #599

go.mod: github.com/containerd/containerd v1.7.18

switch to github.com/containerd/errdefs module