Closed thaJeztah closed 3 months ago
Hmm... looks like there's failures in CI related to OTEL; looks like multiple metrics servers are tried to be set up on port 9110;
time="2024-06-26T10:15:05.945761805Z" level=info msg="loading plugin \"io.containerd.internal.v1.tracing\"..." type=io.containerd.internal.v1
...
time="2024-06-26T10:15:05.946779174Z" level=error msg="failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
...
umount globally shared mountpoint
umount: /var/lib/containerd-nydus/mnt: no mount point specified.
ls: cannot access '/run/containerd-nydus/containerd-nydus-grpc.sock': No such file or directory
Fail(1). Retrying...
time="2024-06-26T10:15:08.579292914Z" level=info msg="Start nydus-snapshotter. Version: fba89c3, PID: 143, FsDriver: fusedev, DaemonMode: multiple"
time="2024-06-26T10:15:08.581727603Z" level=info msg="Trying to translate bucket records..."
time="2024-06-26T10:15:08.581798015Z" level=info msg="Trying to update bucket records from v1.0 to v1.1 ..."
time="2024-06-26T10:15:08.582908629Z" level=info msg="parsed cgroup config: cgroup.Config{MemoryLimitInBytes:-1}"
time="2024-06-26T10:15:08.582980705Z" level=info msg="cgroup mode: legacy"
time="2024-06-26T10:15:08.583967256Z" level=info msg="create cgroup (v1) successful, state: thawed"
time="2024-06-26T10:15:08.584895731Z" level=info msg="Run daemons monitor..."
time="2024-06-26T10:15:08.585325775Z" level=fatal msg="failed to start nydus-snapshotter" error="failed to initialize snapshotter: start metrics HTTP server: metrics server listener, addr=:9110: listen tcp :9110: bind: address already in use"
ls: cannot access '/run/containerd-nydus/containerd-nydus-grpc.sock': No such file or directory
Looks related to this code, but I'm wondering now if the snapshotted itself should setup a metrics server, or if that's something that only containerd should do https://github.com/containerd/nydus-snapshotter/blob/71ab64ad42d30c9b1538418ba041df0d1191d4f1/snapshot/snapshot.go#L180-L200
Code above was added in https://github.com/containerd/nydus-snapshotter/commit/fe511b89ba033ab189bc2e4b08c7bcfb50bd132b, which is part of this PR:
cc @cpuguy83 @dmcgowan perhaps one of you knows?
Code above was added in fe511b8, which is part of this PR:
I think the CI failure is not caused by this patch. It seems the :9110
is already in use.
Can we retry this CI?
Can we retry this CI?
Otherwise, we can disable metrics server in E2E test.
The reason why the :9110
port is already in use is the old "containerd-nydus-grpc
process is not killed successfully.
https://github.com/containerd/nydus-snapshotter/blob/main/integration/entrypoint.sh#L140
Ah! I guess I misinterpreted the cfg.MetricsConfig.Address
to be at the "global" level (so for containerd as a whole), but it's for this snapshotter.
The reason why the :9110 port is already in use is the old "containerd-nydus-grpc process is not killed successfully.
Good one; maybe? I don't have permissions to restart Ci on this repo; I can do a quick close & reopen to try, or perhaps you're able to kick only the failing ones.
This failure didn't happen on my other PR though, and I've seen it fail twice (both before and after rebasing), so .. it's still possible there's a regression somewhere (or change in behavior)
Let me just do a close/reopen to get a fresh run
The reason why the
:9110
port is already in use is the old"containerd-nydus-grpc
process is not killed successfully.https://github.com/containerd/nydus-snapshotter/blob/main/integration/entrypoint.sh#L140
Facing the same issue. The process gets killed, but it is slow to exit: https://github.com/containerd/nydus-snapshotter/blob/main/integration/entrypoint.sh#L145
0.5s is not enough on my machine.
2s does work better. YMMV
Anyhow, this is inherently racy - maybe we can replace by an explicit check (curl the endpoint maybe?).
0.5s is not enough on my machine.
2s does work better. YMMV
Anyhow, this is inherently racy - maybe we can replace by an explicit check (curl the endpoint maybe?).
Thanks for the tip, I will work on this.
@thaJeztah Sebastiaan, #606 fixes most of the CI issues (except the kubeconf regression) - if you can rebase on it as soon as it merges, that would be really nice.
Would love to see your PR go through ASAP - as it will definitely help with work going on in #604.
Let me already to a quick rebase to get a fresh run ahead of time. @apostasie are there plans to do a release before switching to containerd v2? If so, I can do one more follow-up to get it to containerd v1.7.20 (using the separate API module).
I'm trying to get rid of the "github.com/containerd/containerd/errdefs" package in our dependency tree, and the nydus-snapshotter is one of the last remaining ones using it π
Let me already to a quick rebase to get a fresh run ahead of time. @apostasie are there plans to do a release before switching to containerd v2? If so, I can do one more follow-up to get it to containerd v1.7.20 (using the separate API module).
I'm trying to get rid of the "github.com/containerd/containerd/errdefs" package in our dependency tree, and the nydus-snapshotter is one of the last remaining ones using it π
@thaJeztah I am not a maintainer here, so, do not know about timeline for releases.
My interest is in getting nerdctl to containerd v2 ASAP (we have a working build but it is using my forks: https://github.com/containerd/nerdctl/pull/3173 ), so, I am onto getting dependencies to ctd v2 (on main).
I got stargz-snapshotter, data-accelerator/zdfs, and accelerated-image merged. soci accepted the PR but want to wait for ctd v2 official release (https://github.com/awslabs/soci-snapshotter/pull/1305).
nydus and hcsshim are the last on my list.
@thaJeztah CI fixes got merged. Would you rebase this and get it in? Then I would rebase #604 on top of this.
Rebased π CI looks to be happy now π
Also prepared a follow-up for containerd v1.7.20, which deprecates another package (to help transition to v2); https://github.com/containerd/nydus-snapshotter/pull/610
@imeoer ptal π€
Thx! Rebasing https://github.com/containerd/nydus-snapshotter/pull/610 π
go.mod: github.com/containerd/containerd v1.7.18
full diff: https://github.com/containerd/containerd/compare/v1.7.7...v1.7.18
switch to github.com/containerd/errdefs module
containerd 1.7.18 and up alias the errdefs package to the new module, and deprecate the package.