dragonflyoss / Dragonfly2

Dragonfly is an open source P2P-based file distribution and image acceleration system. It is hosted by the Cloud Native Computing Foundation (CNCF) as an Incubating Level Project.
https://d7y.io
Apache License 2.0

No traces when pulling images from docker #2728

Open npitsillos opened 1 year ago

npitsillos commented 1 year ago

Bug report:

I deployed Dragonfly on AWS EKS and configured three registries. Refer to the `values.yml` file below.

containerRuntime:
  containerd:
    enable: true
    registries:
      - "https://registry.hubble.jina.ai"
      - "https://253352124568.dkr.ecr.us-east-2.amazonaws.com"
      - "https://docker.io"

manager:
  metrics:
    enable: true

scheduler:
  nodeSelector:
    karpenter.sh/provisioner-name: system
  metrics:
    enable: true

seedPeer:
  nodeSelector:
    karpenter.sh/provisioner-name: system
  metrics:
    enable: true
  persistence:
    storageClass: "ebs-sc"

dfdaemon:
  metrics:
    enable: true

redis:
  enable: true
  global:
    storageClass: "ebs-sc"

mysql:
  enable: true
  global:
    storageClass: "ebs-sc"

jaeger:
  enable: true

For each registry I ran gen-host.sh <host> and set up /etc/containerd/config.toml as shown here, where certs.d contains a directory for each host with its hosts.toml file.

version = 2

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/etc/containerd/certs.d"

When pulling images from Docker, no traces appear in Jaeger, and running kubectl -n dragonfly-system exec -it pod-name -- grep "peer task done" /var/log/dragonfly/daemon/core.log returns no log lines. I am assuming the pod here should be the one running on the same node where the image is pulled.

Here are the logs from the update-containerd container of the dfdaemon pod:

+ etcContainerd=/host/etc/containerd
+ '[[' -e /host/etc/containerd/config.toml ]]
+ echo containerd config found
+ cat /host/etc/containerd/config.toml
containerd config found
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"

[grpc]
address = "/run/containerd/containerd.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"
discard_unpacked_layers = true

[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5"

[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
+ registries='https://registry.hubble.jina.ai https://253352124568.dkr.ecr.us-east-2.amazonaws.com https://index.docker.io'
+ '[[' -n  ]]
+ need_restart=0
+ grep 'version[^=]*=[^2]*2' /host/etc/containerd/config.toml
version = 2
+ cat /host/etc/containerd/config.toml
+ grep config_path
+ awk '{print $3}'
+ tr '"' ' '
config_path is enabled, add mirror in /etc/containerd/certs.d:/etc/docker/certs.d
+ config_path=/etc/containerd/certs.d:/etc/docker/certs.d
+ '[[' -z /etc/containerd/certs.d:/etc/docker/certs.d ]]
+ echo config_path is enabled, add mirror 'in' /etc/containerd/certs.d:/etc/docker/certs.d
+ cat /host/etc/containerd/config.toml
+ awk '{print $3}'
+ tr '"' ' '
+ grep config_path
+ tmp=/etc/containerd/certs.d:/etc/docker/certs.d
+ '[[' -z /etc/containerd/certs.d:/etc/docker/certs.d ]]
+ mkdir -p /host/etc/containerd/certs.d
+ echo https://registry.hubble.jina.ai
+ sed -e 's,http.*://,,'
+ sed 's,:.*,,'
+ domain=registry.hubble.jina.ai
+ mkdir -p /host/etc/containerd/certs.d/registry.hubble.jina.ai
registry https://registry.hubble.jina.ai found in config.toml, skip
+ '[[' -e /host/etc/containerd/certs.d/registry.hubble.jina.ai/hosts.toml ]]
+ echo 'registry https://registry.hubble.jina.ai found in config.toml, skip'
+ continue
+ sed -e 's,http.*://,,'
+ echo https://253352124568.dkr.ecr.us-east-2.amazonaws.com
+ sed 's,:.*,,'
+ domain=253352124568.dkr.ecr.us-east-2.amazonaws.com
+ mkdir -p /host/etc/containerd/certs.d/253352124568.dkr.ecr.us-east-2.amazonaws.com
+ '[[' -e /host/etc/containerd/certs.d/253352124568.dkr.ecr.us-east-2.amazonaws.com/hosts.toml ]]
+ echo 'registry https://253352124568.dkr.ecr.us-east-2.amazonaws.com found in config.toml, skip'
registry https://253352124568.dkr.ecr.us-east-2.amazonaws.com found in config.toml, skip
+ continue
+ echo https://index.docker.io
+ sed -e 's,http.*://,,'
+ sed 's,:.*,,'
+ domain=index.docker.io
+ mkdir -p /host/etc/containerd/certs.d/index.docker.io
registry https://index.docker.io found in config.toml, skip
+ '[[' -e /host/etc/containerd/certs.d/index.docker.io/hosts.toml ]]
+ echo 'registry https://index.docker.io found in config.toml, skip'
+ continue
+ '[[' 0 -gt 0 ]]
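
The trace above derives each certs.d directory name from the registry URL with two sed calls. A minimal sketch that reproduces only that step, so the resulting directory names can be compared with what is on disk (the function name registry_domain is hypothetical and not part of the Dragonfly scripts):

```shell
# Hypothetical helper: mimics the two sed calls visible in the trace.
# Strips the URL scheme, then anything from the first ':' onward (e.g. a port).
registry_domain() {
  echo "$1" | sed -e 's,http.*://,,' | sed 's,:.*,,'
}

# Print the certs.d directory each registry URL would map to.
for r in https://registry.hubble.jina.ai https://index.docker.io https://docker.io; do
  printf '%s -> certs.d/%s\n' "$r" "$(registry_domain "$r")"
done
```

With this derivation, https://index.docker.io maps to certs.d/index.docker.io, whereas the configuration reported to work later in this thread lives under certs.d/docker.io.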

Expected behavior:

Docker images should be pulled through Dragonfly.

How to reproduce it:

Deploy Dragonfly with the Docker registry in mirror mode.

Environment:

npitsillos commented 1 year ago

After looking into this further, I found that traces for pulling images from Docker appear when the registry is configured as follows in /etc/containerd/certs.d/docker.io:

server = "https://docker.io"
[host."http://127.0.0.1:65001"]
  capabilities = ["pull", "resolve"]
  [host."http://127.0.0.1:65001".header]
  X-Dragonfly-Registry = ["https://registry-1.docker.io"]
[host."https://registry-1.docker.io"]
  capabilities = ["pull", "resolve"]

Having set this on every node running the dfdaemon DaemonSet, I still don't see any speedup when pulling images. How can I read the traces to understand whether I have correctly configured mirror mode?
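
One rough way to check the configuration side, separately from reading traces, is to confirm that each hosts.toml actually lists the local proxy as a host. A minimal sketch, assuming the proxy address 127.0.0.1:65001 from the snippet above; the function name has_dragonfly_mirror and the CERTS_DIR variable are hypothetical:

```shell
# Hypothetical helper: succeeds (exit 0) if the given hosts.toml
# declares the local proxy address as a host entry.
has_dragonfly_mirror() {
  grep -qF '[host."http://127.0.0.1:65001"]' "$1"
}

# Report on every registry directory under certs.d.
# The default path is an assumption; override it with CERTS_DIR.
certs_dir="${CERTS_DIR:-/etc/containerd/certs.d}"
for f in "$certs_dir"/*/hosts.toml; do
  [ -e "$f" ] || continue   # skip when the glob matched nothing
  if has_dragonfly_mirror "$f"; then
    echo "mirror configured: $f"
  else
    echo "no mirror entry:   $f"
  fi
done
```

This only confirms the file contents. It does not show that containerd consults that directory or that pull traffic reaches the proxy.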

PKizzle commented 1 year ago

I have set up Prometheus to keep track of the metrics. If an image is pulled via Dragonfly, you should see an increasing number of requests. However, I could not get Dragonfly to work either. While dfget works perfectly fine inside the dfdaemon pod, images seem to be pulled normally without any P2P acceleration 😢

npitsillos commented 1 year ago

That seems to be the case for me as well, @PKizzle. I checked the logs in the seed peer, and it seems that images are pulled from a single source within the P2P network, which could explain the lack of speedup.