dragonflyoss / Dragonfly2

Dragonfly is an open source P2P-based file distribution and image acceleration system. It is hosted by the Cloud Native Computing Foundation (CNCF) as an Incubating Level Project.
https://d7y.io
Apache License 2.0

No resource sharing observed in peers with helm based release #k8s #1344

Open nitinpatil1992 opened 2 years ago

nitinpatil1992 commented 2 years ago

Bug report:

We have deployed Dragonfly on containerd-based host machines using helm.

~ kgp -n dragonfly-system
NAME                                      READY   STATUS    RESTARTS   AGE
dragonfly-dfdaemon-2mhvp                  1/1     Running   4          5h48m
dragonfly-dfdaemon-jpsz8                  1/1     Running   4          5h48m
dragonfly-dfdaemon-lczff                  1/1     Running   4          5h48m
dragonfly-dfdaemon-lpdxq                  1/1     Running   4          5h48m
dragonfly-dfdaemon-qshhn                  1/1     Running   4          5h48m
dragonfly-dfdaemon-svgjj                  1/1     Running   3          5h48m
dragonfly-dfdaemon-wfwd2                  1/1     Running   4          5h48m
dragonfly-manager-5794bdfff-d6hzv         1/1     Running   0          2d15h
dragonfly-manager-5794bdfff-m244v         1/1     Running   0          2d15h
dragonfly-manager-5794bdfff-vzj6w         1/1     Running   0          2d15h
dragonfly-mysql-688dc67dcf-28fg6          1/1     Running   0          2d15h
dragonfly-redis-master-654c7d645b-mm29v   1/1     Running   0          2d15h
dragonfly-scheduler-0                     1/1     Running   0          2d15h
dragonfly-scheduler-1                     1/1     Running   0          5h45m
dragonfly-scheduler-2                     1/1     Running   0          5h34m
dragonfly-seed-peer-0                     1/1     Running   3          2d15h
dragonfly-seed-peer-1                     1/1     Running   0          5h43m
dragonfly-seed-peer-2                     1/1     Running   0          5h45m

But when we pull the image on one of the boxes, the sibling boxes don't appear to have pulled the image via peers.

# ctr image pull docker.io/library/alpine:3.9
docker.io/library/alpine:3.9:                                                     resolved       |++++++++++++++++++++++++++++++++++++++|
index-sha256:414e0518bb9228d35e4cd5165567fb91d26c6a214e9c95899e1e056fcd349011:    done           |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:65b3a80ebe7471beecbc090c5b2cdd0aafeaefa0715f8f12e40dc918a3a70e32: done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:31603596830fc7e56753139f9c2c6bd3759e48a850659506ebfb885d1cf3aef5:    done           |++++++++++++++++++++++++++++++++++++++|
config-sha256:78a2ce922f8665f5a227dc5cd9fda87221acba8a7a952b9665f99bc771a29963:   done           |++++++++++++++++++++++++++++++++++++++|
elapsed: 2.2 s                                                                    total:  3.6 Ki (1.6 KiB/s)
unpacking linux/amd64 sha256:414e0518bb9228d35e4cd5165567fb91d26c6a214e9c95899e1e056fcd349011...
done
~ k exec -it dragonfly-dfdaemon-jpsz8  -n dragonfly-system -c dfdaemon -- grep "peer task done" /var/log/dragonfly/daemon/core.log

Here is the containerd config:

# cat /etc/containerd/config.toml
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"

[grpc]
address = "/run/containerd/containerd.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"

[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "<aws-registry>/eks/pause:3.1-eksbuild.1"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/etc/containerd/certs.d"

There are no logs of peers pulling the images in any of the daemonset pods.

Expected behavior:

Log entries should appear when grepping for "peer task done":

grep "peer task done" /var/log/dragonfly/daemon/core.log

How to reproduce it:

  1. Deploy the helm chart in an EKS environment
  2. Wait for the dragonfly resources to run
  3. Grep the core logs in the daemonset pods

Environment:

czomo commented 2 years ago

Unfortunately, I am facing the same issue. I mentioned it here.

jim3ma commented 2 years ago

Can you paste the files in /etc/containerd/certs.d? This directory contains the image registry mirror configuration.

Example: https://d7y.io/docs/setup/runtime/containerd/mirror#option-2-multiple-registries

For docker.io,

/etc/containerd/certs.d/docker.io/hosts.toml

server = "https://index.docker.io"

[host."http://127.0.0.1:65001"]
  capabilities = ["pull"]
  [host."http://127.0.0.1:65001".header]
    X-Dragonfly-Registry = ["https://index.docker.io"]
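
The same layout repeats per registry: one directory under /etc/containerd/certs.d per registry host. As a sketch (hypothetical, following the docker.io example above), an entry for ghcr.io would be:

```toml
# /etc/containerd/certs.d/ghcr.io/hosts.toml — sketch, same pattern as above;
# assumes the dfdaemon proxy listens on the chart default 127.0.0.1:65001
server = "https://ghcr.io"

[host."http://127.0.0.1:65001"]
  capabilities = ["pull", "resolve"]
  [host."http://127.0.0.1:65001".header]
    X-Dragonfly-Registry = ["https://ghcr.io"]
```
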
czomo commented 2 years ago

How about the single-registry option (Version 2 config without config_path)? Is it supported? In my case there is nothing under /etc/containerd/ other than config.toml (config-kops.yaml).

jim3ma commented 2 years ago

How about the single-registry option (Version 2 config without config_path)? Is it supported? In my case there is nothing under /etc/containerd/ other than config.toml (config-kops.yaml).

Yes, follow this: https://d7y.io/docs/setup/runtime/containerd/mirror/#option-1-single-registry

czomo commented 2 years ago

I did that. The effects are similar to what @nitinpatil1992 wrote. Also deployed with helm. Here is my config.toml:

version = 2

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]
    sandbox_image = "k8s.gcr.io/pause:3.6"

    [plugins."io.containerd.grpc.v1.cri".containerd]

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry-1.docker.io"]
  endpoint = ["http://127.0.0.1:65001","https://registry-1.docker.io"]
Environment:

Dragonfly version: v2.0.2/v2.0.3 
OS: Ubuntu 20.04.3 LTS 
Kernel (e.g. uname -a): 5.11.0-1021-aws #22~20.04.2-Ubuntu SMP Wed Oct 27 21:27:13 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Other: containerd://1.4.12

My Helm values:

containerRuntime:
  containerd:
    enable: true
    configFileName: "config-kops.toml"
manager:
  ingress:
    enable: true
    className: private
    hosts:
      - "dragonfly.example.com"
    tls:
      - secretName: secure-tls
        hosts:
          - "dragonfly.example.com"
cdn:
  enable: true

Is there anything I can provide to point us down the correct path?

jim3ma commented 2 years ago

Did you restart the containerd daemon?

jim3ma commented 2 years ago

Per https://github.com/containerd/containerd/blob/main/docs/cri/registry.md, the mirror config is:

[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
...

not registry-1.docker.io
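
Applied to the config above, a corrected stanza might look like this (a sketch: the mirror key must match the image reference host, while the upstream endpoint stays registry-1.docker.io):

```toml
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
  # dfdaemon proxy first, upstream Docker Hub registry as fallback
  endpoint = ["http://127.0.0.1:65001", "https://registry-1.docker.io"]
```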

czomo commented 2 years ago

Did you restart the containerd daemon?

Yeah, it is done by the helm chart itself:

            if [[ "$need_restart" -gt 0 ]]; then
              nsenter -t 1 -m systemctl -- restart containerd.service
            fi

https://github.com/dragonflyoss/helm-charts/blob/b0bd87eeecb56da480161b8ba491acc3573be835/charts/dragonfly/templates/dfdaemon/dfdaemon-daemonset.yaml#L413

not registry-1.docker.io

Bingo! It seems that the nodes started exchanging blobs. :man_facepalming: Now I need to set up auth by providing docker credentials.

Regarding the missing "peer task done" logs: how about documenting it better in the official tutorial for the helm chart? I can add a note here. Wdyt, @jim3ma?

jim3ma commented 2 years ago

It seems that containerd did not restart. You can check the logs of the update-containerd container in the dfdaemon pod.

greenhandatsjtu commented 2 years ago

I met a similar issue and followed the Containerd > Version 2 config with config_path instructions to set up the registry; then it worked well. This is what my /etc/containerd/certs.d/docker.io/hosts.toml looks like:

server = "https://registry-1.docker.io"
[host."http://localhost:65001"]
  capabilities = ["pull"]
  skip_verify = true

Then I pulled the image using ctr, specifying hosts-dir:

ctr images pull --hosts-dir "/etc/containerd/certs.d" docker.io/library/alpine:latest

When the pull finished, I could find the related logs in dfdaemon. (screenshot of the logs omitted)

This issue comment may be helpful: https://github.com/containerd/containerd/issues/5407#issuecomment-825322092
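
For CRI-based pulls (kubelet/crictl) to read the same certs.d without passing --hosts-dir on every ctr invocation, containerd 1.5+ can be pointed at the directory via config_path, as in the config earlier in this thread. A minimal fragment:

```toml
# /etc/containerd/config.toml — fragment; CRI pulls will then read
# /etc/containerd/certs.d/<registry>/hosts.toml
version = 2

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/etc/containerd/certs.d"
```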

nitinpatil1992 commented 2 years ago

@jim3ma here is what our certs.d looks like:

ls -al /etc/containerd/certs.d
total 0
drwxr-xr-x 5 root root 62 Jun  9 09:52 .
drwxr--r-- 3 root root 40 Jun  9 09:52 ..
drwxr-xr-x 2 root root 24 Jun  9 09:52 ghcr.io
drwxr-xr-x 2 root root 24 Jun  9 09:52 harbor.example.com
drwxr-xr-x 2 root root 24 Jun  9 09:52 quay.io
$ cat /etc/containerd/certs.d/quay.io/hosts.toml
server = "https://quay.io"
[host."http://127.0.0.1:65001"]
  capabilities = ["pull", "resolve"]
  [host."http://127.0.0.1:65001".header]
  X-Dragonfly-Registry = ["https://quay.io"]
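
One observation (mine, not from the thread): the certs.d listing above has no docker.io directory, yet the earlier test pull was docker.io/library/alpine:3.9, so that pull would bypass the dfdaemon proxy entirely. A sketch of the missing file, following jim3ma's earlier example:

```toml
# /etc/containerd/certs.d/docker.io/hosts.toml — sketch, mirrors the quay.io entry above
server = "https://index.docker.io"

[host."http://127.0.0.1:65001"]
  capabilities = ["pull", "resolve"]
  [host."http://127.0.0.1:65001".header]
    X-Dragonfly-Registry = ["https://index.docker.io"]
```
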

@czomo can you please share your full containerd config? Also, did you use the localhost endpoint to pull the image, or the actual docker host name?

nitinpatil1992 commented 2 years ago

Also noticed in the dfget config under the daemon that the download settings have port 65000, but I couldn't find out where this is being exposed/used.

download:
  calculateDigest: true
  downloadGRPC:
    security:
      insecure: true
    unixListen:
      socket: /tmp/dfdamon.sock
  peerGRPC:
    security:
      insecure: true
    tcpListen:
      listen: 0.0.0.0
      port: 65000
  perPeerRateLimit: 100Mi
  totalRateLimit: 200Mi
czomo commented 2 years ago

@jim3ma here is what our certs.d looks like:

ls -al /etc/containerd/certs.d
total 0
drwxr-xr-x 5 root root 62 Jun  9 09:52 .
drwxr--r-- 3 root root 40 Jun  9 09:52 ..
drwxr-xr-x 2 root root 24 Jun  9 09:52 ghcr.io
drwxr-xr-x 2 root root 24 Jun  9 09:52 harbor.example.com
drwxr-xr-x 2 root root 24 Jun  9 09:52 quay.io
$ cat /etc/containerd/certs.d/quay.io/hosts.toml
server = "https://quay.io"
[host."http://127.0.0.1:65001"]
  capabilities = ["pull", "resolve"]
  [host."http://127.0.0.1:65001".header]
  X-Dragonfly-Registry = ["https://quay.io"]

@czomo can you please share your full containerd config? Also, did you use the localhost endpoint to pull the image, or the actual docker host name?

I am using containerd 1.4.12 (1.5+ has a slightly different structure), hence there is no hosts.toml/certs.d and I am restricted to mirroring only one registry. This is what my final, full config looks like. It works; however, I am hitting the pull rate limit (~35 nodes, 5k pods). I will be working on adding auth to it in the following week.

version = 2

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]
    sandbox_image = "k8s.gcr.io/pause:3.6"

    [plugins."io.containerd.grpc.v1.cri".containerd]

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
  endpoint = ["http://127.0.0.1:65001","https://docker.io"]

Also, did you use the localhost endpoint to pull the image, or the actual docker host name?

Not sure about this one, but probably the localhost endpoint 127.0.0.1:65001 as above.

TomasKohout commented 1 year ago

Dragonfly version: 2.0.7; helm chart: 0.8.7

I don't know if I hit the same issue, but I was able to make image pulls work for a private registry. Unfortunately, the tasks are not distributed across dfdaemon agents: peer tasks only occur in the dfdaemon agent where I trigger the pull via crictl, and I'm stuck on this.

My config for containerd:

[plugins."io.containerd.grpc.v1.cri".registry.configs."127.0.0.1:65001".auth]
  auth = "********"
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."my-private-registry.example.com"]
  endpoint = ["http://127.0.0.1:65001","https://my-private-registry.example.com"]

dfdaemon conf:

aliveTime: 0s
gcInterval: 1m0s
keepStorage: false
workHome: 
logDir: 
cacheDir: 
pluginDir: 
dataDir: /var/lib/dragonfly
console: true
health:
  path: /server/ping
  tcpListen:
    port: 40901
verbose: false
jaeger: http://dragonfly-jaeger-collector.dragonfly-system.svc.cluster.local:14268/api/traces
scheduler:
  manager:
    enable: true
    netAddrs:
      - addr: dragonfly-manager.dragonfly-system.svc.cluster.local:65003
        type: tcp
    refreshInterval: 5m
  scheduleTimeout: 30s
  disableAutoBackSource: false
  seedPeer:
    clusterID: 1
    enable: false
    type: super
host:
  idc: ""
  location: ""
  netTopology: ""
  securityDomain: ""
download:
  calculateDigest: true
  concurrent:
    goroutineCount: 10
    initBackoff: 0.5
    maxAttempts: 3
    maxBackoff: 3
    thresholdSize: 100M
    thresholdSpeed: 200M
  downloadGRPC:
    security:
      insecure: true
      tlsVerify: false
    unixListen:
      socket: /run/dragonfly/dfdaemon.sock
  peerGRPC:
    security:
      insecure: true
    tcpListen:
      port: 65000
  perPeerRateLimit: 512Mi
  prefetch: false
  totalRateLimit: 1024Mi
upload:
  rateLimit: 1024Mi
  security:
    insecure: true
    tlsVerify: false
  tcpListen:
    port: 65002
objectStorage:
  enable: false
  filter: Expires&Signature&ns
  maxReplicas: 3
  security:
    insecure: true
    tlsVerify: true
  tcpListen:
    port: 65004
storage:
  diskGCThreshold: 50Gi
  multiplex: true
  strategy: io.d7y.storage.v2.simple
  taskExpireTime: 6h
proxy:
  defaultFilter: Expires&Signature&ns
  defaultTag: 
  tcpListen:
    port: 65001
  security:
    insecure: true
    tlsVerify: false
  registryMirror:
    dynamic: false
    insecure: false
    url: https://my-private-registry.example.com
  proxies:
    - regx: blobs/sha256.*
security:
  autoIssueCert: false
  caCert: ""
  certSpec:
    validityPeriod: 4320h
  tlsPolicy: prefer
  tlsVerify: false
network:
  enableIPv6: false
announcer:
  schedulerInterval: 30s

scheduler conf:

server:
  port: 8002
  workHome: 
  logDir: 
  cacheDir: 
  pluginDir: 
  dataDir: 
scheduler:
  algorithm: default
  backSourceCount: 3
  gc:
    hostGCInterval: 1h
    peerGCInterval: 10s
    peerTTL: 24h
    taskGCInterval: 30m
  retryBackSourceLimit: 5
  retryInterval: 50ms
  retryLimit: 10
dynconfig:
  refreshInterval: 10s
  type: manager
host:
  idc: ""
  location: ""
  netTopology: ""
manager:
  addr: dragonfly-manager.dragonfly-system.svc.cluster.local:65003
  schedulerClusterID: 1
  keepAlive:
    interval: 5s
seedPeer:
  enable: true
job:
  redis:
    addrs:
    - dragonfly-redis-master.dragonfly-system.svc.cluster.local:6379
    host: dragonfly-redis-master.dragonfly-system.svc.cluster.local
    port: 6379
    password: dragonfly
storage:
  bufferSize: 100
  maxBackups: 10
  maxSize: 100
security:
  autoIssueCert: false
  caCert: ""
  certSpec:
    validityPeriod: 4320h
  tlsPolicy: prefer
  tlsVerify: false
network:
  enableIPv6: false
metrics:
  enable: false
  addr: ":8000"
  enablePeerHost: false
console: true
verbose: false
jaeger: http://dragonfly-jaeger-collector.dragonfly-system.svc.cluster.local:14268/api/traces

manager conf:

server:
  rest:
    addr: :8080
  grpc:
    port:
      start: 65003
      end: 65003
  workHome: 
  logDir: 
  cacheDir: 
  pluginDir: 
database:
  mysql:
    user: dragonfly
    password: dragonfly
    host: dragonfly-mysql.dragonfly-system.svc.cluster.local
    port: 3306
    dbname: manager
    migrate: true
  redis:
    addrs:
    - dragonfly-redis-master.dragonfly-system.svc.cluster.local:6379
    host: dragonfly-redis-master.dragonfly-system.svc.cluster.local
    port: 6379
    password: dragonfly
cache:
  local:
    size: 10000
    ttl: 10s
  redis:
    ttl: 30s
objectStorage:
  accessKey: ""
  enable: false
  endpoint: ""
  name: s3
  region: ""
  secretKey: ""
security:
  autoIssueCert: false
  caCert: ""
  caKey: ""
  certSpec:
    dnsNames:
    - dragonfly-manager
    - dragonfly-manager.dragonfly-system.svc
    - dragonfly-manager.dragonfly-system.svc.cluster.local
    ipAddresses: null
    validityPeriod: 87600h
  tlsPolicy: prefer
network:
  enableIPv6: false
metrics:
  enable: false
  addr: ":8000"
console: true
verbose: false
jaeger: http://dragonfly-jaeger-collector.dragonfly-system.svc.cluster.local:14268/api/traces