dragonflyoss / Dragonfly2

Dragonfly is an open source P2P-based file distribution and image acceleration system. It is hosted by the Cloud Native Computing Foundation (CNCF) as an Incubating Level Project.
https://d7y.io
Apache License 2.0

Piece digest mismatch during docker pull #3352

Open wbpascal opened 1 week ago

wbpascal commented 1 week ago

Bug report:

We recently started using Dragonfly for various caching scenarios, but we cannot get it to work with docker pull on the newest Helm chart version (1.1.67 at the time of writing). Docker only prints the error "filesystem layer verification failed...".

None of our Kubernetes nodes have public internet access, so they access all their resources through a shared Artifactory instance. Because of our current use case, we also want to cache docker image pulls that run inside a docker-in-docker container. We followed the Docker Integration Guide from the official documentation and set the proxy of the docker daemon to the IP and port of the local Dragonfly daemon. Additionally, we set a CA cert and key for the daemon and seed clients. However, the newest Helm chart version uses the new Rust client, so I don't know how applicable this guide still is. I would also like to add that caching and pulling already work for the containerd instances of the Kubernetes nodes themselves, although there we use the registryMirror instead of the proxy function.

Because we trace the requests through a Jaeger instance, we can see that the Dragonfly cache is being used, but the download of a cached blob fails in multiple places due to a digest mismatch. The first failure occurs when the local Dragonfly client tries to get an already cached piece from a seed client: it receives the piece successfully but then fails to verify the piece digest. The following log appears in the corresponding Jaeger trace:

key             value
event           download piece finished: piece digest mismatch
code.filepath   dragonfly-client/src/task/piece.rs
code.lineno     393
code.namespace  dragonfly_client::task::piece
level           ERROR
target          dragonfly_client::task::piece
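
To make the failure mode concrete, here is a minimal sketch of per-piece digest verification. It is not taken from the Dragonfly client: the use of SHA-256, the fixed piece size, and the names piece_digests and verify_piece are assumptions chosen for illustration only. The point is that a piece altered in transit by even one byte no longer matches the digest recorded for it.

```python
import hashlib


def piece_digests(data: bytes, piece_size: int) -> list[str]:
    """Split data into fixed-size pieces and return one hex digest per piece.

    SHA-256 and a fixed piece size are assumptions for this sketch; the
    algorithm the Rust client actually uses is not stated in this report.
    """
    return [
        hashlib.sha256(data[i:i + piece_size]).hexdigest()
        for i in range(0, len(data), piece_size)
    ]


def verify_piece(piece: bytes, expected: str) -> bool:
    """Return True if the received piece hashes to the recorded digest."""
    return hashlib.sha256(piece).hexdigest() == expected


# 16 KiB of sample data in which every 4 KiB piece is distinct.
blob = b"".join(i.to_bytes(4, "big") for i in range(4096))
expected = piece_digests(blob, piece_size=4096)

# The second piece as it should arrive, and a copy with its first byte flipped.
good = blob[4096:8192]
corrupted = bytes([good[0] ^ 0xFF]) + good[1:]

print(verify_piece(good, expected[1]))
print(verify_piece(corrupted, expected[1]))
```

If the uploading and downloading side disagree on piece boundaries, encoding, or digest algorithm, the same mismatch would appear even though no byte was corrupted on the wire.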

After a few failures, the local Dragonfly client falls back to a "back-to-source" download, which successfully downloads the data from the Docker registry in our Artifactory instance to local storage. However, when the docker daemon that initially started the pull then downloads the data from the Dragonfly client, the pull fails with the message filesystem layer verification failed for digest sha256:....
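
One way to narrow this down is to fetch the same blob once directly from the registry and once through the Dragonfly proxy, then compare both files against the digest that docker reports. The helper below is a small sketch for that comparison; the function names layer_digest and matches are hypothetical, and it only assumes that the layer digest is the SHA-256 of the blob bytes as served.

```python
import hashlib


def layer_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file from disk and return its digest as 'sha256:<hex>'."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large layers do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()


def matches(path: str, expected_digest: str) -> bool:
    """Return True if the file content hashes to the expected digest."""
    return layer_digest(path) == expected_digest
```

If the file fetched directly matches and the file fetched through the proxy does not, the content is being changed somewhere between the back-to-source download and the response to the docker daemon.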

There was apparently a similar issue a few years ago (#784), where a docker pull also failed due to a sha256 mismatch, but it was fixed for the Go implementation in a pull request soon after. Could this be an issue with the new Rust implementation?

Expected behavior:

Docker pull succeeds without an error.

How to reproduce it:

  1. Create an Artifactory Docker registry, either as a mirror for another Docker registry or as a "local" one to which you push a container image.
    • I am unsure whether this is really required, but I currently have no other Docker registry that our Kubernetes nodes can pull from.
  2. Create a TLS secret called dragonfly-proxy-ca containing the CA key and cert in the dragonfly namespace.
    • Make sure this CA cert is also trusted by the relevant docker-in-docker containers (from where the pull should occur).
  3. Create a values.yaml with the following content:
manager:
  enable: true
  config:
    verbose: true
    console: true
    pprofPort: 18066
    jaeger: "http://jaeger-collector.jaeger.svc:14268/api/traces"

scheduler:
  enable: true
  config:
    verbose: true
    console: true
    pprofPort: 18066
    jaeger: "http://jaeger-collector.jaeger.svc:14268/api/traces"

seedClient:
  enable: true
  persistence:
    enable: false
  metrics:
    enable: true
  extraVolumes:
  - name: proxy-ca-certs
    secret:
      secretName: dragonfly-proxy-ca
  extraVolumeMounts:
  - name: proxy-ca-certs
    mountPath: /usr/share/dragonfly-proxy-ca/
  config:
    verbose: true
    console: true
    pprofPort: 18066
    tracing:
      addr: "jaeger-agent.jaeger.svc:6831"
    download:
      # -- Total download limit per second[B].
      totalRateLimit: 10000000000 # 10Gi
      # -- Per peer task limit per second[B].
      perPeerRateLimit: 5000000000 # 5Gi
    upload:
      # -- Upload limit per second[B].
      rateLimit: 10000000000 # 10Gi
    storage:
      taskExpireTime: 168h # 1 week
    proxy:
      server:
        caCert: /usr/share/dragonfly-proxy-ca/tls.crt
        caKey: /usr/share/dragonfly-proxy-ca/tls.key
      rules:
      # -- Proxy all http image layer download requests with dfget.
      - regex: blobs/sha256.*

client:
  enable: true
  persistence:
    enable: false
  metrics:
    enable: true
  extraVolumes:
  - name: proxy-ca-certs
    secret:
      secretName: dragonfly-proxy-ca
  extraVolumeMounts:
  - name: proxy-ca-certs
    mountPath: /usr/share/dragonfly-proxy-ca/
  config:
    verbose: true
    console: true
    pprofPort: 18066
    tracing:
      addr: "jaeger-agent.jaeger.svc:6831"
    download:
      # -- Total download limit per second[B].
      totalRateLimit: 10000000000 # 10Gi
      # -- Per peer task limit per second[B].
      perPeerRateLimit: 5000000000 # 5Gi
    upload:
      # -- Upload limit per second[B].
      rateLimit: 10000000000 # 10Gi
    storage:
      taskExpireTime: 168h # 1 week
    proxy:
      server:
        caCert: /usr/share/dragonfly-proxy-ca/tls.crt
        caKey: /usr/share/dragonfly-proxy-ca/tls.key
      rules:
      # -- Proxy all http image layer download requests with dfget.
      - regex: blobs/sha256.*
  dfinit:
    enable: true
    config:
      containerRuntime:
        containerd:
          configPath: /var/lib/rancher/rke2/agent/etc/containerd/config.toml
          registries:
          - hostNamespace: artifactory-registry
            serverAddr: https://artifactory-registry
            capabilities: ["pull", "resolve"]
  4. Install the dragonfly helm chart version 1.1.67 with the above values.yaml file into the dragonfly namespace
    • Replace artifactory-registry from the values.yaml with the Artifactory Docker registry address
  5. Start a docker-in-docker container where the dragonfly daemon is used as the proxy in the docker daemon. For example, use the following container snippet:
- name: docker-sidecar
  image: docker:24.0.9-dind
  env:
  - name: NODE_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  command: ["/bin/sh", "-c"]
  args:
    - |
      dockerd --host="unix:///usr/share/pod/docker.sock" \
              --host="unix:///var/run/docker.sock" \
              --cri-containerd \
              --http-proxy "http://${NODE_IP}:4001" \
              --https-proxy "http://${NODE_IP}:4001" \
              --mtu 1450 \
              --network-control-plane-mtu 1450 \
              --default-network-opt bridge=com.docker.network.driver.mtu=1450 &
  securityContext:
    privileged: true
  6. Try to pull an image from the Artifactory registry
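
For reference, the proxy rule blobs/sha256.* from the values.yaml above decides which requests are routed through Dragonfly. The sketch below shows which request URLs it would select, assuming the rule is applied as an unanchored search against the request URL (an assumption, not confirmed behaviour of the client). The URLs and the shortened digest are made-up placeholders.

```python
import re

# The rule as written in values.yaml.
RULE = re.compile(r"blobs/sha256.*")


def is_proxied(url: str) -> bool:
    """Return True if the rule matches anywhere in the URL (assumed semantics)."""
    return RULE.search(url) is not None


# Hypothetical request URLs; the digest is a placeholder, not a real layer.
urls = [
    "https://artifactory-registry/v2/library/alpine/blobs/sha256:0123abcd",
    "https://artifactory-registry/v2/library/alpine/manifests/latest",
]

for url in urls:
    print(url, is_proxied(url))
```

Under that assumption only layer blob downloads go through the peer-to-peer path, while manifest requests are passed through unchanged.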

Environment: