cloudfoundry / cf-for-k8s

The open source deployment manifest for Cloud Foundry on Kubernetes
Apache License 2.0

`cf push` fails on KinD: cf-default-builder image "labels" size greater than max allowed size #444

Closed: jamespollard8 closed this issue 4 years ago

jamespollard8 commented 4 years ago

Duplicate of https://github.com/pivotal/kpack/issues/473

Describe the bug

We've been seeing consistent failures in our CI when running smoke tests on KinD. After debugging, we found this image pull failure:

Failed to pull image "gcr.io/.../cf-default-builder@sha256:977feb...": rpc error: code = InvalidArgument desc = failed to pull and unpack image "gcr.io/cf-relint-greengrass/cf-workloads/cf-default-builder@sha256:977febd...": failed to prepare extraction snapshot "extract-712270633-CRQK sha256:f0dcba2...": info.Labels: label key and value greater than maximum size (4096 bytes), key: containerd: invalid argument

From the smoke test output, this manifests as hanging after these lines:

Staging app and tracing logs...
   Loading secret for "gcr.io" from secret "cc-kpack-registry-auth-secret-ver-1" at location "/var/build-secrets/cc-kpack-registry-auth-secret-ver-1"
   Successfully downloaded cf-blobstore-minio.cf-blobstore.svc.cluster.local:9000/cc-packages/d7/d0/d7d01fef-dfdd-44df-8a3b-a03e40379c29 in path "/workspace"
... <- hangs here

To Reproduce

Steps to reproduce the behavior:

  1. Set up a KinD cluster with Kubernetes version 1.18.4 or later and connect to it
  2. Follow the docs to install cf-for-k8s on KinD: https://github.com/cloudfoundry/cf-for-k8s/blob/master/docs/deploy-local.md#steps-to-deploy-on-kind
  3. Run smoke tests or cf push

Expected behavior

Smoke tests succeed on KinD.

Additional context

As a workaround, we've added an overlay that removes all non-Node.js buildpacks from the cf-default-builder (commit 59fa844).

cf-gitbot commented 4 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/174833518

The labels on this github issue will be updated when the story is started.

jamespollard8 commented 4 years ago

Copying over some comments we'd written up on the kpack issue:

@davewalter said: We tried deploying on Kind with K8s v1.16.9 on GCP and the same version of cf-for-k8s. When we pushed our test node app to the platform, we were initially surprised to see that the digest for the cf-default-builder image was identical to the one created with K8s v1.18.6, but on reflection, this seems to make sense, given that the builder/stack and store definitions are all identical.

We next tried removing all of the buildpacks except the Node.js buildpack, which is required for our sample app. For reference, here are the store/stack/builder definitions we deployed:

#@ load("@ytt:data", "data")

---
apiVersion: experimental.kpack.pivotal.io/v1alpha1
kind: Store
metadata:
  name: cf-buildpack-store
spec:
  sources:
  - image: gcr.io/paketo-buildpacks/nodejs@sha256:7110ff41a35ec4d8a0fbb63e7b292c2edc7ef0e072e542cd0a58e5d179ce2605

---
apiVersion: experimental.kpack.pivotal.io/v1alpha1
kind: Stack
metadata:
  name: cflinuxfs3-stack
spec:
  id: "org.cloudfoundry.stacks.cflinuxfs3"
  buildImage:
    image: "gcr.io/paketo-buildpacks/build@sha256:84f7b60192e69036cb363b2fc7d9834cff69dcbcf7aaf8c058d986fdee6941c3"
  runImage:
    image: "gcr.io/paketo-buildpacks/run@sha256:84f7b60192e69036cb363b2fc7d9834cff69dcbcf7aaf8c058d986fdee6941c3"

---
apiVersion: experimental.kpack.pivotal.io/v1alpha1
kind: CustomBuilder
metadata:
  name: cf-default-builder
  namespace: #@ data.values.staging_namespace
spec:
  tag: #@ "{}/cf-default-builder".format(data.values.app_registry.repository_prefix)
  serviceAccount: cc-kpack-registry-service-account
  stack: cflinuxfs3-stack
  store: cf-buildpack-store
  order:
  - group:
    - id: paketo-buildpacks/nodejs

This did result in a new builder image, which had much smaller annotations:

[
    {
        ...
        "Config": {
            ...
            "Labels": {
                "io.buildpacks.builder.metadata": "{\"description\":\"Custom Builder built with kpack\",\"stack\":{\"runImage\":{\"image\":\"gcr.io/paketo-buildpacks/run@sha256:84f7b60192e69036cb363b2fc7d9834cff69dcbcf7aaf8c058d986fdee6941c3\",\"mirrors\":null}},\"lifecycle\":{\"version\":\"0.8.1\",\"api\":{\"buildpack\":\"0.2\",\"platform\":\"0.3\"}},\"createdBy\":{\"name\":\"kpack CustomBuilder\",\"version\":\"v0.0.10 (git sha: 68925eaca94becfeef006c413ebac4fde559e66c)\"},\"buildpacks\":[{\"id\":\"paketo-buildpacks/node-engine\",\"version\":\"0.0.260\",\"homepage\":\"https://github.com/paketo-buildpacks/node-engine\"},{\"id\":\"paketo-buildpacks/yarn-install\",\"version\":\"0.1.86\",\"homepage\":\"https://github.com/paketo-buildpacks/yarn-install\"},{\"id\":\"paketo-buildpacks/npm\",\"version\":\"0.1.79\",\"homepage\":\"https://github.com/paketo-buildpacks/npm\"},{\"id\":\"paketo-buildpacks/nodejs\",\"version\":\"0.0.5\",\"homepage\":\"https://github.com/paketo-buildpacks/nodejs\"}]}",
                "io.buildpacks.buildpack.layers": "{\"paketo-buildpacks/node-engine\":{\"0.0.260\":{\"api\":\"0.2\",\"layerDiffID\":\"sha256:7d57d604f4efbf533639810fc3d590b2c334382097f4cdd4c3d5bcfaf8b1bd15\",\"stacks\":[{\"id\":\"io.buildpacks.stacks.bionic\"},{\"id\":\"org.cloudfoundry.stacks.cflinuxfs3\"}],\"homepage\":\"https://github.com/paketo-buildpacks/node-engine\"}},\"paketo-buildpacks/nodejs\":{\"0.0.5\":{\"api\":\"0.2\",\"layerDiffID\":\"sha256:71dd90c6ed436af5fd5027789c76ef7bc19d71ea2b46ec602dbdbe0c6e7ee9af\",\"order\":[{\"group\":[{\"id\":\"paketo-buildpacks/node-engine\",\"version\":\"0.0.260\"},{\"id\":\"paketo-buildpacks/yarn-install\",\"version\":\"0.1.86\"}]},{\"group\":[{\"id\":\"paketo-buildpacks/node-engine\",\"version\":\"0.0.260\"},{\"id\":\"paketo-buildpacks/npm\",\"version\":\"0.1.79\"}]}],\"homepage\":\"https://github.com/paketo-buildpacks/nodejs\"}},\"paketo-buildpacks/npm\":{\"0.1.79\":{\"api\":\"0.2\",\"layerDiffID\":\"sha256:687909d9abefc647892cbe213a425cf37097f5cd96eb79987f5fab4eb8b6af18\",\"stacks\":[{\"id\":\"org.cloudfoundry.stacks.cflinuxfs3\"},{\"id\":\"io.buildpacks.stacks.bionic\"}],\"homepage\":\"https://github.com/paketo-buildpacks/npm\"}},\"paketo-buildpacks/yarn-install\":{\"0.1.86\":{\"api\":\"0.2\",\"layerDiffID\":\"sha256:25eefd03f95bf374fa44cadd0fc7a9c7ef943ba1a7faa0fc90379786e0c07b21\",\"stacks\":[{\"id\":\"org.cloudfoundry.stacks.cflinuxfs3\"},{\"id\":\"io.buildpacks.stacks.bionic\"}],\"homepage\":\"https://github.com/paketo-buildpacks/yarn-install\"}}}",
                "io.buildpacks.buildpack.order": "[{\"group\":[{\"id\":\"paketo-buildpacks/nodejs\",\"version\":\"0.0.5\"}]}]",
                "io.buildpacks.stack.id": "org.cloudfoundry.stacks.cflinuxfs3"
            }
        },
        "Architecture": "amd64",
        "Os": "linux",
        "Size": 1085920101,
        "VirtualSize": 1085920101,
        "GraphDriver": {
            "Data": {
                "LowerDir": "/var/lib/docker/overlay2/ed68aeaecb6b741619bb19701aaa44de0ffec69b0c9ae506925df5885fcf89db/diff:/var/lib/docker/overlay2/c0d8b46fab8cc1bd50fafb8c488bff78dfded88e7fdac52fbf3d03a6f61dc5bd/diff:/var/lib/docker/overlay2/c055453ec8328a06aea7e48fa15532749c111b36b9de976263bf51ca351451d2/diff:/var/lib/docker/overlay2/edd6efe0267833bbce1f84157c7ec0cac1cdac63e51a6d825793c84d6bd3d328/diff:/var/lib/docker/overlay2/14b725c618f9984cdbd685ca3b475ee39cad8598a5dd246411bb1db5d97ddad3/diff:/var/lib/docker/overlay2/d1999b90d1a3b57d9c63e0167e1602ed090ec4e05d9e3642fbc8961a67be3a35/diff:/var/lib/docker/overlay2/4616983ec5476539812eab442d439f404364f9c8e70c7278f9958542159ea939/diff:/var/lib/docker/overlay2/1cc6f315cab22e27897441b185a7d637764766565c7f719c702a9c780b3cc18d/diff",
                "MergedDir": "/var/lib/docker/overlay2/36d812a5c093893dfc5308adb9433af8d977568e342c5f4b509b7a6a77008f39/merged",
                "UpperDir": "/var/lib/docker/overlay2/36d812a5c093893dfc5308adb9433af8d977568e342c5f4b509b7a6a77008f39/diff",
                "WorkDir": "/var/lib/docker/overlay2/36d812a5c093893dfc5308adb9433af8d977568e342c5f4b509b7a6a77008f39/work"
            },
            "Name": "overlay2"
        },
        "RootFS": {
            "Type": "layers",
            "Layers": [
                "sha256:f0dcba2cbeedc3702b4f390963e697b06360502302ec52e0d63bcfa7235220d1",
                "sha256:7755b972f0b4f49de73ef5114fb3ba9c69d80f217e80da99f56f0d0a5dcb3d70",
                "sha256:c33cb7212a62bb159674eae50b7ace60bb7c73c70ea6f8597c72fff10189d78f",
                "sha256:7d57d604f4efbf533639810fc3d590b2c334382097f4cdd4c3d5bcfaf8b1bd15",
                "sha256:25eefd03f95bf374fa44cadd0fc7a9c7ef943ba1a7faa0fc90379786e0c07b21",
                "sha256:687909d9abefc647892cbe213a425cf37097f5cd96eb79987f5fab4eb8b6af18",
                "sha256:71dd90c6ed436af5fd5027789c76ef7bc19d71ea2b46ec602dbdbe0c6e7ee9af",
                "sha256:3ae93fd59e3e53f9d0cbea42625e12139a7fb8a59915dc23b0219a01a44e018f",
                "sha256:a7ce12a420636d1b1726e67e67055a3ead31453018be5c32019420cb9e3786ba"
            ]
        },
        "Metadata": {
            "LastTagTime": "0001-01-01T00:00:00Z"
        }
    }
]

We're not sure how this information helps, given that we are still unsure as to exactly which part of the image containerd is complaining about.
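
For anyone repeating this comparison, here is a minimal sketch of how the label sizes in `docker inspect` output could be measured against the 4096-byte figure from the error message. It assumes the JSON shape shown above (a list whose first element has `Config.Labels`); the function names and the `example.big.label` key are made up for illustration. It only measures labels stored in the image config and says nothing about labels a container runtime might derive internally.

```python
import json

# Limit quoted in the containerd error message earlier in this thread.
CONTAINERD_LABEL_LIMIT = 4096


def label_sizes(inspect_output: str) -> dict:
    """Map each image label key to len(key) + len(value) in bytes.

    Expects the JSON printed by `docker inspect <image>`: a list whose
    first element carries the labels under Config.Labels.
    """
    image = json.loads(inspect_output)[0]
    labels = image.get("Config", {}).get("Labels") or {}
    return {
        key: len(key.encode("utf-8")) + len(value.encode("utf-8"))
        for key, value in labels.items()
    }


def oversized_labels(inspect_output: str, limit: int = CONTAINERD_LABEL_LIMIT) -> list:
    """Return the label keys whose key+value size exceeds the limit."""
    sizes = label_sizes(inspect_output)
    return sorted(key for key, size in sizes.items() if size > limit)


if __name__ == "__main__":
    # Tiny stand-in document; the second label is invented to trip the check.
    sample = json.dumps([{"Config": {"Labels": {
        "io.buildpacks.stack.id": "org.cloudfoundry.stacks.cflinuxfs3",
        "example.big.label": "x" * 5000,
    }}}])
    for key, size in label_sizes(sample).items():
        print(f"{key}: {size} bytes")
    print("over limit:", oversized_labels(sample))
```

In practice the input would be the saved output of `docker inspect` for the builder image rather than the inline sample.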

jamespollard8 commented 4 years ago

We were able to isolate the problem to Kubernetes v1.18.x by pushing the cf-default-builder image to our public Docker Hub repository and creating a simple deployment:

kubectl create deployment test --image relintdockerhubpushbot/cf-default-builder-test

When we inspect the resulting pod, we see the following events:

Events:
  Type     Reason     Age   From                         Message
  ----     ------     ----  ----                         -------
  Normal   Scheduled  13s   default-scheduler            Successfully assigned default/test-5dd6b896f8-6g459 to kind-control-plane
  Normal   Pulling    12s   kubelet, kind-control-plane  Pulling image "relintdockerhubpushbot/cf-default-builder-test"
  Warning  Failed     11s   kubelet, kind-control-plane  Failed to pull image "relintdockerhubpushbot/cf-default-builder-test": rpc error: code = InvalidArgument desc = failed to pull and unpack image "docker.io/relintdockerhubpushbot/cf-default-builder-test:latest": failed to prepare extraction snapshot "extract-345901367-cGhX sha256:ac0538e4e603b4d6027dabc72660256eafd203fac1be9e1ca15c7f3f2ce837d5": info.Labels: label key and value greater than maximum size (4096 bytes), key: containerd: invalid argument
  Warning  Failed     11s   kubelet, kind-control-plane  Error: ErrImagePull
  Normal   BackOff    11s   kubelet, kind-control-plane  Back-off pulling image "relintdockerhubpushbot/cf-default-builder-test"
  Warning  Failed     11s   kubelet, kind-control-plane  Error: ImagePullBackOff
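
Since the smoke tests only hang rather than report this failure, a CI job could scan pod event messages for it. The following is a rough sketch, assuming the message wording stays exactly as in the events above; `label_size_limit_hit` is a hypothetical helper name, not part of any existing tool.

```python
import re
from typing import Optional

# Pattern taken from the wording of the kubelet event message in this thread.
LABEL_SIZE_ERROR = re.compile(
    r"label key and value greater than maximum size \((\d+) bytes\)"
)


def label_size_limit_hit(message: str) -> Optional[int]:
    """Return the reported limit in bytes if the message is the
    label-size pull failure, otherwise None."""
    match = LABEL_SIZE_ERROR.search(message)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    event = (
        'Failed to pull image "relintdockerhubpushbot/cf-default-builder-test": '
        "rpc error: code = InvalidArgument desc = failed to pull and unpack image: "
        "info.Labels: label key and value greater than maximum size (4096 bytes), "
        "key: containerd: invalid argument"
    )
    print(label_size_limit_hit(event))
    print(label_size_limit_hit("Error: ErrImagePull"))
```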

We confirmed that downgrading our Kind cluster to Kubernetes version v1.17.5 allowed the test deployment pod to successfully pull the same image. The "no command specified" failures below occur after the pull, which itself succeeds:

Events:
  Type     Reason            Age                  From                         Message
  ----     ------            ----                 ----                         -------
  Warning  FailedScheduling  2m42s                default-scheduler            0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
  Normal   Scheduled         2m37s                default-scheduler            Successfully assigned default/test-67f7dd9596-fxbhm to kind-control-plane
  Warning  Failed            2m9s                 kubelet, kind-control-plane  Error: failed to generate container "7d899e03e56ea344029e64e6f265ae48e0f880ed2f3a4d5dbfd59d7dd09cf9df" spec: no command specified
  Warning  Failed            2m8s                 kubelet, kind-control-plane  Error: failed to generate container "6260f8d08c1b56ba0987dbe812b877a1d3f864ee325de4a3a249bea159ce8ddb" spec: no command specified
  Warning  Failed            102s                 kubelet, kind-control-plane  Error: failed to generate container "18619d20a98400c35c8f44b6aab0c1da02b2eb9b9f8f114812867a019e771e0d" spec: no command specified
  Warning  Failed            87s                  kubelet, kind-control-plane  Error: failed to generate container "9704bd6d0e6afb70d6c8acfb0c9fe01a51df712d279ce7c081a50bb65ac0485a" spec: no command specified
  Warning  Failed            72s                  kubelet, kind-control-plane  Error: failed to generate container "72a321264c7fb3e5025bf20ed80a559c4a4c84dca4544f41ee37990c2b99e843" spec: no command specified
  Warning  Failed            56s                  kubelet, kind-control-plane  Error: failed to generate container "ca8c53eadd200ac9d31809ff0295608a550cbb1dc01b772ec112dc28f39c4185" spec: no command specified
  Warning  Failed            44s                  kubelet, kind-control-plane  Error: failed to generate container "2519e695108f2a8e8744083375c417c38952e8ff8346fb23af0c91be1564ebbf" spec: no command specified
  Normal   Pulled            32s (x8 over 2m9s)   kubelet, kind-control-plane  Successfully pulled image "relintdockerhubpushbot/cf-default-builder-test"
  Warning  Failed            32s                  kubelet, kind-control-plane  Error: failed to generate container "07ec32b8539949e2837f6f16dc614fba33491fed70e9c944922c3ab558836805" spec: no command specified
  Normal   Pulling           19s (x9 over 2m36s)  kubelet, kind-control-plane  Pulling image "relintdockerhubpushbot/cf-default-builder-test"

Given that this is with the same version of Kind (and therefore, we assume, containerd), our current working hypothesis is that an error check that was previously missing was added in Kubernetes v1.18, and that the underlying error was always being generated but went unchecked.

We will leave the public image available for y'all to test with. It was created with this set of store/stack/builder definitions.

jamespollard8 commented 4 years ago

We bisected the versions of K8s and confirmed that this error shows up for us if we are running KinD with a K8s version of v1.18.4 or above (specified using the --image flag). Running v1.18.2, or any earlier minor version of K8s, does not exhibit this problem, even with the full store/stack/builder definition currently in cf-for-k8s.
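
For reference, the bisection can be sketched as a plain binary search over an ordered version list. This is only an illustration: `first_failing` and `pull_fails` are hypothetical names, and the stub below replays the results reported in this thread instead of creating a KinD cluster with each node image and attempting the pull. It also assumes that once a version fails, every later version fails too.

```python
from typing import Callable, Optional, Sequence


def first_failing(versions: Sequence[str],
                  fails: Callable[[str], bool]) -> Optional[str]:
    """Binary search for the earliest failing version.

    `versions` must be sorted oldest to newest, and failures must be
    monotonic (all versions after the first failing one also fail).
    """
    lo, hi = 0, len(versions)
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(versions[mid]):
            hi = mid          # the first failure is at mid or earlier
        else:
            lo = mid + 1      # the first failure is after mid
    return versions[lo] if lo < len(versions) else None


if __name__ == "__main__":
    # K8s versions mentioned in this thread, oldest first.
    versions = ["v1.16.9", "v1.17.5", "v1.18.2", "v1.18.4", "v1.18.6"]

    # Stub standing in for "create a KinD cluster with this node image
    # and try to pull the builder image".
    observed_failures = {"v1.18.4", "v1.18.6"}

    def pull_fails(version: str) -> bool:
        return version in observed_failures

    print(first_failing(versions, pull_fails))
```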

jamespollard8 commented 4 years ago

One last test we ran before "end of day" was to reduce the size of the annotations on the cf-default-builder image by removing the ruby buildpack from the store/builder lists, leaving the other seven buildpacks, which allowed us to successfully push an app. When we pulled and inspected the builder image, we saw that the two keys mentioned in the original issue description were shorter, but were still much longer than the 4096-byte limit mentioned in the error message.

We're not sure what that means, but it leads me to speculate that it is not the annotations we can see that are causing the problem, and that K8s/Kind/Containerd is manipulating them in some way that is pushing us over the limit.

davewalter commented 4 years ago

We have since increased the size of the cf-default-builder image by adding the procfile buildpack to each language-specific group in the builder and are now seeing this issue on the latest patch releases of KinD v1.16 and v1.17, as well as v1.19, which was recently released.

ericpromislow commented 4 years ago

Source of the error message:

https://github.com/jessvalarezo/containerd/blob/18c4322bb3ddcb8f8b4eea2f3c027a06194041b4/labels/validate.go

Recent fix to containerd/cri: https://github.com/containerd/cri/pull/1572 (merged Sept 15, 2020)

From this PR, the math behind the breakage:

In containerd, there is a size limit for label size (4096 chars). If an image has many layers (> (4096-39)/72 > 56), containerd.io/snapshot/cri.image-layers will hit the limit of label size and the unpack will fail because the annotation will be passed to the snapshotter as a label.
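
The arithmetic in that quote can be checked with a short sketch. It assumes the label value is a comma-separated list of `sha256:` digests (one reading of the 72-bytes-per-layer figure: 7 + 64 characters plus a separator) and that the check rejects a label when key length plus value length exceeds 4096. The function names are illustrative only.

```python
LABEL_LIMIT = 4096  # limit quoted in the PR description and the error message
LAYERS_LABEL_KEY = "containerd.io/snapshot/cri.image-layers"  # 39 characters
DIGEST_LEN = len("sha256:") + 64  # one layer digest, 71 characters


def layers_label_size(layer_count: int) -> int:
    """Size of key + value when the value lists `layer_count` digests
    joined by commas (assumed encoding, see the note above)."""
    if layer_count <= 0:
        return len(LAYERS_LABEL_KEY)
    value_len = layer_count * DIGEST_LEN + (layer_count - 1)  # digests + commas
    return len(LAYERS_LABEL_KEY) + value_len


def max_layers(limit: int = LABEL_LIMIT) -> int:
    """Largest layer count whose label still fits within the limit."""
    count = 0
    while layers_label_size(count + 1) <= limit:
        count += 1
    return count


if __name__ == "__main__":
    print(len(LAYERS_LABEL_KEY))   # 39
    print(max_layers())            # 56
    print(layers_label_size(56))   # 4070, fits
    print(layers_label_size(57))   # 4142, over the limit
```

Under these assumptions an image with 57 or more layers would trip the limit, which matches the "> 56" in the quote. It would also explain why the label sizes visible in `docker inspect` did not line up with the error.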

Related: https://github.com/containerd/stargz-snapshotter/pull/148/files

I've asked on the kube/kind slack channel what we need to do to get the fix under PR 1572:

https://kubernetes.slack.com/archives/CEKK1KTN2/p1600364196431100

  1. New build https://github.com/kind-ci/containerd-nightlies/releases/tag/containerd-1.4.0-85-gd6774b63 has this fix

ericpromislow commented 4 years ago

https://github.com/containerd/stargz-snapshotter/issues/144#issuecomment-694601447

That comment reports fixes in containerd/containerd and containerd/cri, along with some TOML configuration, but I assume we're dependent on a new version of kind pulling in these changes.

BenTheElder commented 4 years ago

https://github.com/kubernetes-sigs/kind/releases/tag/v0.9.0#breaking-changes I think the workaround added to the end there should be sufficient for now, but if not we can expedite a fix release.

jamespollard8 commented 4 years ago

Great - thanks again @BenTheElder!