cloudfoundry / cf-for-k8s

The open source deployment manifest for Cloud Foundry on Kubernetes
Apache License 2.0

"info.Labels: label key and value greater than maximum size" on docker pull #614

Open joscha-alisch opened 3 years ago

joscha-alisch commented 3 years ago

Unfortunately, we seem to be hitting a bug similar to #444, but on GKE instead of KinD.

We run cf-for-k8s version 1.1.0 on GKE version 1.18.12-gke.1201 with nodes running on image type cos_containerd.

When we push an app via cf push, we see the following error:

Failed to pull image "gcr.io/PROJECT_ID/cf-workloads/cf-default-builder@sha256:b65a50427c08b4ade4e75582a4d6fa86c6da0eec7c1f1932c1881154c3527fb5": 
[rpc error: code = InvalidArgument desc = failed to pull and unpack image "gcr.io/PROJECT_ID/cf-workloads/cf-default-builder@sha256:b65a50427c08b4ade4e75582a4d6fa86c6da0eec7c1f1932c1881154c3527fb5": failed to prepare extraction snapshot "extract-929517039-zaeh sha256:e693876ebf739f21944936aabae94530f008be3cb7f14f66c4a2f4fd9b4bcf54": 
info.Labels: label key and value greater than maximum size (4096 bytes), key: containerd: invalid argument, rpc error: code = InvalidArgument desc = failed to pull and unpack image "gcr.io/PROJECT_ID/cf-workloads/cf-default-builder@sha256:b65a50427c08b4ade4e75582a4d6fa86c6da0eec7c1f1932c1881154c3527fb5": failed to prepare extraction snapshot "extract-809081699-XqHo sha256:e693876ebf739f21944936aabae94530f008be3cb7f14f66c4a2f4fd9b4bcf54": 
info.Labels: label key and value greater than maximum size (4096 bytes), key: containerd: invalid argument]

When we remove buildpack groups from the cf-default-builder in cf-for-k8s/config/kpack/default-buildpacks.yml (keeping only the go-buildpack, for example), the push works fine. So our assumption is that these buildpack groups add too much metadata to the resulting image.
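
For reference, the trimmed builder ends up looking roughly like this. This is only a sketch assuming the kpack v1alpha1 builder schema; the exact resource kind, buildpack ID, and the other fields (tag, stack, store) in default-buildpacks.yml may differ:

# Sketch only: the kind and buildpack ID here are illustrative; the real
# default-buildpacks.yml also sets tag/stack/store fields that are omitted.
apiVersion: kpack.io/v1alpha1
kind: CustomClusterBuilder        # hypothetical kind, check the actual file
metadata:
  name: cf-default-builder
spec:
  order:
  - group:
    - id: paketo-buildpacks/go    # keep only the Go buildpack group
  # all other buildpack groups removed, which keeps the image label
  # metadata under containerd's 4096-byte limit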

cf-gitbot commented 3 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/176592119

The labels on this GitHub issue will be updated when the story is started.

jamespollard8 commented 3 years ago

Oh interesting - thanks @joscha-alisch for filing this issue!

From what I remember of #444, it looked like this issue was going to be fixed by an update to containerd. So hopefully that will make it into one of the next GKE releases / cos_containerd images.

Until then, you may need to use the default cos (Docker) image type for your GKE cluster.
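
For example, you could add a Docker-based node pool alongside your existing one. Rough sketch only (I haven't run this exact command; the cluster, zone, and pool names are placeholders, so double-check the flags against gcloud container node-pools create --help):

# Sketch: create a node pool using the Docker-based COS image instead of
# cos_containerd. Cluster name, zone, and pool name are placeholders.
gcloud container node-pools create docker-pool \
  --cluster my-cf-cluster \
  --zone us-central1-a \
  --image-type COS \
  --num-nodes 3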

Does that sound reasonable?

jspawar commented 3 years ago

We have now run into this ourselves, and as James said, it is an issue with containerd specifically.

The fix has been made upstream; however, it appears GKE hasn't pulled in the latest version of containerd. We know with reasonable confidence that containerd v1.4.2+ contains the fix [1], but it is unclear whether GKE is using that version (probably not, given that we're still seeing the issue).

Inspecting our GKE clusters' nodes (which are on v1.19.7-gke.1302), we see that containerd v1.4.1 is in use:

System Info:
  Kernel Version:             5.4.49+
  OS Image:                   Container-Optimized OS from Google
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.4.1
  Kubelet Version:            v1.19.7-gke.1302
  Kube-Proxy Version:         v1.19.7-gke.1302
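
In case it's useful to anyone else checking their own clusters, the runtime version can be listed per node with plain kubectl; just a quick sketch:

# Print the container runtime reported by each node; "kubectl get nodes -o wide"
# shows the same information in its CONTAINER-RUNTIME column.
kubectl get nodes \
  -o custom-columns='NODE:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion'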

[1] Version of CRI vendored in containerd v1.4.2: https://github.com/containerd/containerd/blob/b321d358e6eef9c82fa3f3bb8826dca3724c58c6/vendor.conf#L60, which indicates that the rel/1.4 branch of CRI is being used; that branch contains the relevant changes, at least for containerd versions predating the merge of the CRI repo into containerd.