cloudfoundry / cf-for-k8s

The open source deployment manifest for Cloud Foundry on Kubernetes
Apache License 2.0

istio-proxy CrashLoopBackOff cf-blobstore-minio, cf-db-postgresql, log-cache #570

Open macevil opened 3 years ago

macevil commented 3 years ago

Describe the bug

Three pods (cf-blobstore-minio, cf-db-postgresql, log-cache) remain in a crash loop during the installation; they hang due to an istio-proxy error. The error is the same on the main branch and on the istio 1.7.4 branch. Shown here for cf-db-postgresql-0:

NAMESPACE            NAME                                                                     READY   STATUS             RESTARTS   AGE
cf-blobstore         cf-blobstore-minio-7467c77cbf-jqk66                                      0/2     CrashLoopBackOff   8          9m28s
cf-db                cf-db-postgresql-0                                                       1/2     CrashLoopBackOff   4          9m28s
cf-system            log-cache-7789cbc896-gmwzf                                               3/5     CrashLoopBackOff   4          9m28s

To Reproduce

Steps to reproduce the behavior:

  1. Install cf-for-k8s with kapp as described here: https://github.com/cloudfoundry/cf-for-k8s/blob/develop/docs/getting-started-tutorial.md
  2. See pod error

Expected behavior

A running cf-for-k8s installation.

Additional context

kubectl describe pod cf-db-postgresql-0 -n cf-db

  Type     Reason               Age        From                                                     Message
  ----     ------               ----       ----                                                     -------
  Normal   Scheduled            <unknown>  default-scheduler                                        Successfully assigned cf-db/cf-db-postgresql-0 to kw-bb79ae50-0000
  Normal   Pulling              4m17s      kubelet, kw-bb79ae50-0000  Pulling image "gcr.io/cf-routing/proxyv2:1.7.4"
  Normal   Created              4m16s      kubelet, kw-bb79ae50-0000  Created container istio-init
  Normal   Started              4m16s      kubelet, kw-bb79ae50-0000  Started container istio-init
  Normal   Pulled               4m16s      kubelet, kw-bb79ae50-0000  Successfully pulled image "gcr.io/cf-routing/proxyv2:1.7.4"
  Warning  FailedPostStartHook  3m15s      kubelet, kw-bb79ae50-0000  Exec lifecycle hook ([pilot-agent wait]) for Container "istio-proxy" in Pod "cf-db-postgresql-0_cf-db(cbe24c2a-4f41-4944-9523-2ed483cd7cc8)" failed - error: command 'pilot-agent wait' exited with 255: Error: timeout waiting for Envoy proxy to become ready. Last error: Get "http://localhost:15021/healthz/ready": dial tcp 127.0.0.1:15021: connect: connection refused
, message: "2020-11-12T18:48:38.803425Z\tinfo\tWaiting for Envoy proxy to be ready (timeout: 60 seconds)...\n2020-11-12T18:49:38.874103Z\terror\ttimeout waiting for Envoy proxy to become ready. Last error: Get \"http://localhost:15021/healthz/ready\": dial tcp 127.0.0.1:15021: connect: connection refused\nError: timeout waiting for Envoy proxy to become ready. Last error: Get \"http://localhost:15021/healthz/ready\": dial tcp 127.0.0.1:15021: connect: connection refused\n"
  Normal   Pulled               3m9s  kubelet, kw-bb79ae50-0000  Container image "index.docker.io/bitnami/postgresql@sha256:0f76a419cfd9996036e3a53672f50cf69ed7699f1241cbf8e20af17bbbdf0683" already present on machine
  Normal   Started              3m9s  kubelet, kw-bb79ae50-0000  Started container cf-db-postgresql
  Normal   Created              3m9s  kubelet, kw-bb79ae50-0000  Created container cf-db-postgresql
  Warning  FailedPostStartHook  2m8s  kubelet, kw-bb79ae50-0000  Exec lifecycle hook ([pilot-agent wait]) for Container "istio-proxy" in Pod "cf-db-postgresql-0_cf-db(cbe24c2a-4f41-4944-9523-2ed483cd7cc8)" failed - error: command 'pilot-agent wait' exited with 255: Error: timeout waiting for Envoy proxy to become ready. Last error: Get "http://localhost:15021/healthz/ready": dial tcp 127.0.0.1:15021: connect: connection refused
, message: "2020-11-12T18:49:45.307327Z\tinfo\tWaiting for Envoy proxy to be ready (timeout: 60 seconds)...\n2020-11-12T18:50:45.381055Z\terror\ttimeout waiting for Envoy proxy to become ready. Last error: Get \"http://localhost:15021/healthz/ready\": dial tcp 127.0.0.1:15021: connect: connection refused\nError: timeout waiting for Envoy proxy to become ready. Last error: Get \"http://localhost:15021/healthz/ready\": dial tcp 127.0.0.1:15021: connect: connection refused\n"
  Normal   Killing  2m8s (x2 over 3m15s)  kubelet, kw-bb79ae50-0000  FailedPostStartHook
  Warning  BackOff  2m1s (x2 over 2m2s)   kubelet, kw-bb79ae50-0000  Back-off restarting failed container
  Normal   Created  109s (x3 over 4m15s)  kubelet, kw-bb79ae50-0000  Created container istio-proxy
  Normal   Pulled   109s (x3 over 4m15s)  kubelet, kw-bb79ae50-0000  Successfully pulled image "gcr.io/cf-routing/proxyv2:1.7.4"
  Normal   Pulling  109s (x3 over 4m15s)  kubelet, kw-bb79ae50-0000  Pulling image "gcr.io/cf-routing/proxyv2:1.7.4"
  Normal   Started  109s (x3 over 4m15s)  kubelet, kw-bb79ae50-0000  Started container istio-proxy

kubectl logs cf-db-postgresql-0 istio-proxy -n cf-db

2020-11-12T19:01:13.718995Z     info    FLAG: --concurrency="2"
2020-11-12T19:01:13.719028Z     info    FLAG: --disableInternalTelemetry="false"
2020-11-12T19:01:13.719032Z     info    FLAG: --domain="cf-db.svc.cluster.local"
2020-11-12T19:01:13.719034Z     info    FLAG: --help="false"
2020-11-12T19:01:13.719035Z     info    FLAG: --id=""
2020-11-12T19:01:13.719037Z     info    FLAG: --ip=""
2020-11-12T19:01:13.719040Z     info    FLAG: --log_as_json="false"
2020-11-12T19:01:13.719042Z     info    FLAG: --log_caller=""
2020-11-12T19:01:13.719045Z     info    FLAG: --log_output_level="default:info"
2020-11-12T19:01:13.719047Z     info    FLAG: --log_rotate=""
2020-11-12T19:01:13.719050Z     info    FLAG: --log_rotate_max_age="30"
2020-11-12T19:01:13.719052Z     info    FLAG: --log_rotate_max_backups="1000"
2020-11-12T19:01:13.719055Z     info    FLAG: --log_rotate_max_size="104857600"
2020-11-12T19:01:13.719058Z     info    FLAG: --log_stacktrace_level="default:none"
2020-11-12T19:01:13.719067Z     info    FLAG: --log_target="[stdout]"
2020-11-12T19:01:13.719070Z     info    FLAG: --meshConfig="./etc/istio/config/mesh"
2020-11-12T19:01:13.719072Z     info    FLAG: --mixerIdentity=""
2020-11-12T19:01:13.719074Z     info    FLAG: --outlierLogPath=""
2020-11-12T19:01:13.719075Z     info    FLAG: --proxyComponentLogLevel="misc:error"
2020-11-12T19:01:13.719077Z     info    FLAG: --proxyLogLevel="warning"
2020-11-12T19:01:13.719079Z     info    FLAG: --serviceCluster="postgresql.cf-db"
2020-11-12T19:01:13.719080Z     info    FLAG: --serviceregistry="Kubernetes"
2020-11-12T19:01:13.719082Z     info    FLAG: --stsPort="0"
2020-11-12T19:01:13.719084Z     info    FLAG: --templateFile=""
2020-11-12T19:01:13.719085Z     info    FLAG: --tokenManagerPlugin="GoogleTokenExchange"
2020-11-12T19:01:13.719087Z     info    FLAG: --trust-domain="cluster.local"
2020-11-12T19:01:13.719115Z     info    Version 1.7.4-4ce531ff1823a3abb9f42fa9d35523b0436e2d04-Clean
2020-11-12T19:01:13.719257Z     info    Obtained private IP [10.233.85.101]
2020-11-12T19:01:13.719327Z     info    Apply proxy config from env {"proxyMetadata":{"DNS_AGENT":""}}

2020-11-12T19:01:13.720167Z     info    Effective config: binaryPath: /usr/local/bin/envoy
concurrency: 2
configPath: ./etc/istio/proxy
controlPlaneAuthPolicy: MUTUAL_TLS
discoveryAddress: istiod.istio-system.svc:15012
drainDuration: 45s
envoyAccessLogService: {}
envoyMetricsService: {}
parentShutdownDuration: 60s
proxyAdminPort: 15000
proxyMetadata:
  DNS_AGENT: ""
serviceCluster: postgresql.cf-db
statNameLength: 189
statusPort: 15020
terminationDrainDuration: 5s
tracing:
  zipkin:
    address: zipkin.istio-system:9411

2020-11-12T19:01:13.720198Z     info    Proxy role: &model.Proxy{Type:"sidecar", IPAddresses:[]string{"10.233.85.101"}, ID:"cf-db-postgresql-0.cf-db", Locality:(*envoy_config_core_v3.Locality)(nil), DNSDomain:"cf-db.svc.cluster.local", ConfigNamespace:"", Metadata:(*model.NodeMetadata)(nil), SidecarScope:(*model.SidecarScope)(nil), PrevSidecarScope:(*model.SidecarScope)(nil), MergedGateway:(*model.MergedGateway)(nil), ServiceInstances:[]*model.ServiceInstance(nil), IstioVersion:(*model.IstioVersion)(nil), ipv6Support:false, ipv4Support:false, GlobalUnicastIP:"", XdsResourceGenerator:model.XdsResourceGenerator(nil), Active:map[string]*model.WatchedResource(nil), ActiveExperimental:map[string]*model.WatchedResource(nil), RequestedTypes:struct { CDS string; EDS string; RDS string; LDS string }{CDS:"", EDS:"", RDS:"", LDS:""}}
2020-11-12T19:01:13.720201Z     info    JWT policy is first-party-jwt
2020-11-12T19:01:13.720234Z     info    PilotSAN []string{"istiod.istio-system.svc"}
2020-11-12T19:01:13.720237Z     info    MixerSAN []string{"spiffe://cluster.local/ns/istio-system/sa/istio-mixer-service-account"}
2020-11-12T19:01:13.720269Z     info    sa.serverOptions.CAEndpoint == istiod.istio-system.svc:15012
2020-11-12T19:01:13.720273Z     info    Using user-configured CA istiod.istio-system.svc:15012
2020-11-12T19:01:13.720275Z     info    istiod uses self-issued certificate
2020-11-12T19:01:13.720312Z     info    the CA cert of istiod is: -----BEGIN CERTIFICATE-----
MIIC3TCCAcWgAwIBAgIQapJx+9diCFILU4GdGIwEOTANBgkqhkiG9w0BAQsFADAY
MRYwFAYDVQQKEw1jbHVzdGVyLmxvY2FsMB4XDTIwMTExMjE4NDgyNFoXDTMwMTEx
MDE4NDgyNFowGDEWMBQGA1UEChMNY2x1c3Rlci5sb2NhbDCCASIwDQYJKoZIhvcN
AQEBBQADggEPADCCAQoCggEBAN/DXVHDimEqNIJrIyw8WPB87RRFEbvtpWxGxtK+
nTQsPTXosCZqelJUdE6sCUD/fCa1AGFa2hbTvSdEGQwi2LeW7U3V7h4RrZkFE/JF
o8uhNeh//AFx3iGlzHrQnu7n5S/DAKrVL/BxOZZIC33Q8gorNeuQoHcb0HkdegTO
vThDNo0iJlbfhUjiGavRuOfC1pSJkXUZWfz++HtpnWb5XX+KI2kf6amd1JB0VROk
AJagGlgXnCyBU7ZwZGVq/uWuOBR1jRIVYx2iTH2d7k2Qm9oC3r/lO4KaxUEStzIF
tL2IzDZrWlTWPfrJAAc/BnEqNw2xdr+T7fw6D5Zf+Ko/DO8CAwEAAaMjMCEwDgYD
VR0PAQH/BAQDAgIEMA8GA1UdEwEB/wQFMAMBAf8wDQYJKoZIhvcNAQELBQADggEB
AGfEzOhlfwZcVTjcVDBnNbj2VGLYQX5ca/VTwWxN4paFHdyKdht+ksiKWl3sYfU9
jypwoKn1E3MYPGpP5KQ9/h/7xm2ugZCUcH+2F9KtfPdpOwinUgZIUR2R2a2M/hAe
YpAzyPLgv+H2AlLhKohCztMhqTjKjPwWzuIyPtnCrHZrFlkVBRpSZWTrkWS8fO21
e44zvxrS3JBWBvsICvE6+BNBhDoa/j3KhsO2eCL70pWhb6OwpwnFyGcD+e9udX4q
ifwbe/RJU0fCKJVvrbI6aF3uov+/3KSB2rPTLllGz38zIHXgne/rltshbGM0sBMZ
dCjVQJyHmGbxrK9x4RQX9sQ=
-----END CERTIFICATE-----

2020-11-12T19:01:13.757302Z     info    sds     SDS gRPC server for workload UDS starts, listening on "./etc/istio/proxy/SDS"

2020-11-12T19:01:13.757417Z     info    Starting proxy agent
2020-11-12T19:01:13.757477Z     info    Opening status port 15020

2020-11-12T19:01:13.757543Z     info    sds     Start SDS grpc server
2020-11-12T19:01:13.757569Z     info    Received new config, creating new Envoy epoch 0
2020-11-12T19:01:13.757631Z     info    Epoch 0 starting
2020-11-12T19:01:15.760254Z     info    Envoy command: [-c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster postgresql.cf-db --service-node sidecar~10.233.85.101~cf-db-postgresql-0.cf-db~cf-db.svc.cluster.local --local-address-ip-version v4 --log-format-prefix-with-location 0 --log-format %Y-%m-%dT%T.%fZ   %l      envoy %n        %v -l warning --component-log-level misc:error --concurrency 2]
2020-11-12T19:01:15.797727Z     warning envoy runtime   Unable to use runtime singleton for feature envoy.reloadable_features.activate_fds_next_event_loop
2020-11-12T19:01:15.828047Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 14, no healthy upstream
2020-11-12T19:01:15.828075Z     warning envoy config    Unable to establish new stream
2020-11-12T19:01:16.156522Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 14, no healthy upstream
2020-11-12T19:01:16.156556Z     warning envoy config    Unable to establish new stream
2020-11-12T19:01:17.989351Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 14, no healthy upstream
2020-11-12T19:01:17.989387Z     warning envoy config    Unable to establish new stream
2020-11-12T19:01:27.533580Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 14, no healthy upstream
2020-11-12T19:01:27.533742Z     warning envoy config    Unable to establish new stream
2020-11-12T19:01:56.955688Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 14, no healthy upstream
2020-11-12T19:01:56.955727Z     warning envoy config    Unable to establish new stream
2020-11-12T19:02:13.468761Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 14, no healthy upstream
2020-11-12T19:02:13.468793Z     warning envoy config    Unable to establish new stream
2020-11-12T19:02:13.928066Z     info    Watcher has successfully terminated
2020-11-12T19:02:13.928062Z     info    Agent draining Proxy
2020-11-12T19:02:13.928053Z     info    Status server has successfully terminated
2020-11-12T19:02:13.928258Z     error   accept tcp [::]:15020: use of closed network connection
2020-11-12T19:02:13.928906Z     info    Graceful termination period is 5s, starting...
2020-11-12T19:02:18.929006Z     info    Graceful termination period complete, terminating remaining proxies.
2020-11-12T19:02:18.929046Z     warn    Aborting epoch 0...
2020-11-12T19:02:18.929050Z     warn    Aborted all epochs
2020-11-12T19:02:18.929052Z     info    Agent has successfully terminated
2020-11-12T19:02:18.929339Z     warning envoy main      caught SIGTERM

Deploy instructions

$ ./hack/generate-values.sh -d vcap.me > ${TMP_DIR}/cf-values.yml
$ cat << EOF >> ${TMP_DIR}/cf-values.yml
app_registry:
  hostname: https://index.docker.io/v1/
  repository_prefix: "<my_username>"
  username: "<my_username>"
  password: "<my_password>"

use_first_party_jwt_tokens: true
EOF

$ kapp deploy -a cf -f <(ytt -f config -f ${TMP_DIR}/cf-values.yml)

Cluster information

kubespray 2.14

CLI versions

  1. ytt --version: 0.30.0
  2. kapp --version: 0.34.0
  3. kubectl version: Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:04:18Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  4. cf version: 7.1.0+4c3168f9a.2020-09-09
cf-gitbot commented 3 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/175707579

The labels on this github issue will be updated when the story is started.

macevil commented 3 years ago

Okay, as far as I can see it's a problem with the network policies; after we (@nrekretep) changed them a bit, all pods start. We are investigating further.
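
A minimal way to enumerate the policies in play, assuming standard kubectl access to the affected namespaces:

# List every NetworkPolicy in the cluster
kubectl get networkpolicies --all-namespaces

# Print the full ingress/egress rule set of the policies in one affected namespace
kubectl describe networkpolicy -n cf-db

The describe output shows which traffic each policy allows, which is what has to line up with the sidecar-to-istiod traffic failing in the logs above.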

kauana commented 3 years ago

Hello @macevil, your issue seems suspiciously similar to this istio issue. I am curious to know what changes you've made to the network policies that made the pods start? Thanks

cc: @astrieanna

macevil commented 3 years ago

Hello @kauana and @astrieanna, we believe that the standard network policy set blocks necessary communication. We deleted network policies one by one until all pods started:

[Screenshot from 2020-11-14 at 23:47:47]

We have now reinstalled CF and are investigating the iptables rules for violations.
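
One way to read those rules back without touching the node, assuming the standard istio sidecar injection used here: the istio-init container prints the iptables rules it applies at startup, so its log shows exactly what the sidecar set up. For the pod from this report:

kubectl logs cf-db-postgresql-0 -c istio-init -n cf-db

Comparing that against what the cluster CNI enforces should help show where the traffic to istiod.istio-system.svc:15012 is being dropped.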

jamespollard8 commented 3 years ago

Hey @macevil, thanks for submitting this. We're wondering what's different about your environment that's causing this problem. We're currently testing across GKE, AKS, EKS, minikube, and kind and not seeing this issue in any of those environments.

We suspect that in addition to use_first_party_jwt_tokens: true, you may also want to try enable_automount_service_account_token: true

Could you please try that out and let us know if you could use additional help from us?
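
A sketch of how the two flags would sit together in the generated cf-values.yml (the app_registry block from the deploy instructions above stays as-is):

use_first_party_jwt_tokens: true
enable_automount_service_account_token: true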

macevil commented 3 years ago

Good morning @jamespollard8, I have additionally set the flag enable_automount_service_account_token: true, but no, the problem remains the same. The application only starts when I delete the network policies.

acosta11 commented 3 years ago

Hi @macevil,

Thanks for giving that a try! With that, we're out of ideas for next troubleshooting steps short of actually digging into kubespray ourselves. Given the prioritization of IaaS-managed clusters, we don't expect to be able to prioritize investigating kubespray further. In the meantime we're leaving this issue open.

We do want to state that while deleting the network policies may get the installation to complete now, we recommend keeping the network policies enabled for any production environments.

Thanks, Andrew and @jamespollard8
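
For anyone who can only get the install through by deleting the policies, a narrower experiment is to add an explicit allow for the flow the Envoy logs show failing: egress to istiod.istio-system.svc on port 15012. A hypothetical policy along these lines (the name and the namespace label are illustrative, not shipped with cf-for-k8s, and it is meant to be added alongside the existing policies, not to replace them):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-istiod-egress          # illustrative name
  namespace: cf-db
spec:
  podSelector: {}                    # every pod in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: istio-system         # assumes istio-system carries a name=istio-system label
    ports:
    - protocol: TCP
      port: 15012

Whether this alone is enough depends on what the kubespray CNI enforces; DNS and the other flows still have to be allowed by the existing policies.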

devops-school commented 3 years ago

Do you have any updates on this issue? I have tried on an Azure VM and experienced the same issue.

6:56:42AM: L ok: waiting on replicaset/cf-blobstore-minio-65cc549448 (apps/v1) namespace: cf-blobstore
6:56:42AM: L ongoing: waiting on pod/cf-blobstore-minio-65cc549448-shqn9 (v1) namespace: cf-blobstore
6:56:42AM:     ^ Condition Ready is not True (False)
6:56:49AM: fail: reconcile deployment/kpack-controller (apps/v1) namespace: kpack
6:56:49AM:  ^ Deployment is not progressing: ProgressDeadlineExceeded (message: ReplicaSet "kpack-controller-6c555dc47f" has timed out progressing.)

kapp: Error: waiting on reconcile deployment/kpack-controller (apps/v1) namespace: kpack: Finished unsuccessfully (Deployment is not progressing: ProgressDeadlineExceeded (message: ReplicaSet "kpack-controller-6c555dc47f" has timed out progressing.))
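
The kpack-controller timeout is a separate failure from the istio-proxy crash loop above; a first step is to check why its ReplicaSet is stuck, for example:

kubectl describe deployment kpack-controller -n kpack
kubectl get pods -n kpack
kubectl logs deployment/kpack-controller -n kpack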

devops-school commented 3 years ago

For me the solution was kind create cluster --config=./deploy/kind/cluster.yml --image kindest/node:v1.20.2 instead of kind create cluster --config=./deploy/kind/cluster.yml --image kindest/node:v1.21.1.

This means the v1.20.2 node image is working.