cloudfoundry-incubator / kubecf

Cloud Foundry on Kubernetes
Apache License 2.0

Diego-cell pod failing with "CrashLoopBackOff" #1122

Open ghost opened 4 years ago

ghost commented 4 years ago

Describe the bug
I am trying to deploy KubeCF (v2.2.2) and am running into issues with the diego-cell pod. The garden container keeps failing. Below are the logs for the garden container:

{"timestamp":"1594644900.966236353","source":"grootfs","message":"grootfs.init-store.store-manager-init-store.existing-backing-store-could-not-be-mounted: Mounting filesystem: exit status 32: mount: /var/vcap/data/grootfs/store/unprivileged: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.\n","log_level":1,"data":{"session":"1.1","spec":{"UIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"GIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"StoreSizeBytes":9223372020747599872},"storePath":"/var/vcap/data/grootfs/store/unprivileged"}} {"timestamp":"1594644900.969948530","source":"grootfs","message":"grootfs.init-store.store-manager-init-store.truncating-backing-store-file-failed","log_level":2,"data":{"backingstoreFile":"/var/vcap/data/grootfs/store/unprivileged.backing-store","error":"truncate /var/vcap/data/grootfs/store/unprivileged.backing-store: file too large","session":"1.1","size":9223372020747599872,"spec":{"UIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"GIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"StoreSizeBytes":9223372020747599872},"storePath":"/var/vcap/data/grootfs/store/unprivileged"}} {"timestamp":"1594644900.970129967","source":"grootfs","message":"grootfs.init-store.init-store-failed","log_level":2,"data":{"error":"truncating backing store file: truncate /var/vcap/data/grootfs/store/unprivileged.backing-store: file too large","session":"1"}} truncate /var/vcap/data/grootfs/store/unprivileged.backing-store: file too large image

To Reproduce
Trying to deploy KubeCF on the cluster deployed on BOSH (AWS env). The cluster has privileged containers enabled.
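For context on the two failures in the log: grootfs keeps its unprivileged store in an XFS-formatted loopback file, so the "wrong fs type" mount error usually means the node kernel cannot mount XFS, and the failed truncate to 9223372020747599872 bytes (effectively "unlimited") suggests the underlying filesystem rejects a sparse file that large. Below is a minimal diagnostic sketch; the pod name diego-cell-0 and the container name garden-garden are assumptions (the latter guessed from the job-container naming seen elsewhere in this thread, e.g. bbs-bbs), and it assumes grep/ls/df exist in the image:

```sh
# List the container names in the diego-cell pod (the name used below is an assumption).
kubectl get pod diego-cell-0 -n kubecf -o jsonpath='{.spec.containers[*].name}'; echo

# 1. Does the node kernel know about XFS? (The container shares the node kernel.)
kubectl exec diego-cell-0 -n kubecf -c garden-garden -- grep xfs /proc/filesystems

# 2. Inspect the backing-store file and the filesystem it lives on.
kubectl exec diego-cell-0 -n kubecf -c garden-garden -- \
  ls -lsh /var/vcap/data/grootfs/store/unprivileged.backing-store
kubectl exec diego-cell-0 -n kubecf -c garden-garden -- df -hT /var/vcap/data
```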

andreas-kupries commented 4 years ago

Hi @lordcf

Is it possible for you to retest with a newer release of kubecf? There is kubecf 2.2.3 in CAP 2.0.1.

Further, do I understand "deploy KubeCF on the cluster deployed on BOSH ( aws env)" correctly as

  • You are setting up a BOSH system on AWS, and then deploying kubecf into that BOSH system?

Or are you talking about

  • Setting up a Kubernetes cluster on AWS, and deploying kubecf into that cluster?

As kubecf is Kubernetes-based, I am very confused by the reference to BOSH in your description.

If you have the same issue with 2.2.3 as with 2.2.2, it would be very helpful if you could provide the exact session and steps you used to deploy the cluster and kubecf for us to review.

@gak @satadruroy FYI

fargozhu commented 3 years ago

@lordcf there's a more recent version of KubeCF in case you wanna give it a try and see if your problem has been fixed there.

ghost commented 3 years ago

Hello @gak @satadruroy @fargozhu

We are using the following steps to install KubeCF:

1) Install BOSH
2) Deploy the cluster on BOSH
3) Deploy cf-operator and KubeCF on the cluster (see the sketch below)
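For reference, step 3 is roughly the documented Helm-based install. This is only a sketch: the chart archive paths are placeholders for the cf-operator (quarks) and KubeCF charts matching the release under test, and values.yaml stands for our own settings (system_domain etc.); the exact flags should be double-checked against the release notes for that version:

```sh
# Step 3 sketch: chart archive paths and values.yaml are placeholders.
kubectl create namespace cf-operator
helm install cf-operator ./cf-operator.tgz \
  --namespace cf-operator \
  --set "global.singleNamespace.name=kubecf"

# Once the operator pods are Ready, deploy KubeCF itself.
kubectl create namespace kubecf    # if the operator has not already created it
helm install kubecf ./kubecf.tgz \
  --namespace kubecf \
  --values values.yaml
```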

We tried the KubeCF deployment with v2.2.3 and with v2.3.0; the diego-api pod still goes into the "CrashLoopBackOff" state.

We are getting the error below:

$ kubectl logs diego-api-0 -n kubecf -c bbs-bbs
{"timestamp":"2020-10-15T08:45:28.039378053Z","level":"info","source":"bbs","message":"bbs.starting","data":{}}
Failed 'curl --fail --silent http://0.0.0.0:8890/ping' on attempt 1
{"timestamp":"2020-10-15T08:45:29.086871916Z","level":"fatal","source":"bbs","message":"bbs.sql-failed-to-connect","data":{"error":"dial tcp 10.100.200.113:3306: connect: connection refused","trace":"goroutine 1 [running]:\ncode.cloudfoundry.org/lager.(*logger).Fatal(0xc00025e180, 0xe8cd3f, 0x15, 0xff7180, 0xc0003e8000, 0x0, 0x0, 0x0)\n\t/var/vcap/source/bbs/src/code.cloudfoundry.org/lager/logger.go:138 +0xc6\nmain.main()\n\t/var/vcap/source/bbs/src/code.cloudfoundry.org/bbs/cmd/bbs/main.go:140 +0x49f6\n"}}
Failed 'curl --fail --silent http://0.0.0.0:8890/ping' on attempt 2
panic: dial tcp 10.100.200.113:3306: connect: connection refused

goroutine 1 [running]:
code.cloudfoundry.org/lager.(*logger).Fatal(0xc00025e180, 0xe8cd3f, 0x15, 0xff7180, 0xc0003e8000, 0x0, 0x0, 0x0)
	/var/vcap/source/bbs/src/code.cloudfoundry.org/lager/logger.go:162 +0x582
main.main()
	/var/vcap/source/bbs/src/code.cloudfoundry.org/bbs/cmd/bbs/main.go:140 +0x49f6
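The fatal error itself is just a TCP "connection refused" against the database ClusterIP on port 3306, so before looking at bbs it seems worth confirming whether anything is actually serving that address (a sketch; the exact Service names may differ in our deployment):

```sh
# Which Service owns 10.100.200.113, and does it have ready endpoints?
kubectl get svc -n kubecf | grep 10.100.200.113
kubectl get endpoints -n kubecf | grep -i database
```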

Logs for the database pod:

$ kubectl logs database-0 -n kubecf -c database
I AM database-0 - 10.200.67.13
Warning: resolveip is deprecated and will be removed in a future version.
resolveip: Unable to find hostid for 'database-repl': host not found
I am the Primary Node
Removing pending files in /var/lib/mysql, because sentinel was not reached
Running --initialize-insecure on /var/lib/mysql
total 8.0K
drwxrws--x 2 mysql mysql 6.0K Oct 15 09:04 .
drwxr-xr-x 1 root root 4.0K Feb 25 2020 ..
Finished --initialize-insecure
MySQL init process in progress...
MySQL init process in progress...
MySQL init process in progress...
MySQL init process failed.
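Since the MySQL init never completes, the database pod (rather than bbs) looks like the root cause. A few checks that usually narrow this down, sketched under the assumption that the pod and container names match the command above:

```sh
# Full picture of why mysqld is unhappy: previous restarts, events, storage.
kubectl logs database-0 -n kubecf -c database --previous   # logs from the last restart, if any
kubectl describe pod database-0 -n kubecf                  # events: failing probes, volume mounts, OOM kills
kubectl get pvc -n kubecf                                  # is the MySQL data volume Bound?
```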

viovanov commented 3 years ago

@lordcf it doesn't seem like something we can debug on our end, because you're using a unique setup. The error you're getting points to DNS issues; have you checked that?
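For example, given the resolveip failure for database-repl in the logs above, a quick DNS sanity check could look like this (a sketch: the busybox image tag is arbitrary, and getent being available inside the database image is an assumption):

```sh
# Is cluster DNS (CoreDNS / kube-dns) running?
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Can the database pod resolve its replication peer name? (assumes getent exists in the image)
kubectl exec database-0 -n kubecf -c database -- getent hosts database-repl

# Same check from a throwaway pod, using the fully qualified service name.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.32 -- \
  nslookup database-repl.kubecf.svc.cluster.local
```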