Closed: JY-Lee closed this issue 6 years ago.
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/156001041
The labels on this github issue will be updated when the story is started.
Hi @JY-Lee, it looks like we weren't able to create the xfs filesystem we use to enforce container quotas. Could you check which stemcell version (and especially what kernel is in the stemcell) you are using? Thanks!
Hi @julz, this was with stemcell bosh-vsphere-esxi-ubuntu-trusty-go_agent | ubuntu-trusty | 3468.21, and I am not sure about the kernel as I have deleted diego. The v3445.2 stemcell also gave the same result. The thing is, I am working with bosh version 261 and have succeeded in deploying on OpenStack using the same manifest.
Thank you!!!
Hi @JY-Lee - it looks like the latest vSphere stemcell is 3541.9. Could you please try upgrading the stemcell to the latest version and let me know if the problem still occurs? If it does, could you also please `bosh ssh` into a VM and run `uname -r` to check the kernel version. Thanks!
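As a convenience, the kernel check can be scripted. This is a rough sketch that flags kernels older than 4.4, which is the threshold reported elsewhere in this thread, not an official requirement:

```shell
# Print the running kernel and warn if it is older than 4.4,
# the version reported in this thread as working for grootfs.
kernel="$(uname -r)"
major="${kernel%%.*}"           # e.g. "4" from "4.4.0-116-generic"
rest="${kernel#*.}"
minor="${rest%%.*}"             # e.g. "4"
echo "kernel: $kernel"
if [ "$major" -lt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -lt 4 ]; }; then
  echo "kernel older than 4.4 - grootfs xfs mount may fail"
else
  echo "kernel 4.4 or newer"
fi
```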
Hi @JY-Lee, recently I also came across the same issue. I was using the "3.13.0-142-generic" kernel version, but when I upgraded it to 4.4.0-105-generic, it worked for me.
Thanks
Thank you, @julz and @AmitRoushan. For the v3468.21 stemcell the diego cell kernel version was "4.4.0-111-generic", and for the v3541.9 stemcell the diego cell kernel version was "4.4.0-116-generic".
The deployment did succeed, but only when I deploy it twice (it succeeded only on the second execution), or when some time passes after the error occurs and the cell state in `bosh vms` changes to "running".
Both cases hit the following error on the first deployment:

```
/cb4a956c-ce5d-41cd-ad96-51e90adcab6f:/var/vcap/sys/log/monit$ dmesg | tail
dmesg: klogctl failed: Operation not permitted
/cb4a956c-ce5d-41cd-ad96-51e90adcab6f:/var/vcap/sys/log/monit$ sudo dmesg | tail
[sudo] password for vcap:
[ 839.720424] bridge: automatic filtering via arp/ip/ip6tables has been deprecated. Update your scripts to load br_netfilter if you need this.
[ 839.746738] device w7224skiqi4t-0 entered promiscuous mode
[ 839.746935] wbrdg-0afe0000: port 1(w7224skiqi4t-0) entered forwarding state
[ 839.746946] wbrdg-0afe0000: port 1(w7224skiqi4t-0) entered forwarding state
[ 839.768064] wbrdg-0afe0000: port 1(w7224skiqi4t-0) entered disabled state
[ 839.869393] wbrdg-0afe0000: port 1(w7224skiqi4t-0) entered forwarding state
[ 839.869404] wbrdg-0afe0000: port 1(w7224skiqi4t-0) entered forwarding state
[ 840.057631] wbrdg-0afe0000: port 1(w7224skiqi4t-0) entered disabled state
[ 840.061700] device w7224skiqi4t-0 left promiscuous mode
[ 840.061719] wbrdg-0afe0000: port 1(w7224skiqi4t-0) entered disabled state
/cb4a956c-ce5d-41cd-ad96-51e90adcab6f:/var/vcap/sys/log/monit$ uname -r
4.4.0-116-generic
```
I would like to know if there is any way to make it succeed on the first deployment (deploying only once)?
Thank you very much!!
Also, does the vSphere Standard environment support grootfs? I am curious, as I am working in a vSphere Standard environment for this. Thanks.
Hey @JY-Lee, we are still looking into your initial mounting failure and will let you know if we make any progress. As for garden not succeeding on a first deploy but then succeeding on a second: we have seen recently that some slow-to-deploy environments are hitting a harsh default timeout, and that garden does not report as running in that time, which makes the deployment fail with no visible error. We have bumped this in a newer release. Have you seen any other errors recently?
Hi @Callisto13 , thank you for your time.
As you suggested, I have updated to garden-runc v1.12.1 and tested in a few different environments.
However, it came up with the error below.

```
Are you sure you want to deploy? (type 'yes' to continue): yes

Director task 104
Deprecation: Ignoring cloud config. Manifest contains 'networks' section.

Started preparing deployment > Preparing deployment. Done (00:00:02)

Started preparing package compilation > Finding packages to compile. Done (00:00:00)

Started creating missing vms
  Started creating missing vms > database_z1/ea20cde2-2698-4a3d-81b2-04616c9b3742 (0)
  Started creating missing vms > cc_bridge_z1/0b15bc81-b942-49a3-8620-e0a791a1711d (0)
  Started creating missing vms > route_emitter_z1/b0c4fadf-39e9-4f31-ad91-01819e7f18da (0)
  Started creating missing vms > cell_z1/f81dfd88-fe7b-4c9f-9249-0d3ec7bd0ca8 (0)
  Started creating missing vms > brain_z1/5e6bd0ca-d531-4bf8-95c3-ca5a8f097076 (0)
  Started creating missing vms > access_z1/f5ca3015-c0b2-4173-9850-33d968c3e870 (0)
  Done creating missing vms > database_z1/ea20cde2-2698-4a3d-81b2-04616c9b3742 (0) (00:05:16)
  Done creating missing vms > access_z1/f5ca3015-c0b2-4173-9850-33d968c3e870 (0) (00:05:16)
  Done creating missing vms > brain_z1/5e6bd0ca-d531-4bf8-95c3-ca5a8f097076 (0) (00:05:54)
  Done creating missing vms > cc_bridge_z1/0b15bc81-b942-49a3-8620-e0a791a1711d (0) (00:05:56)
  Done creating missing vms > route_emitter_z1/b0c4fadf-39e9-4f31-ad91-01819e7f18da (0) (00:05:57)
  Done creating missing vms > cell_z1/f81dfd88-fe7b-4c9f-9249-0d3ec7bd0ca8 (0) (00:06:07)
Done creating missing vms (00:06:07)

Started updating instance database_z1 > database_z1/ea20cde2-2698-4a3d-81b2-04616c9b3742 (0) (canary). Done (00:01:24)
Started updating instance brain_z1 > brain_z1/5e6bd0ca-d531-4bf8-95c3-ca5a8f097076 (0) (canary). Done (00:00:44)
Started updating instance cc_bridge_z1 > cc_bridge_z1/0b15bc81-b942-49a3-8620-e0a791a1711d (0) (canary)
Started updating instance route_emitter_z1 > route_emitter_z1/b0c4fadf-39e9-4f31-ad91-01819e7f18da (0) (canary)
Started updating instance access_z1 > access_z1/f5ca3015-c0b2-4173-9850-33d968c3e870 (0) (canary)
Started updating instance cell_z1 > cell_z1/f81dfd88-fe7b-4c9f-9249-0d3ec7bd0ca8 (0) (canary)
Done updating instance route_emitter_z1 > route_emitter_z1/b0c4fadf-39e9-4f31-ad91-01819e7f18da (0) (canary) (00:01:09)
Done updating instance cc_bridge_z1 > cc_bridge_z1/0b15bc81-b942-49a3-8620-e0a791a1711d (0) (canary) (00:01:30)
Done updating instance access_z1 > access_z1/f5ca3015-c0b2-4173-9850-33d968c3e870 (0) (canary) (00:01:40)
Failed updating instance cell_z1 > cell_z1/f81dfd88-fe7b-4c9f-9249-0d3ec7bd0ca8 (0) (canary): 'cell_z1/0 (f81dfd88-fe7b-4c9f-9249-0d3ec7bd0ca8)' is not running after update. Review logs for failed jobs: consul_agent, rep, garden, metron_agent (00:05:26)

Error 400007: 'cell_z1/0 (f81dfd88-fe7b-4c9f-9249-0d3ec7bd0ca8)' is not running after update. Review logs for failed jobs: consul_agent, rep, garden, metron_agent
```

Different from before, you can see the failed job list has changed,
but the `/var/vcap/data/sys/log/monit/garden.err.log` file continues to throw the error below.

```
{"timestamp":"1521677187.187122345","source":"grootfs","message":"grootfs.init-store.store-manager-init-store.overlayxfs-init-filesystem.mounting-filesystem-failed","log_level":2,"data":{"error":"exit status 32: mount: wrong fs type, bad option, bad superblock on /dev/loop0,\n missing codepage or helper program, or other error\n In some cases useful info is found in syslog - try\n dmesg | tail or so\n\n","filesystemPath":"/var/vcap/data/grootfs/store/unprivileged.backing-store","session":"1.1.2","spec":{"UIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"GIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"StoreSizeBytes":17118453760},"storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
{"timestamp":"1521677187.187420130","source":"grootfs","message":"grootfs.init-store.store-manager-init-store.initializing-filesystem-failed","log_level":2,"data":{"backingstoreFile":"/var/vcap/data/grootfs/store/unprivileged.backing-store","error":"Mounting filesystem: exit status 32: mount: wrong fs type, bad option, bad superblock on /dev/loop0,\n missing codepage or helper program, or other error\n In some cases useful info is found in syslog - try\n dmesg | tail or so\n\n","session":"1.1","spec":{"UIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"GIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"StoreSizeBytes":17118453760},"storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
{"timestamp":"1521677187.187511206","source":"grootfs","message":"grootfs.init-store.cleaning-up-store-failed","log_level":2,"data":{"error":"initializing filesyztem: Mounting filesystem: exit status 32: mount: wrong fs type, bad option, bad superblock on /dev/loop0,\n missing codepage or helper program, or other error\n In some cases useful info is found in syslog - try\n dmesg | tail or so\n\n","session":"1"}}
```
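For context on what is failing in these logs: grootfs's overlay-xfs driver creates a file-backed XFS filesystem and loop-mounts it, and the "bad superblock" error comes from that mount step. Below is a minimal sketch of the same sequence, not grootfs's actual code; it assumes root privileges and `xfsprogs` installed, and the paths are made up for illustration:

```shell
# Sketch of the overlay-xfs store setup: sparse backing file -> mkfs.xfs -> loop mount.
# If the kernel cannot mount xfs, this reproduces the same mount error as in the logs.
store=/tmp/grootfs-test.backing-store
mnt=/tmp/grootfs-test-store
truncate -s 512M "$store"      # sparse file, analogous to unprivileged.backing-store
mkfs.xfs -q "$store"           # format the file itself as an xfs filesystem
mkdir -p "$mnt"
mount -t xfs -o loop "$store" "$mnt" && echo "xfs loop mount OK"
umount "$mnt"
rm -f "$store"
```

If the mount fails with "wrong fs type, bad option, bad superblock" here too, the problem is in the kernel's xfs support rather than in garden or grootfs.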
Before, it used to throw this kind of error at least twice; now it only throws it once.
The log sizes below show the change in data volume:

```
garden-runc v1.11.1 == -rw-r--r-- 1 root root 35612 Mar 22 00:06 garden.err.log
garden-runc v1.12.1 == -rw-r--r-- 1 root root 17468 Mar 22 00:06 garden.err.log
```
I suppose the timeout being raised from 30 seconds to 2 minutes in garden-runc-release v1.12.1 has affected the number of errors occurring. If so, is there any way (or option) to adjust this timeout in the diego installation manifest file?
FYI, I have tested adjusting the `garden_healthcheck.timeout` value and it also threw the same error.
Thank you very much!!
Hey @JY-Lee, I suggested the monit timeout as a potential cause for the second issue you mentioned:

> I did success deployment, but it only happens when i deploy it twice. ( succeeded only in second execution )

There is no way to configure the monit timeout from the manifest, I am afraid. The increased monit timeout has produced the unintended side effect of starting the garden ctl script more than once, which is why you are seeing the bad superblock error more frequently now. The `garden_healthcheck.timeout` is a different setting and would not have led to either of your problems.
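For reference, the monit timeout under discussion lives in the garden job's monit control file on the VM, not in the deployment manifest. The stanza looks roughly like this (a sketch from memory; the exact paths and the 120-second value are assumptions, not a quote from the release):

```
check process garden
  with pidfile /var/vcap/sys/run/garden/garden.pid
  start program "/var/vcap/jobs/garden/bin/garden_ctl start" with timeout 120 seconds
  stop program "/var/vcap/jobs/garden/bin/garden_ctl stop"
  group vcap
```

Since BOSH regenerates this file from the release's job templates on every deploy, hand-editing it on the VM is not a durable or supported way to change the timeout.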
We are still trying to reproduce the issue. Is this a production environment?
I have also changed the title of this issue, since the original was very generic and not search-friendly for others who may have come across the same thing.
Hi @Callisto13, this is a testing environment. Thank you very much for your time and effort.
@JY-Lee we are still unable to reproduce, so could you try to deploy again? Then, right after it fails (assuming it fails for the same reason), ssh in and get the following debug information:

- `/var/log/messages` and all of `dmesg` (not with `| tail`). These may be quite long, so please attach them as files.
- `uname -a`
- `blkid`
- `modprobe xfs` and `lsmod | grep xfs`

If `modprobe` exited 0 and `lsmod` returned at least one line, please also get the outputs from:

- `file -s /var/vcap/data/grootfs/store/unprivileged.backing-store`
- `xfs_check /var/vcap/data/grootfs/store/unprivileged.backing-store`
- `cat /proc/self/mountinfo`
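If it helps, the requested outputs could be collected with a small script along these lines (a sketch only; the output directory and archive name are arbitrary choices, and it assumes it is run as root on the failed cell VM):

```shell
#!/bin/sh
# Collect the debug output requested above into one tarball for attaching.
out=/tmp/garden-debug
mkdir -p "$out"
cp /var/log/messages "$out/messages" 2>/dev/null        # may be absent on some stemcells
dmesg > "$out/dmesg.txt" 2>&1                           # full output, not | tail
uname -a > "$out/uname.txt"
blkid > "$out/blkid.txt" 2>&1
modprobe xfs; echo "modprobe xfs exit: $?" > "$out/modprobe.txt"
lsmod | grep xfs > "$out/lsmod-xfs.txt"
file -s /var/vcap/data/grootfs/store/unprivileged.backing-store > "$out/file.txt" 2>&1
cat /proc/self/mountinfo > "$out/mountinfo.txt"
tar -czf /tmp/garden-debug.tgz -C /tmp garden-debug
echo "debug bundle written to /tmp/garden-debug.tgz"
```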
Thanks!
Hello,
When installing diego in a vSphere environment, the following error occurs:

```
Started Updating instance
  Started Updating instance > database_z1/2c9a0212-55c1-4c92-9be7-49a3e1920041 (0) (canary)
  Done Updating instance > database_z1/2c9a0212-55c1-4c92-9be7-49a3e1920041 (0) (canary)
  Started Updating instance > brain_z1/40c56e23-ef82-4542-9ff8-77543f1d6507 (0) (canary)
  Done Updating instance > brain_z1/40c56e23-ef82-4542-9ff8-77543f1d6507 (0) (canary)
  Started Updating instance > cell_z1/ce3f079d-6e98-4ea5-b584-0e28da2eff2b (0) (canary)
  Started Updating instance > cc_bridge_z1/cb6d6806-8c63-415e-83cb-cc0467f20adc (0) (canary)
  Started Updating instance > route_emitter_z1/83a66350-ac04-423f-81cd-df2938d1bfd4 (0) (canary)
  Started Updating instance > access_z1/1c6d1586-9137-4895-a579-03712bcf728c (0) (canary)
  Done Updating instance > route_emitter_z1/83a66350-ac04-423f-81cd-df2938d1bfd4 (0) (canary)
  Done Updating instance > cc_bridge_z1/cb6d6806-8c63-415e-83cb-cc0467f20adc (0) (canary)
  Done Updating instance > access_z1/1c6d1586-9137-4895-a579-03712bcf728c (0) (canary)
  Failed Updating instance > cell_z1/ce3f079d-6e98-4ea5-b584-0e28da2eff2b (0) (canary)

Error Code : 400007, Message :'cell_z1/0 (ce3f079d-6e98-4ea5-b584-0e28da2eff2b)' is not running after update. Review logs for failed jobs: rep, garden, metron_agent
```
The monit message was found as:

```
chmod u+s /var/vcap/packages/grootfs/bin/tardis
{"timestamp":"1521080887.013528109","source":"grootfs","message":"grootfs.init-store.store-manager-init-store.overlayxfs-init-filesystem.mounting-filesystem-failed","log_level":2,"data":{"error":"exit status 32: mount: wrong fs type, bad option, bad superblock on /dev/loop0,\n missing codepage or helper program, or other error\n In some cases useful info is found in syslog - try\n dmesg | tail or so\n\n","filesystemPath":"/var/vcap/data/grootfs/store/unprivileged.backing-store","session":"1.1.2","spec":{"UIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"GIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"StoreSizeBytes":25571164160},"storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
{"timestamp":"1521080887.013750792","source":"grootfs","message":"grootfs.init-store.store-manager-init-store.initializing-filesystem-failed","log_level":2,"data":{"backingstoreFile":"/var/vcap/data/grootfs/store/unprivileged.backing-store","error":"Mounting filesystem: exit status 32: mount: wrong fs type, bad option, bad superblock on /dev/loop0,\n missing codepage or helper program, or other error\n In some cases useful info is found in syslog - try\n dmesg | tail or so\n\n","session":"1.1","spec":{"UIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"GIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"StoreSizeBytes":25571164160},"storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
{"timestamp":"1521080887.013805628","source":"grootfs","message":"grootfs.init-store.cleaning-up-store-failed","log_level":2,"data":{"error":"initializing filesyztem: Mounting filesystem: exit status 32: mount: wrong fs type, bad option, bad superblock on /dev/loop0,\n missing codepage or helper program, or other error\n In some cases useful info is found in syslog - try\n dmesg | tail or so\n\n","session":"1"}}
```
My environment was:

- cf-release version 287
- diego-release version 1.34.0
- garden-runc-release version 1.11.0
- cflinuxfs-release version 1.185.0
Can anyone help with this issue? (The error occurs when installing diego in a vSphere environment.) Thanks in advance. :+1:
Best Regards, JY