cloudfoundry / bosh

Cloud Foundry BOSH is an open source tool chain for release engineering, deployment and lifecycle management of large scale distributed services.
https://bosh.io
Apache License 2.0

Updating VMs fails due to lack of space in /var/vcap/data #2178

Closed jhvhs closed 4 years ago

jhvhs commented 5 years ago

**Is your feature request related to a problem? Please describe.**
When the deployed VM uses a large release, updating it after a long uptime may fail with an unclear error message. In our case the failure was caused by the fact that both packages and system logs are stored under the fixed-size /var/vcap/data partition.

Our VM was using the following releases:

Once the logs directory contained several hundred megabytes of data, the upgrade invariably failed with the error message `Response exceeded maximum allowed length`, which was in fact caused by a crash due to insufficient disk space.

**Describe the solution you'd like**
The ideal solution would be to change the way new releases are applied to the VM during an update:

  1. The agent determines which packages actually need updating, instead of blindly installing all of the packages.
  2. The agent verifies whether there is sufficient space to perform the update (a rough sketch of such a check follows this list), and if there is insufficient space:
     i. If enough space could be reclaimed by wiping out the logs, tmp and friends, do that.
     ii. If there is no way to recover enough space, nuke the VM and create a new one with the new release packages installed.
     iii. If there is no way the packages could fit into /var/vcap/data, fail with a clear error message explaining the problem.
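
A minimal sketch of the kind of pre-flight check described in point 2, assuming (hypothetically) that the incoming compiled packages are staged somewhere on disk before being applied, and that comparing their `du` size against the free space reported by `df` for /var/vcap/data is a good-enough approximation:

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: refuse to apply new packages unless
# /var/vcap/data has enough free space to hold the staged payload.
set -euo pipefail

STAGED_PACKAGES="${1:?path to staged compiled packages}"  # hypothetical staging dir

# Size of the incoming packages, in KiB.
needed_kb=$(du -sk "$STAGED_PACKAGES" | awk '{print $1}')

# Free space on the ephemeral data partition, in KiB.
free_kb=$(df -Pk /var/vcap/data | awk 'NR==2 {print $4}')

if (( free_kb < needed_kb )); then
  echo "Not enough space on /var/vcap/data: need ${needed_kb}K, have ${free_kb}K" >&2
  echo "Clear /var/vcap/data/tmp and old logs, or recreate the VM" >&2
  exit 1
fi
```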

**Describe alternatives you've considered**
For us, running `bosh recreate` to clean up the VM was sufficient, once we had spent about an hour digging through the logs to discover the issue.
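
For reference, the workaround amounts to something like this (the deployment and instance names below are placeholders):

```bash
# Recreate the affected instance from scratch; the VM is deleted and rebuilt,
# so /var/vcap/data starts empty and only the current packages are reinstalled.
bosh -d my-deployment recreate mysql/0

# Or recreate every instance in the deployment.
bosh -d my-deployment recreate
```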

**Additional context**
This is the layout of /var/vcap/data/packages on the VMs:

| size before the update | size on failed VM 1 | size on failed VM 2 | package name |
|---|---|---|---|
| 14M | 14M | 14M | blackbox |
| 41M | 41M | 41M | bosh-dns |
| 18M | 38M | 38M | bpm |
| 9.4M | 19M | 19M | configure-leader-follower |
| 460M | 460M | 460M | dd-agent |
| 2.2M | 4.3M | 4.3M | generate-auto-tune-mysql |
| 30M | 30M | 30M | loggregator_agent |
| 12M | 24M | 24M | mysql-agent |
| 13M | 26M | 26M | mysql-metrics |
| 6.8M | 14M | 14M | mysql-restore |
| 445M | 892M | 445M (insufficient space) | percona-server |
| 12K | 20K | 20K | pid-utils |
| 24M | 50M | 50M | service-backup |
| 66M | 132M | 132M | service-backup_aws-cli |
| 225M | 450M | 450M | service-backup_blobxfer |
| 190M | 379M | 379M | service-backup_python |
| 12K | 20K | 20K | service-backup_utils |
| 7.7M | 16M | 16M | streaming-mysql-backup-client |
| 7.7M | 16M | 16M | streaming-mysql-backup-tool |
| 56M | 111M | 111M | thermostat |
| 562M | 562M (insufficient space) | 1.1G | xtrabackup |
| **2.2Gb** | **3.4Gb** | **3.5Gb** | **total space used** |

According to `df -h` on the xenial stemcell 170.48, the partition mounted at /var/vcap/data is 3.9Gb.

cf-gitbot commented 5 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/165675450

The labels on this github issue will be updated when the story is started.

bengtrj commented 5 years ago

Hi, I'm interested in this issue as well. I had a similar problem upgrading PKS: the `upgrade-all-clusters` errand failed and the bosh task showed this cryptic error:

```
Error: Action Failed get_task: Task 68ef0800-7df3-439a-7cd5-bc141ec2dc08 result: 1 of 5 post-start scripts failed. Failed Jobs: sink-resources-images. Successful Jobs: bosh-dns, telemetry-agent-image, wavefront-proxy-images, kubelet.
```

To diagnose it, I had to ssh into the worker VM and run the job's post-start script again, which showed:

```
worker/bfc02afd-f283-4c35-94c1-3b6ba26d6198:/var/vcap/jobs/sink-resources-images# bin/post-start
[Thu May  2 15:34:46 UTC 2019 Loading cached container: /var/vcap/packages/sink-agent/container-images/oratos_cert-generator.v0.12.tgz
Loaded image: oratos/cert-generator:v0.12
(...)
[Thu May  2 15:34:53 UTC 2019 Loading cached container: /var/vcap/packages/sink-agent/container-images/oratos_fluent-bit-out-syslog.v0.11.1.tgz

gzip: /var/vcap/packages/sink-agent/container-images/oratos_fluent-bit-out-syslog.v0.11.1.tar: No space left on device
```

Out of curiosity:

```
worker/bfc02afd-f283-4c35-94c1-3b6ba26d6198:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        2.0G     0  2.0G   0% /dev
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           2.0G  221M  1.8G  12% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
/dev/sda1       2.8G  1.4G  1.3G  54% /
/dev/sda3       4.1G  4.1G     0 100% /var/vcap/data
tmpfs           1.0M   32K  992K   4% /var/vcap/data/sys/run
/dev/sdb1        50G  3.2G   44G   7% /var/vcap/store
```

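For reference, the diagnosis above boils down to roughly these steps (the deployment name is a placeholder; the instance GUID is the one from the output above):

```bash
# SSH into the affected worker VM through the BOSH director.
bosh -d my-pks-cluster ssh worker/bfc02afd-f283-4c35-94c1-3b6ba26d6198

# On the VM: re-run the failing post-start script to surface the real error,
# then check how full the ephemeral data partition is.
sudo -i
cd /var/vcap/jobs/sink-resources-images && bin/post-start
df -h /var/vcap/data
```
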
cjnosal commented 5 years ago

The bosh director can't provide a clear message or check for 'sufficient space' because it doesn't know the implementation of individual bosh releases. There may be enough space for the bosh-agent to download the package blob, but bosh can't know how much space a running job will consume (e.g. in this case downloading container images). bosh-agent runs the scripts provided by the release, and reports back the exit code.

Bosh should not indiscriminately delete logs. Log retention and forwarding should be managed by the operator, e.g. by using the syslog release.
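
As a rough illustration rather than an official recipe (the release version, addon name, and syslog endpoint below are placeholders, and property names may vary between syslog-release versions), forwarding job logs off the VM typically looks something like this:

```bash
# Upload the syslog release and attach its forwarder job to every VM via a
# runtime config, so logs are shipped off the ephemeral disk instead of
# accumulating under /var/vcap/data/sys/log.
bosh upload-release https://bosh.io/d/github.com/cloudfoundry/syslog-release

cat > syslog-runtime-config.yml <<'EOF'
releases:
- name: syslog
  version: <uploaded version>

addons:
- name: forward-logs
  jobs:
  - name: syslog_forwarder
    release: syslog
    properties:
      address: logs.example.com   # placeholder syslog endpoint
      port: 514
      transport: udp
EOF

# On newer directors, --name keeps this as a separate named runtime config.
bosh update-runtime-config --name=syslog syslog-runtime-config.yml
```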

There is also a bosh metrics forwarder for monitoring VM resources.

mrosecrance commented 4 years ago

Closing as there are other supported ways to detect these issues.

asayles commented 2 years ago

@mrosecrance What are they?

mrosecrance commented 2 years ago

We recommend users use something like the bosh metrics forwarder to monitor VM resources. I'm fairly sure (I don't have an env to double-check right this moment) that changing the ephemeral disk size also changes the size of /var/vcap/data, so operators could set up alerts, see that the ephemeral disk is filling up, and bump the disk size.
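
As a hedged illustration of that kind of monitoring (the deployment name is a placeholder), the current usage of /var/vcap/data can be checked across all instances with the bosh CLI; growing the VM type's ephemeral disk in the cloud config then takes effect on the next deploy:

```bash
# Print ephemeral-disk usage for every instance in the deployment;
# -r collects the per-instance output into a single results table.
bosh -d my-deployment ssh -r -c 'df -h /var/vcap/data'
```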