cloudfoundry / bosh

Cloud Foundry BOSH is an open source tool chain for release engineering, deployment and lifecycle management of large scale distributed services.
https://bosh.io
Apache License 2.0

Updating VMs fails due to lack of space in /var/vcap/data #2178

Closed jhvhs closed 4 years ago

jhvhs commented 5 years ago

**Is your feature request related to a problem? Please describe.**
When the deployed VM uses a large release, updating it after a long uptime may fail with an unclear error message. In our case the failure was caused by the fact that both packages and system logs are stored under the fixed-size /var/vcap/data partition.

Our VM was using the following releases:

Once the logs directory contained several hundred megabytes of data, the upgrade invariably failed with the error message `Response exceeded maximum allowed length`, which was in fact caused by a crash due to insufficient disk space.

**Describe the solution you'd like**
The ideal solution would be to change the way new releases are applied to the VM during an update:

  1. The agent determines which packages actually need updating, instead of blindly installing all of the packages.
  2. The agent verifies whether there is sufficient space to perform the update (a rough sketch of such a check follows this list), and if there is insufficient space:
     i. If enough space could be reclaimed by wiping out the logs, tmp and friends, do that.
     ii. If there is no way to recover enough space, nuke the VM and create a new one with the new release packages installed.
     iii. If there is no way the packages could fit into /var/vcap/data, fail with a clear error message explaining the problem.
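
A minimal sketch of the kind of pre-flight check described in point 2, assuming (hypothetically) that the incoming compiled packages are staged somewhere on disk before being applied, and that comparing their `du` size against the free space reported by `df` for /var/vcap/data is a good-enough approximation:

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: refuse to apply new packages unless
# /var/vcap/data has enough free space to hold the staged payload.
set -euo pipefail

STAGED_PACKAGES="${1:?path to staged compiled packages}"  # hypothetical staging dir

# Size of the incoming packages, in KiB.
needed_kb=$(du -sk "$STAGED_PACKAGES" | awk '{print $1}')

# Free space on the ephemeral data partition, in KiB.
free_kb=$(df -Pk /var/vcap/data | awk 'NR==2 {print $4}')

if (( free_kb < needed_kb )); then
  echo "Not enough space on /var/vcap/data: need ${needed_kb}K, have ${free_kb}K" >&2
  echo "Clear /var/vcap/data/tmp and old logs, or recreate the VM" >&2
  exit 1
fi
```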

**Describe alternatives you've considered**
For us, running `bosh recreate` to clean up the VM was sufficient, once we had spent about an hour digging through the logs to discover the issue.
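
For reference, the workaround amounts to something like this (the deployment and instance names below are placeholders):

```bash
# Recreate the affected instance from scratch; the VM is deleted and rebuilt,
# so /var/vcap/data starts empty and only the current packages are reinstalled.
bosh -d my-deployment recreate mysql/0

# Or recreate every instance in the deployment.
bosh -d my-deployment recreate
```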

**Additional context**
This is the layout of /var/vcap/data/packages on the VMs:

| size before the update | size on failed VM 1 | size on failed VM 2 | package name |
|---|---|---|---|
| 14M | 14M | 14M | blackbox |
| 41M | 41M | 41M | bosh-dns |
| 18M | 38M | 38M | bpm |
| 9.4M | 19M | 19M | configure-leader-follower |
| 460M | 460M | 460M | dd-agent |
| 2.2M | 4.3M | 4.3M | generate-auto-tune-mysql |
| 30M | 30M | 30M | loggregator_agent |
| 12M | 24M | 24M | mysql-agent |
| 13M | 26M | 26M | mysql-metrics |
| 6.8M | 14M | 14M | mysql-restore |
| 445M | 892M | 445M (insufficient space) | percona-server |
| 12K | 20K | 20K | pid-utils |
| 24M | 50M | 50M | service-backup |
| 66M | 132M | 132M | service-backup_aws-cli |
| 225M | 450M | 450M | service-backup_blobxfer |
| 190M | 379M | 379M | service-backup_python |
| 12K | 20K | 20K | service-backup_utils |
| 7.7M | 16M | 16M | streaming-mysql-backup-client |
| 7.7M | 16M | 16M | streaming-mysql-backup-tool |
| 56M | 111M | 111M | thermostat |
| 562M | 562M (insufficient space) | 1.1G | xtrabackup |
| **2.2Gb** | **3.4Gb** | **3.5Gb** | **total space used** |

According to `df -h` on the xenial stemcell 170.48, the partition mounted at /var/vcap/data is 3.9Gb.

cf-gitbot commented 5 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/165675450

The labels on this github issue will be updated when the story is started.

bengtrj commented 5 years ago

Hi, I'm interested in this issue as well. I had a similar problem upgrading PKS: the `upgrade-all-clusters` errand failed and the bosh task showed this cryptic error:

```
Error: Action Failed get_task: Task 68ef0800-7df3-439a-7cd5-bc141ec2dc08 result: 1 of 5 post-start scripts failed. Failed Jobs: sink-resources-images. Successful Jobs: bosh-dns, telemetry-agent-image, wavefront-proxy-images, kubelet.
```

To diagnose it, I had to ssh into the worker VM and run the job's post-start script again, which showed:

```
worker/bfc02afd-f283-4c35-94c1-3b6ba26d6198:/var/vcap/jobs/sink-resources-images# bin/post-start
[Thu May  2 15:34:46 UTC 2019 Loading cached container: /var/vcap/packages/sink-agent/container-images/oratos_cert-generator.v0.12.tgz
Loaded image: oratos/cert-generator:v0.12
(...)
[Thu May  2 15:34:53 UTC 2019 Loading cached container: /var/vcap/packages/sink-agent/container-images/oratos_fluent-bit-out-syslog.v0.11.1.tgz

gzip: /var/vcap/packages/sink-agent/container-images/oratos_fluent-bit-out-syslog.v0.11.1.tar: No space left on device
```

Out of curiosity:

```
worker/bfc02afd-f283-4c35-94c1-3b6ba26d6198:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        2.0G     0  2.0G   0% /dev
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           2.0G  221M  1.8G  12% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
/dev/sda1       2.8G  1.4G  1.3G  54% /
/dev/sda3       4.1G  4.1G     0 100% /var/vcap/data
tmpfs           1.0M   32K  992K   4% /var/vcap/data/sys/run
/dev/sdb1        50G  3.2G   44G   7% /var/vcap/store
```

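For reference, the diagnosis above boils down to roughly these steps (the deployment name is a placeholder; the instance GUID is the one from the output above):

```bash
# SSH into the affected worker VM through the BOSH director.
bosh -d my-pks-cluster ssh worker/bfc02afd-f283-4c35-94c1-3b6ba26d6198

# On the VM: re-run the failing post-start script to surface the real error,
# then check how full the ephemeral data partition is.
sudo -i
cd /var/vcap/jobs/sink-resources-images && bin/post-start
df -h /var/vcap/data
```
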
cjnosal commented 5 years ago

The bosh director can't provide a clear message or check for 'sufficient space' because it doesn't know the implementation of individual bosh releases. There may be enough space for the bosh-agent to download the package blob, but bosh can't know how much space a running job will consume (e.g. in this case downloading container images). bosh-agent runs the scripts provided by the release, and reports back the exit code.

Bosh should not indiscriminately delete logs. Log retention and forwarding should be managed by the operator, e.g. by using the syslog release.
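
As a rough illustration rather than an official recipe (the release version, addon name, and syslog endpoint below are placeholders, and property names may vary between syslog-release versions), forwarding job logs off the VM typically looks something like this:

```bash
# Upload the syslog release and attach its forwarder job to every VM via a
# runtime config, so logs are shipped off the ephemeral disk instead of
# accumulating under /var/vcap/data/sys/log.
bosh upload-release https://bosh.io/d/github.com/cloudfoundry/syslog-release

cat > syslog-runtime-config.yml <<'EOF'
releases:
- name: syslog
  version: <uploaded version>

addons:
- name: forward-logs
  jobs:
  - name: syslog_forwarder
    release: syslog
    properties:
      address: logs.example.com   # placeholder syslog endpoint
      port: 514
      transport: udp
EOF

# On newer directors, --name keeps this as a separate named runtime config.
bosh update-runtime-config --name=syslog syslog-runtime-config.yml
```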

There is also a bosh metrics forwarder for monitoring VM resources.

mrosecrance commented 4 years ago

Closing as there are other supported ways to detect these issues.

asayles commented 2 years ago

@mrosecrance What are they?

mrosecrance commented 2 years ago

We recommend users use something like the bosh metrics forwarder to monitor VM resources. I'm fairly sure (I don't have an env to double-check right this moment) that changing the ephemeral disk size also changes the size of /var/vcap/data, so operators could set up alerts, see that the ephemeral disk is filling up, and bump the disk size.
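
As a hedged illustration of that kind of monitoring (the deployment name is a placeholder), the current usage of /var/vcap/data can be checked across all instances with the bosh CLI; growing the VM type's ephemeral disk in the cloud config then takes effect on the next deploy:

```bash
# Print ephemeral-disk usage for every instance in the deployment;
# -r collects the per-instance output into a single results table.
bosh -d my-deployment ssh -r -c 'df -h /var/vcap/data'
```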