Closed: jhvhs closed this issue 4 years ago
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/165675450
The labels on this github issue will be updated when the story is started.
Hi, I'm interested in this issue as well.
I had a similar problem upgrading PKS: the upgrade-all-clusters
errand failed and the bosh task shows this cryptic error:
Error: Action Failed get_task: Task 68ef0800-7df3-439a-7cd5-bc141ec2dc08 result: 1 of 5 post-start scripts failed. Failed Jobs: sink-resources-images. Successful Jobs: bosh-dns, telemetry-agent-image, wavefront-proxy-images, kubelet.
To be able to diagnose it, I had to ssh into the worker VM and run the job's post-start script again, which showed:
worker/bfc02afd-f283-4c35-94c1-3b6ba26d6198:/var/vcap/jobs/sink-resources-images# bin/post-start
[Thu May 2 15:34:46 UTC 2019 Loading cached container: /var/vcap/packages/sink-agent/container-images/oratos_cert-generator.v0.12.tgz
Loaded image: oratos/cert-generator:v0.12
(...)
[Thu May 2 15:34:53 UTC 2019 Loading cached container: /var/vcap/packages/sink-agent/container-images/oratos_fluent-bit-out-syslog.v0.11.1.tgz
gzip: /var/vcap/packages/sink-agent/container-images/oratos_fluent-bit-out-syslog.v0.11.1.tar: No space left on device
Out of curiosity:
worker/bfc02afd-f283-4c35-94c1-3b6ba26d6198:~$ df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 2.0G 0 2.0G 0% /dev
tmpfs 2.0G 0 2.0G 0% /dev/shm
tmpfs 2.0G 221M 1.8G 12% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 2.0G 0 2.0G 0% /sys/fs/cgroup
/dev/sda1 2.8G 1.4G 1.3G 54% /
/dev/sda3 4.1G 4.1G 0 100% /var/vcap/data
tmpfs 1.0M 32K 992K 4% /var/vcap/data/sys/run
/dev/sdb1 50G 3.2G 44G 7% /var/vcap/store
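In case it helps anyone else, the same check can be run fleet-wide from the director before an upgrade; a sketch, assuming a v2 bosh CLI with the `-c` and `--results` flags, and with the deployment name as a placeholder:

```bash
# Report /var/vcap/data usage for every instance of <my-deployment>.
bosh -d <my-deployment> ssh -c 'df -h /var/vcap/data' --results
```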
The bosh director can't provide a clear message or check for 'sufficient space' because it doesn't know the implementation of individual bosh releases. There may be enough space for the bosh-agent to download the package blob, but bosh can't know how much space a running job will consume (e.g. in this case downloading container images). bosh-agent runs the scripts provided by the release, and reports back the exit code.
Bosh should not indiscriminately delete logs. Log retention and forwarding should be managed by the operator, e.g. by using the syslog release.
There is also a bosh metrics forwarder for monitoring VM resources.
Closing as there are other supported ways to detect these issues.
@mrosecrance What are they?
We recommend users use something like the bosh metrics forwarder to monitor VM resources. I'm fairly sure (I don't have an env to double-check right this moment) that changing the ephemeral disk size will change the size of /var/vcap/data, so operators could set up alerts, see that the ephemeral disk is filling up, and bump the disk size.
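As a rough illustration of the kind of alert an operator could wire up (a hypothetical script, not part of any bosh release or of the metrics forwarder):

```bash
#!/usr/bin/env bash
# Hypothetical threshold alert for ephemeral disk usage; illustrative only.
threshold=80
usage=$(df --output=pcent /var/vcap/data | tail -n 1 | tr -dc '0-9')
if [ "${usage}" -ge "${threshold}" ]; then
  echo "WARNING: /var/vcap/data is ${usage}% full" | logger -t disk-check
fi
```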
Is your feature request related to a problem? Please describe.
When the deployed VM uses a large-sized release, updating it after a long uptime may fail with an unclear error message. In our case the failure was caused by the fact that both `packages` and `sys/log` are stored under the fixed-size `/var/vcap/data` partition. Our VM was using the following releases:
bosh-dns/1.10.0
bosh-dns-aliases/0.0.3
bpm/1.0.4
datadog-agent/2.7.6100
dedicated-mysql/0.83.2
loggregator-agent/2.2
mysql-backup/2.9.0
mysql-monitoring/9.3.0
pxc/0.15.0
routing/0.184.0
service-backup/18.2.0
syslog-migration/11.1.1
When the `logs` directory contained several hundred megabytes of data, the upgrade invariably failed with the error message `Response exceeded maximum allowed length`, which was caused by a crash due to insufficient disk space.

Describe the solution you'd like
The ideal solution would be to change the way new releases are applied to the VM during the update:
i. If enough space can be recovered by cleaning up `logs`, `tmp` and friends, do it.
ii. If there is no way to recover enough space, nuke the VM and create a new one with the new release packages installed.
iii. If there is no way the packages could fit into `/var/vcap/data`, fail with a clear error message explaining the problem (a sketch of such a check follows this list).
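A minimal sketch, assuming a pre-start script and a hard-coded space requirement (both hypothetical; no bosh release ships this today), of what that kind of guard could look like:

```bash
#!/usr/bin/env bash
# Hypothetical pre-start guard: fail early with a clear message when
# /var/vcap/data cannot hold the packages this job is about to unpack.
required_kb=$((2 * 1024 * 1024))   # 2 GiB requirement, illustrative only
avail_kb=$(df --output=avail /var/vcap/data | tail -n 1 | tr -d ' ')
if [ "${avail_kb}" -lt "${required_kb}" ]; then
  echo "Not enough space on /var/vcap/data: need ${required_kb} KB, have ${avail_kb} KB" >&2
  exit 1
fi
```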
Describe alternatives you've considered
For us, running `bosh recreate` to clean up the VM was sufficient, once we had spent about an hour digging through the logs to discover the issue.

Additional context
This is the layout of `/var/vcap/data/packages` on the VMs: 892M and 445M (insufficient space). According to `df -h` on the xenial stemcell 170.48, the size of the partition mounted as `/var/vcap/data` is 3.9 GB.
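A quick way to see where that space actually goes (standard bosh paths; a sketch rather than output from our VMs):

```bash
# Largest package and log directories under the ephemeral disk.
du -sh /var/vcap/data/packages/* /var/vcap/data/sys/log/* 2>/dev/null | sort -h | tail
# Overall usage of the fixed-size ephemeral partition.
df -h /var/vcap/data
```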