Occasional disk repair errors on GrowRootAPFSVolume

timsutton commented 8 months ago

Very tangentially-related to https://github.com/aws/ec2-macos-init/issues/41, we've seen a different type of error also occasionally, across a couple versions of macOS 13 (most recently, 13.5 which is what we're currently running):

2023/12/19 10:16:07.759431 Error while running module [GrowRootAPFSVolume] (type: command, group: 3) with message:  and err: ec2macosinit: error executing command [[/bin/zsh -c ec2-macos-utils grow --id root]] with stdout [] and stderr [time="19 Dec 23 10:15 GMT" level=info msg="Configuring diskutil for product" product="macOS Ventura 13.5.0"
time="19 Dec 23 10:15 GMT" level=info msg="Attempting to grow container..." device_id=disk5s2s1
time="19 Dec 23 10:15 GMT" level=info msg="Checking if device can be APFS resized..." device_id=disk5s2s1
time="19 Dec 23 10:15 GMT" level=info msg="Device can be resized"
time="19 Dec 23 10:15 GMT" level=info msg="Repairing the parent disk..."
time="19 Dec 23 10:15 GMT" level=info msg="Repairing parent disk..." parent_id=disk4
time="19 Dec 23 10:15 GMT" level=info msg="Successfully repaired the parent disk"
time="19 Dec 23 10:15 GMT" level=info msg="Fetching amount of free space on device..." device_id=disk5
time="19 Dec 23 10:15 GMT" level=info msg="Resizing container to maximum size..." device_id=disk5 free_space="268 GB"
Error: diskutil: failed to run diskutil command to resize the container, stderr [Error: -69716: Storage system verify or repair failed
]: error waiting for specified command to exit: exit status 1]: exit status 1

In this case, it was a 250GB AMI launched with a 500GB volume that it wanted to grow into.

The impact seemed to be that it's allowed to continue running through the init steps (as it's configured as best effort), but that we end up with a host in production with a smaller volume than was expected. While we have some automated cleanup in place, the size difference was enough that we didn't really have enough free space to reclaim from.

If I run sudo ec2-macos-utils grow --id root later I was able to grow it properly, so I'm guessing that perhaps it is just unlucky timing. We don't see this widely but it's come up in our logs 3-4 times over the past few months.

Just thinking whether there might be some more graceful way to handle this type of error or at least a retry. The timestamps of 10:15:51 -> 10:16:07 is only 16 seconds. I'd be ok to put up with it retrying this one a couple more times if it there's a good chance of it succeeding with a retry. Is there any mechanism supported in ec2-macos-init where we could add retries as a generic thing? (I see this but it looks like it's specific to that one type of task).

Also if https://github.com/aws/ec2-macos-init/issues/41 were addressed such that it doesn't even try to grow the volume if it won't be able to grow it any further, you could probably reduce the total runtime somewhat for the cases where the volume size matches anyway?

jahkeup commented 8 months ago

Thanks for taking the time to report this @timsutton - the incompleted resize in the circumstances here does seem spurious. How frequently do you see instances starting up without successfully resizing?

We'll need take steps to reproduce so that we can narrow down the cause, it could be an issue with the timing of disk metadata read and the updated partition table being reflected. If you have any further info, please share - otherwise the team will dig into this and identify fixes and/or mitigations.

timsutton commented 8 months ago

Sometime in early December, we switched to from booting all our instances with this config:

EBS Volume size: 500GB AMI size: 500GB

to one with this config:

EBS Volume size: 500GB AMI size: 250GB

In other words we changed from a snapshot which had a volume that would never need to be resized, to a volume that would need to. So far I have seen this failure twice.

But I do also recall, previous to this switch, sometimes seeing the GrowRootAPFSVolume task failing. It was similarly infrequent, eg. once a month. We ship the ec2-macos-init logs to a centralized place so we can search it among other sources for errors.

timsutton commented 8 months ago

We've now seen 2-3 more instances of this since my last comment above. We're going to put in our own workaround to retry / force the grow operation, then we should be able to track it over time in terms of how frequently it may be happening.

aws / ec2-macos-init

Occasional disk repair errors on GrowRootAPFSVolume #48