elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Fail writing snapshot data to filesystem repo if space running out #67790

Open DaveCTurner opened 3 years ago

DaveCTurner commented 3 years ago

A user reported running out of disk space in their shared filesystem repository, which left it completely stuck: they were unable to take any further action, since everything that might delete existing data (even repository cleanup, AFAICT) starts by writing another metadata file to the repository before proceeding, and there wasn't even enough space to do that.

Perhaps we should refuse to write data blobs (but not metadata blobs) to a shared filesystem repository when it is nearly full, leaving at least a few MB of wiggle room for cleanup and recovery from filling up the disk.
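
For illustration only, a minimal sketch of what such a guard might look like for a filesystem repository. This is not Elasticsearch code: the class, the method, and the 100 MiB margin are invented for this example; it just shows the basic free-space check via `java.nio.file.FileStore`.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical guard: refuse to write data blobs (but not metadata blobs)
// once the repository filesystem is nearly full, keeping some wiggle room
// for the metadata writes needed to delete snapshots later.
final class RepositorySpaceGuard {

    private static final long RESERVED_BYTES = 100L * 1024 * 1024; // ~100 MiB margin (illustrative)

    static void ensureSpaceForDataBlob(Path repoRoot, long blobSizeBytes) throws IOException {
        FileStore store = Files.getFileStore(repoRoot);
        long usable = store.getUsableSpace();
        if (usable - blobSizeBytes < RESERVED_BYTES) {
            throw new IOException(
                "refusing to write data blob of " + blobSizeBytes + " bytes: only " + usable
                    + " bytes usable on [" + store.name() + "], keeping " + RESERVED_BYTES
                    + " bytes reserved for metadata writes and snapshot deletion"
            );
        }
    }
}
```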


Workaround

  1. When space runs out:
     a. disable SLM
     b. ensure there are no ongoing snapshots
     c. extend the filesystem that contains the repo by 100MiB or so
     d. delete some snapshots to free up space
     e. shrink the filesystem to its original size

Alternative workaround

  1. Ahead of time, create a ~100MiB file in the same filesystem as the repo to reserve some space (see the sketch after this list).
  2. When space runs out:
     a. disable SLM
     b. ensure there are no ongoing snapshots
     c. delete the reserved-space file created in step 1
     d. delete some snapshots to free up space
     e. create another ~100MiB reserved-space file
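
As a rough illustration of step 1, the reserved-space file can be created with a few lines of standalone Java that write real (non-sparse) data, so that deleting the file later genuinely frees the blocks. The path and size below are placeholders, not anything Elasticsearch itself provides.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Pre-allocates a reserved-space file by writing real data (not a sparse file),
// so deleting it later actually frees space on the repository filesystem.
final class ReservedSpaceFile {

    static void create(Path file, long bytes) throws IOException {
        byte[] chunk = new byte[1024 * 1024]; // write 1 MiB of zeros at a time
        try (OutputStream out = Files.newOutputStream(file)) {
            for (long written = 0; written < bytes; written += chunk.length) {
                out.write(chunk, 0, (int) Math.min(chunk.length, bytes - written));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path: use the same filesystem as the snapshot repository.
        create(Path.of("/mnt/es-snapshots/reserved-space.bin"), 100L * 1024 * 1024);
    }
}
```
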
elasticmachine commented 3 years ago

Pinging @elastic/es-distributed (Team:Distributed)

original-brownbear commented 3 years ago

It's a little tricky to estimate the wiggle room here because we'd effectively need to make sure that we have enough space to write a new index-N at the root as well as a new index-uuid blob for each shard, the combined size of which can vary widely. But IMO a best-effort guess of, say, 100MB should cover pretty much all cases.

DaveCTurner commented 3 years ago

Relates #26730 and associated ERs

DaveCTurner commented 3 years ago

Another user reported the same issue.

DaveCTurner commented 2 years ago

Thinking about this some more, it's very unlikely that a successful snapshot would leave the repo too full to do a delete, and it's pretty likely that we ran out of space in the middle of writing some data blobs. Today, I think, if that happens we don't clean the dangling blobs up, we just leave everything in place; so if we cleaned those up, there's a pretty good chance we could free up enough space to delete some snapshots even if the repo completely runs out of space.

lost2 commented 2 years ago

While waiting for a fix for this problem, I usually create a 500MB "dummy-file-to-delete" in the repository directory, which can be manually deleted to free up space, allowing "Delete snapshot" to work again.

jkervine commented 2 years ago

Just encountered this on one of my non-production systems today. Because this was not production, I did the following:

Seemed to work, no errors in logs at least and new snapshots are created and restored ok... am I in for a lot of surprises in the future?

DaveCTurner commented 2 years ago

> Seemed to work, no errors in logs at least and new snapshots are created and restored ok... am I in for a lot of surprises in the future?

Maybe. I don't think we can give a more confident answer than that - this is definitely not a supported or tested workflow, and "seemed to work" is a very weak indicator of repository integrity. As the docs say:

> Don’t modify anything within the repository or run processes that might interfere with its contents. If something other than Elasticsearch modifies the contents of the repository then future snapshot or restore operations may fail, reporting corruption or other data inconsistencies, or may appear to succeed having silently lost some of your data.

It'd be better if we had an option to do dry-run restores (https://github.com/elastic/elasticsearch/issues/54940) or other kinds of integrity checks (https://github.com/elastic/elasticsearch/issues/52622) but these proposed features are not under active development right now.

lost2 commented 1 year ago

Any news on this "enhancement" request? We're on v8 by now and snapshots keep filling up the filesystem to no avail. Thanks

maggieghamry commented 1 year ago

@DaveCTurner is there a known workaround for this situation?

DaveCTurner commented 1 year ago

I added some notes on ahead-of-time protection to the OP.

Once you reach this situation then the only workaround is to extend the filesystem (temporarily) and delete some snapshots.

jerrac commented 1 year ago

Is there any current movement on this? For various reasons I wasn't able to address my snapshot disk space issues before they hit 100%. Extending the disk is currently not an option. Right now I can't even get the list of snapshots to load via Kibana or curl. I'm working on getting more space, but for now I'm stuck.

I've used a lot of systems that make the amount of space to reserve an option. I'd say a simple way to partially fix this is to add such an option here too. Then, before taking a snapshot, check whether the threshold is met; if it is, just don't take the snapshot. I'd think that'd be fairly straightforward to implement.

DaveCTurner commented 1 year ago

It's not on anyone's roadmap right now, but it sounds like you're volunteering to take it on @jerrac? If so, a PR would be very welcome. IMO it'd be better to focus on cleaning up dangling blobs after a failed snapshot, as per my earlier comment, but if you'd prefer to try the reserved-space route then we'd appreciate that too.

jerrac commented 1 year ago

@DaveCTurner Er, Java is not really my forte. Haven't actually touched it since an internship 10+ years ago... I did go poke around and found this section of code: https://github.com/elastic/elasticsearch/blob/1ae63ac315487446342fb61bf7122247d468ad5c/server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java#L671-L705 After I got over the oddness of seeing the same method defined twice (overloaded) in the same class, that is where I'd imagine it might (emphasis on might) make sense to add a disk space check.

Maybe add a "RepositoryFullException" class, and then somehow throw one when the disk is getting full.

No idea how I'd actually add an option to repository creation that would let users set the threshold.
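
For what it's worth, here is a standalone sketch of that idea, with everything hedged: `RepositoryFullException`, the threshold parameter, and the check itself are invented for illustration and are not wired into SnapshotsService or repository settings.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical exception type, as suggested above; not part of Elasticsearch.
class RepositoryFullException extends IOException {
    RepositoryFullException(String message) {
        super(message);
    }
}

// Sketch of a pre-snapshot check against a configurable free-space threshold.
final class SnapshotSpaceCheck {

    // repoRoot: root of the shared filesystem repository
    // minFreeFraction: e.g. 0.05 to refuse new snapshots below 5% free space
    static void checkBeforeSnapshot(Path repoRoot, double minFreeFraction) throws IOException {
        FileStore store = Files.getFileStore(repoRoot);
        double freeFraction = (double) store.getUsableSpace() / store.getTotalSpace();
        if (freeFraction < minFreeFraction) {
            throw new RepositoryFullException(String.format(
                "repository filesystem [%s] has %.1f%% free space, below the %.1f%% threshold; refusing to start snapshot",
                store.name(), freeFraction * 100, minFreeFraction * 100));
        }
    }
}
```
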

Anyway, I might poke at setting up an actual dev environment, but I'm not sure I'll have time very soon. :\

Hopefully someone else will have time, and skill, to jump on this soon. :)

DaveCTurner commented 11 months ago

https://github.com/elastic/elasticsearch/pull/99694 should effectively solve this in practice with high probability, since it's very likely that you hit the disk-full situation while writing data blobs, which are now cleaned up on failure, leaving enough space in the repository for the metadata operations needed to delete snapshots.

That leaves open the (much more remote) possibility that the disk fills up when writing metadata. However, most repository implementations do not have a meaningful notion of "space running out", so it turns out to be fairly tricky to implement the idea suggested in the OP in a general-purpose way:

> Perhaps we should refuse to write data blobs (but not metadata blobs) to a shared filesystem repository when it is nearly full, leaving at least a few MB of wiggle room for cleanup and recovery from filling up the disk.

Instead maybe we should consider extending https://github.com/elastic/elasticsearch/issues/81352 to allow storing data and metadata in wholly different locations (e.g. different filesystems or S3 buckets) so that we can be sure to have space for metadata operations even if the data location is full. It would also likely help to do https://github.com/elastic/elasticsearch/issues/75623 and https://github.com/elastic/elasticsearch/issues/100115, and https://github.com/elastic/elasticsearch/issues/52622 would also be able to identify dangling data.

Since we're not currently planning to address the remaining possibility that the disk fills up when writing metadata, and there are other open issues to track alternative ideas, I'm closing this.

jerrac commented 11 months ago

I have to admit I'm confused. Why not just actually check that there is enough space before even starting a snapshot?

The proposed solution is that a failed write due to full disk will somehow fix our problems because the blob that failed to write will get deleted, which might leave enough space for metadata to be written.

That's relying on something breaking in order to stop something else from breaking.

If airlines relied on planes failing to take off to determine if the plane had too much stuff in it, would that be an acceptable solution?

Shouldn't it be that we try to stop something from breaking in the first place?

I mean, Elasticsearch has that kind of logic in other places. It limits the total number of shards and will stop allocating indices to filesystems that are almost full. That's all to prevent a problem before it occurs. Right?

DaveCTurner commented 11 months ago

> Why not just actually check that there is enough space before even starting a snapshot?

The word "just" is loadbearing in that question :) We can't accurately determine the space needed up front, or at least it would be significant extra computation, because of having to account for deduplication. And then filesystems don't really guarantee that the free space they report means we can actually write that many bytes, because of overheads lost to incomplete blocks and so on. And then there's other users of the same filesystem consuming or freeing space too. And finally none of the cloud repo APIs have a way to even query the available free space.

If airlines had a way to handle failed-to-take-off as gracefully as we now handle disk-full in a repository then I expect they would indeed use that rather than all the effort and procedures (and capacity lost to safety margins) they have today.

jerrac commented 11 months ago

I can get that calculating how much space you need beforehand is not feasible.

But, just like with storing indices, you can check for a percentage of free space and then refuse to start a snapshot if that percentage isn't available.

Would it really matter for S3 APIs? I thought the whole point of that kind of storage was to not have to deal with running out of space: you just keep paying for more as you use more.

Anyway, I'll leave it at that. I'm probably not going to bother with snapshots in the future anyway. This issue, plus the fact they require snapshotting live data and can't be limited to just old data you want to archive (at least as far as I can tell...), means they don't do the job I want.

DaveCTurner commented 11 months ago

> But, just like with storing indices, you can check for a percentage of free space and then refuse to start a snapshot if that percentage isn't available.

We do this with indices because the consequences of hitting disk-full while writing indices are rather severe. If we could safely do so, we'd run disks up to capacity in this area too.

> Would it really matter for S3 APIs? I thought the whole point of that kind of storage was to not have to deal with running out of space: you just keep paying for more as you use more.

Very much so: a substantial fraction of users store their snapshots in on-prem storage which claims some level of S3 compatibility, but none of those on-prem systems correctly emulate S3's lack of space constraints. (Whether they should be doing this is a whole other question, but unfortunately not one whose answer really matters in practice.)

lost2 commented 11 months ago

Having reported this back in Jan 2021 (18 versions ago - v7.10 -> v8.10), I'm glad to know this issue is finally being addressed.

Also, I'm surprised that this can even happen: "When shard level data files fail to write for snapshots, these data files become dangling and will not be referenced anywhere. Today we leave them as is". What's causing this "files fail to write" anyway? Thx

DaveCTurner commented 11 months ago

> What's causing this "files fail to write" anyway?

It could be anything really (you might be amazed how flaky some users' storage is) but in the context of this issue the problem that matters most is running out of disk space.

Sergi-GC commented 5 months ago

Added the workarounds mentioned here in KB article https://support.elastic.co/knowledge/b1186c52