camunda / camunda

Process Orchestration Framework
https://camunda.com/platform/

Handle I/O exceptions during flush #14843

Open npepinpe opened 10 months ago

npepinpe commented 10 months ago

Describe the bug

It seems that, right now, errors that occur during flush in Raft are uncaught and end up causing the partitions to go inactive. This mostly occurs when running Zeebe on an NFS file system.

This occurred in a support issue: https://jira.camunda.com/browse/SUPPORT-18929

- [ ] https://github.com/camunda/zeebe/issues/14867
- [ ] https://github.com/camunda/zeebe/issues/14864
- [ ] https://github.com/camunda/zeebe/issues/14865
- [ ] https://github.com/camunda/zeebe/issues/14866

Expected behavior

Flush errors are appropriately retried by retrying the entire append.

Log/Stacktrace

Full Stacktrace

```
2023-10-20 16:58:48.185 [] [raft-server-0-raft-partition-partition-1] [] ERROR io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-1} - An uncaught exception occurred, transition to inactive role
java.io.UncheckedIOException: java.io.IOException: Resource temporarily unavailable
    at java.nio.MappedMemoryUtils.force(Unknown Source) ~[?:?]
    at java.nio.Buffer$1.force(Unknown Source) ~[?:?]
    at jdk.internal.misc.ScopedMemoryAccess.forceInternal(Unknown Source) ~[?:?]
    at jdk.internal.misc.ScopedMemoryAccess.force(Unknown Source) ~[?:?]
    at java.nio.MappedByteBuffer.force(Unknown Source) ~[?:?]
    at java.nio.MappedByteBuffer.force(Unknown Source) ~[?:?]
    at io.camunda.zeebe.journal.file.Segment.flush(Segment.java:125) ~[zeebe-journal-8.2.12.jar:8.2.12]
    at io.camunda.zeebe.journal.file.SegmentsFlusher.flush(SegmentsFlusher.java:58) ~[zeebe-journal-8.2.12.jar:8.2.12]
    at io.camunda.zeebe.journal.file.SegmentedJournalWriter.flush(SegmentedJournalWriter.java:113) ~[zeebe-journal-8.2.12.jar:8.2.12]
    at io.camunda.zeebe.journal.file.SegmentedJournal.flush(SegmentedJournal.java:159) ~[zeebe-journal-8.2.12.jar:8.2.12]
    at io.atomix.raft.storage.log.RaftLogFlusher$DirectFlusher.flush(RaftLogFlusher.java:73) ~[zeebe-atomix-cluster-8.2.12.jar:8.2.12]
    at io.atomix.raft.storage.log.RaftLog.flush(RaftLog.java:184) ~[zeebe-atomix-cluster-8.2.12.jar:8.2.12]
    at io.atomix.raft.roles.PassiveRole.flush(PassiveRole.java:543) ~[zeebe-atomix-cluster-8.2.12.jar:8.2.12]
    at io.atomix.raft.roles.PassiveRole.appendEntries(PassiveRole.java:535) ~[zeebe-atomix-cluster-8.2.12.jar:8.2.12]
    at io.atomix.raft.roles.PassiveRole.handleAppend(PassiveRole.java:367) ~[zeebe-atomix-cluster-8.2.12.jar:8.2.12]
    at io.atomix.raft.roles.ActiveRole.onAppend(ActiveRole.java:50) ~[zeebe-atomix-cluster-8.2.12.jar:8.2.12]
    at io.atomix.raft.roles.FollowerRole.onAppend(FollowerRole.java:187) ~[zeebe-atomix-cluster-8.2.12.jar:8.2.12]
    at io.atomix.raft.impl.RaftContext.lambda$registerHandlers$14(RaftContext.java:293) ~[zeebe-atomix-cluster-8.2.12.jar:8.2.12]
    at io.atomix.raft.impl.RaftContext.lambda$runOnContext$21(RaftContext.java:304) ~[zeebe-atomix-cluster-8.2.12.jar:8.2.12]
    at io.atomix.utils.concurrent.SingleThreadContext$WrappedRunnable.run(SingleThreadContext.java:178) [zeebe-atomix-utils-8.2.12.jar:8.2.12]
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
    at java.util.concurrent.FutureTask.run(Unknown Source) [?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: java.io.IOException: Resource temporarily unavailable
    at java.nio.MappedMemoryUtils.force0(Native Method) ~[?:?]
    ... 26 more
```

Environment:

npepinpe commented 10 months ago

A workaround is to use the delayed flush with a very low delay, as the DelayedFlusher handles flush errors properly, whereas the direct flusher's errors are completely unhandled.
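For illustration, a minimal sketch of the retry-on-error pattern that the delayed flush relies on, assuming a hypothetical `RetryingDelayedFlusher` and `FlushTarget` (not the actual Zeebe `DelayedFlusher` implementation):

```java
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: flushes are scheduled off the hot path, and an IOException is
// swallowed and retried later instead of propagating into the Raft role and taking
// the partition inactive.
final class RetryingDelayedFlusher implements AutoCloseable {

  interface FlushTarget {
    void flush() throws IOException; // assumed wrapper around the journal flush
  }

  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final FlushTarget journal;
  private final Duration delay;

  RetryingDelayedFlusher(final FlushTarget journal, final Duration delay) {
    this.journal = journal;
    this.delay = delay;
  }

  void scheduleFlush() {
    scheduler.schedule(this::tryFlush, delay.toMillis(), TimeUnit.MILLISECONDS);
  }

  private void tryFlush() {
    try {
      journal.flush();
    } catch (final IOException e) {
      // unlike the direct flusher, the error is handled: simply try again later
      scheduleFlush();
    }
  }

  @Override
  public void close() {
    scheduler.shutdownNow();
  }
}
```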

lenaschoenburg commented 10 months ago

I spent a bit of time reading up on fsync-gate and related issues:

Also, I found this paper really informative: Can Applications Recover from fsync Failures?

What I gathered so far:

  1. fsync does not reliably report errors; checking for errors might clear error flags. This is slowly improving but depends on filesystem support, kernel versions, etc.
  2. Dirty pages that couldn't be flushed are marked as clean automatically.
  3. The content of these pages after a failed flush depends on the operating system and filesystem (btrfs rolls pages back to their previous contents, ext4 does not).
  4. Crashing is not a valid recovery strategy - the page cache contains unknown data

If my understanding is correct, DelayedFlusher is not correct because it simply retries the flush, which might be a no-op since all pages are marked clean after the initial flush failure.

As @npepinpe suggested, we could retry the entire write operation when a flush failed. That should overwrite the unknown, possibly corrupted, contents of the page cache and mark the pages as dirty which should result in another attempt to flush these pages to disk. However, we don't always have the full data that was supposed to be flushed. For example, a leader flushes whenever the commit index is updated. If that flush fails, we can no longer trust the page cache so we'd need to write everything since the last successful flush again. That'd require us to buffer every write in memory, separately from the page cache, which is probably a no-go.

npepinpe commented 10 months ago

Crashing is not a valid recovery strategy - the page cache contains unknown data

I think crashing may be a valid recovery in most cases for us, specifically a standard k8s deployment via stateful set. As long as the volume is unmounted/mounted, even rescheduling the pod to the same node should be fine as the page cache will be cleared. Crashing is a problem though if the application restarts on the same node (which was not restarted) and real (i.e. not network) disks are used.

lenaschoenburg commented 10 months ago

This particular case where a follower tried to append entries and then failed to flush might be solvable by simply retrying the write (or doing nothing and waiting for a new append request). That won't work for a leader though - unless we start flushing for every append too.

lenaschoenburg commented 10 months ago

I think crashing may be a valid recovery in most cases for us, specifically a standard k8s deployment via stateful set. As long as the volume is unmounted/mounted, even rescheduling the pod to the same node should be fine as the page cache will be cleared. Crashing is a problem though if the application restarts on the same node (which was not restarted) and real (i.e. not network) disks are used.

A simple crash will not create a new pod though. Are you sure that the volume will be unmounted in that case?

npepinpe commented 10 months ago

Ah, that's true. If it's not a new pod then no, I don't think the volume is unmounted :(

npepinpe commented 10 months ago

I would propose we handle it where possible - e.g. the follower as you mentioned - and crash everywhere else.

This behavior should be configurable however, as it would make Zeebe completely unusable on network storage.

I was hoping with newer kernels the page cache issue was not there anymore, but I guess that's not the case :( I'm not sure how we could verify this.

At any rate, we need a better solution here than the partition going inactive :smile: Let's brainstorm something

lenaschoenburg commented 10 months ago

Maybe leaders that fail to flush should reset their log to the previously flushed position and then step down? This might require some adjustments on the raft side but sounds doable.

We also need to keep the delayed flush in mind - if we want the same durability with delayed flush we can't consider unflushed data as committed.

npepinpe commented 10 months ago

@oleschoenburg and I discussed it today, and we highlighted the following issues related to flushing:

Meta store

We currently flush the meta store asynchronously, and as such the operation is not necessarily retryable. To make it so, we need to keep the complete meta file content (which is at most a few hundred bytes) in memory and simply write it out and flush again on retry. We then need to update all places calling the flush to handle retries.

I'll create an issue to handle this low hanging fruit
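A rough sketch of that idea, with illustrative names only (this is not the existing meta store code): the serialized content stays in memory so a failed flush can be retried by rewriting the whole file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: because the full meta content (a few hundred bytes) is kept in
// memory, a failed flush can be retried by rewriting the file from position 0, which
// re-dirties the pages and forces them to disk again.
final class RetryableMetaStore {
  private final Path file;
  private byte[] lastContent = new byte[0];

  RetryableMetaStore(final Path file) {
    this.file = file;
  }

  synchronized void store(final byte[] content) throws IOException {
    lastContent = content.clone();
    writeAndFlush();
  }

  // callers invoke this again if the previous attempt threw
  synchronized void retryFlush() throws IOException {
    writeAndFlush();
  }

  private void writeAndFlush() throws IOException {
    try (final FileChannel channel =
        FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
      final ByteBuffer buffer = ByteBuffer.wrap(lastContent);
      long position = 0;
      while (buffer.hasRemaining()) {
        position += channel.write(buffer, position);
      }
      channel.force(true);
    }
  }
}
```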

Snapshot checksum files

We also flush the snapshot checksum files. This is not retried right now, but our snapshot operations are resilient - on failure, the complete snapshot or replication will be retried. It's not the most efficient, but it doesn't cause any issues right now.

Directory file descriptors

In some instances we flush directories. I think these flushes can be safely retried in most cases, but it would also be acceptable to simply log a warning, since the directory flush is mostly a fail-safe.

I'll create an issue to deal with this low hanging fruit
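As a sketch of that approach (hypothetical helper, not an existing Zeebe utility), a directory flush could be retried a bounded number of times and then downgraded to a warning:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

final class DirectoryFlusher {
  private static final Logger LOG = LoggerFactory.getLogger(DirectoryFlusher.class);

  // Flushes the directory entry itself (e.g. after creating or renaming a file) with a
  // bounded number of retries; on repeated failure only a warning is logged, since the
  // directory flush is mostly a fail-safe. Works on Linux, where a directory can be
  // opened for reading and fsync'd.
  static void flushDirectory(final Path directory, final int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try (final FileChannel channel = FileChannel.open(directory, StandardOpenOption.READ)) {
        channel.force(true);
        return;
      } catch (final IOException e) {
        if (attempt == maxAttempts) {
          LOG.warn("Failed to flush directory {} after {} attempts", directory, maxAttempts, e);
        }
      }
    }
  }
}
```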

Journal segments

Asynchronous flushing

Asynchronous flushing is a bit of a headache, since the whole point of it is to stay out of the hot path. As such, there is no easy quick fix right now, because we never buffer the data between flushes. I would propose we postpone dealing with this, as asynchronous flush is already not recommended and can lead to issues by itself - so users are already accepting this risk.

I would postpone this for now

Direct flushing

For direct flushing, we have a low hanging fruit: on the passive role, when an append fails to flush, simply retry the complete append request. We can do this inline, or we can return an error to the leader and have it resend the complete request to avoid endless loops.
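A self-contained sketch of that low-hanging fruit, with hypothetical names (the real logic would live in the PassiveRole append path): on a flush failure the follower drops everything after the last flushed index and reports a failure so the leader resends the entries.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical sketch, not the actual PassiveRole/RaftLog API: a flush failure truncates
// back to the last flushed index and returns a failed result, so the leader retries the
// complete append request instead of the partition going inactive.
final class FollowerAppendSketch {

  interface Log {
    void append(List<byte[]> entries);
    void flush() throws IOException;
    void deleteAfter(long index); // drops everything written past the given index
    long lastIndex();
    long lastFlushedIndex();
  }

  record AppendResult(boolean succeeded, long lastIndex) {}

  private final Log log;

  FollowerAppendSketch(final Log log) {
    this.log = log;
  }

  AppendResult onAppend(final List<byte[]> entries) {
    log.append(entries);
    try {
      log.flush();
      return new AppendResult(true, log.lastIndex());
    } catch (final IOException e) {
      // do not trust anything written since the last successful flush
      log.deleteAfter(log.lastFlushedIndex());
      // the leader sees the lower index in the response and resends the missing entries
      return new AppendResult(false, log.lastFlushedIndex());
    }
  }
}
```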

For all other cases we would need to start buffering the appends. @oleschoenburg and I discussed a possible solution which we would like to prototype. It would serve as a first step towards moving us off of memory mapped files, and thus having more control over our I/O behavior, more observability, and better support for a wider range of file systems (incl. network file systems).

For any data that is guaranteed to be flushed, we just keep the current approach - read from the memory-mapped buffer/page cache. But when you append something, we'll write it to the memory-mapped buffer AND to an in-memory buffer we own. If a reader wants to read past what has been flushed, it will always read from the in-memory buffer, not from the page cache. Once data is guaranteed flushed, it's evicted from the in-memory buffer, and readers can read from the page cache.

When flushing fails, we can simply write back what's in memory and flush again.
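A very rough, self-contained sketch of that shadow-buffer idea, using plain byte arrays and illustrative names rather than the real segment types:

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch: every append goes to the mapped segment AND to a buffer we own.
// On flush failure the unflushed writes are rewritten into the mapped buffer, which
// re-dirties the pages, and the flush is attempted again; on success they are evicted.
final class ShadowBufferedSegment {

  private record PendingWrite(int position, byte[] data) {}

  private final MappedByteBuffer mapped;
  private final Deque<PendingWrite> unflushed = new ArrayDeque<>();
  private int flushedPosition;

  ShadowBufferedSegment(final MappedByteBuffer mapped) {
    this.mapped = mapped;
  }

  synchronized void append(final byte[] data) {
    final int position = mapped.position();
    mapped.put(data);
    unflushed.add(new PendingWrite(position, data.clone()));
  }

  // Flushed positions can still be read from the page cache as before; readers past
  // this point must use readUnflushed instead.
  synchronized boolean isFlushed(final int position) {
    return position < flushedPosition;
  }

  synchronized byte[] readUnflushed(final int position) {
    for (final PendingWrite write : unflushed) {
      if (write.position() == position) {
        return write.data();
      }
    }
    throw new IllegalArgumentException("No unflushed write at position " + position);
  }

  synchronized void flush() {
    try {
      mapped.force();
    } catch (final UncheckedIOException e) {
      rewriteUnflushed();
      mapped.force(); // single retry for brevity; real code would bound and back off
    }
    flushedPosition = mapped.position();
    unflushed.clear(); // safe to evict: the data is on disk now
  }

  private void rewriteUnflushed() {
    for (final PendingWrite write : unflushed) {
      final ByteBuffer view = mapped.duplicate();
      view.position(write.position());
      view.put(write.data());
    }
  }
}
```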

There's obviously some complexity involved here:

This is not a quick fix, but it would be good to prototype this as soon as possible to get an idea of value/effort ratio.

I'll create an issue to prototype this

RocksDB

RocksDB also does its own flushing in the background. If we fail to flush snapshot data properly, we could end up with a corrupted snapshot. Flush errors are only logged in the background and thus not very visible; we also can't react to them, and we don't know if RocksDB can recover from fsync failures.

They do have the option of using DirectIO, but I don't know how good that is in terms of performance/memory usage.
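For reference, RocksDB's Java API does expose direct I/O switches; a minimal, untested example of enabling them (whether this is acceptable for Zeebe in terms of performance and memory is exactly the open question):

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Untested sketch: open RocksDB with direct I/O so reads, flushes and compactions bypass
// the OS page cache. The performance/memory impact would have to be measured before
// considering anything like this for Zeebe snapshots.
final class DirectIoRocksDbExample {
  public static void main(final String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    final Options options =
        new Options()
            .setCreateIfMissing(true)
            .setUseDirectReads(true)
            .setUseDirectIoForFlushAndCompaction(true);
    try (final RocksDB db = RocksDB.open(options, "/tmp/direct-io-example")) {
      db.put("key".getBytes(), "value".getBytes());
    } finally {
      options.close(); // keep the options alive for the lifetime of the DB
    }
  }
}
```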

@oleschoenburg proposed raising an issue with the RocksDB team to understand that before doing anything ourselves.

npepinpe commented 10 months ago

I've split this into multiple issues and would propose downgrading this from critical to high.

megglos commented 10 months ago

ZDP-Triage:

@megglos clarify with PM & Max whether we should invest here or push customers to not use NFS for now

Zelldon commented 10 months ago

Regarding RocksDB, this might be interesting: https://groups.google.com/g/rocksdb/c/hZ-H2o2j2CA/m/qfx-nOqSAQAJ

npepinpe commented 10 months ago

@deepthidevaki @oleschoenburg - could we implement minimal fault tolerance for flush errors with the following:

  1. Follower/passive role truncates the log and retries the full append on flush error
  2. Leader truncates the log to commit index and steps down on flush error

In both cases, truncating should invalidate the buffers/page cache. Since we only update the commit index on a successful flush, truncating to the last known commit index should be safe.

This can cause a lot of disruption, but it will at least avoid inconsistencies. WDYT?

I would be happy to implement at least safety (even if performance/availability is terrible) first, and leave performance to the NFS feature request in the future.

deepthidevaki commented 10 months ago

Follower/passive role truncates the log and retries the full append on flush error
Leader truncates the log to commit index and steps down on flush error

This should be safe, and makes sense to do it. But please verify how truncation works with concurrent readers, especially on the leader. On followers it should be fine because it is already expected in the normal execution path.

npepinpe commented 10 months ago

:+1: Then I would push to at least implement this as a critical bug fix, and we can postpone everything else as part of NFS (or more general network file storage) support.

npepinpe commented 10 months ago

@npepinpe Split away the last proposal into a bug fix, decoupled from the NFS feature request.

akeller commented 9 months ago

A docs issue was raised for this topic. Can you add this to the correct Zeebe board so it's not missed?

j-lindner commented 9 months ago

Is this a problem that comes inherently with NFS, or is it due to NFS being (too) slow? If a customer has an NFS volume that delivers our recommended* 1,000 - 3,000 IOPS, would this be okay, or still be an issue? (Or is it maybe impossible for NFS to reach that many IOPS?)

cc: @xevien96

*Recommendations in context of cloud providers:
https://docs.camunda.io/docs/8.2/self-managed/platform-deployment/helm-kubernetes/platforms/google-gke/#volume-performance
https://docs.camunda.io/docs/8.2/self-managed/platform-deployment/helm-kubernetes/platforms/microsoft-aks/#volume-performance
https://docs.camunda.io/docs/8.2/self-managed/platform-deployment/helm-kubernetes/platforms/amazon-eks/#volume-performance

npepinpe commented 9 months ago

It's inherent to NFS. Since any operation has a higher likelihood of failing due to network hiccups, this translates into a higher likelihood of flush errors in the application. It's not directly performance related.

akeller commented 7 months ago

Hi Zeebe team, we had support reach out on the related docs topic - https://github.com/camunda/camunda-docs/issues/2987#issuecomment-1921259157

Is there a quick win here for docs? Can we say NFS is not supported?

lenaschoenburg commented 7 months ago

Yep, I think we can just say exactly that for Zeebe :+1:

npepinpe commented 7 months ago

I will bump the priority on https://github.com/camunda/camunda-docs/issues/2987; I think it's worth documenting in general the reasons not to use a network file system at the moment. But we can do so in a follow-up PR if you're blocked waiting on us.

akeller commented 7 months ago

This docs PR will handle some, but not everything covered in the issue - https://github.com/camunda/camunda-docs/pull/3279

falko commented 1 month ago

@npepinpe does your research have any implications for running Zeebe on XFS? One of your sources in particular talks about XFS and the limitations of ext3/ext4.