Open npepinpe opened 10 months ago
A workaround is to use the delayed flush with a very low delay, as the DelayedFlusher handles flush errors properly, but the direct flusher's errors are completely unhandled.
I spent a bit of time reading up on fsync-gate and related issues:
Also, I found this paper really informative: Can Applications Recover from fsync Failures? (Rebello et al., USENIX ATC '20)
What I gathered so far:
If my understanding is correct, DelayedFlusher is not correct because it simply retries the flush, which might be a no-op since all pages are marked clean after the initial flush failure.
As @npepinpe suggested, we could retry the entire write operation when a flush failed. That should overwrite the unknown, possibly corrupted, contents of the page cache and mark the pages as dirty which should result in another attempt to flush these pages to disk. However, we don't always have the full data that was supposed to be flushed. For example, a leader flushes whenever the commit index is updated. If that flush fails, we can no longer trust the page cache so we'd need to write everything since the last successful flush again. That'd require us to buffer every write in memory, separately from the page cache, which is probably a no-go.
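To illustrate the buffering cost, here is a minimal sketch of the retry-the-whole-write idea. All names (`RetryingAppender`, `Storage`) are hypothetical, not the actual Zeebe journal API; the `unflushed` list is exactly the extra in-memory copy, separate from the page cache, that we would need to keep:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: retry the entire write when a flush fails. */
final class RetryingAppender {
  interface Storage {
    void write(long position, byte[] data) throws IOException;
    void flush() throws IOException; // e.g. MappedByteBuffer.force()
  }

  private final Storage storage;
  // Every write since the last successful flush; this is the in-memory buffer,
  // separate from the page cache, that the comment above calls a likely no-go.
  private final List<byte[]> unflushed = new ArrayList<>();
  private long flushedPosition;

  RetryingAppender(Storage storage) {
    this.storage = storage;
  }

  void append(byte[] entry) throws IOException {
    storage.write(nextPosition(), entry);
    unflushed.add(entry);
  }

  /** After a failed fsync the page cache is suspect: rewrite everything, then flush again. */
  void flush(int maxAttempts) throws IOException {
    for (int attempt = 1; ; attempt++) {
      try {
        storage.flush();
        flushedPosition = nextPosition();
        unflushed.clear(); // data is durable, drop our copy
        return;
      } catch (IOException e) {
        if (attempt >= maxAttempts) {
          throw e;
        }
        // Rewriting dirties the pages again, so the next fsync retries them.
        long position = flushedPosition;
        for (byte[] entry : unflushed) {
          storage.write(position, entry);
          position += entry.length;
        }
      }
    }
  }

  private long nextPosition() {
    long position = flushedPosition;
    for (byte[] entry : unflushed) {
      position += entry.length;
    }
    return position;
  }
}
```

Note that `unflushed` grows with every write between flushes, which is exactly why this is workable for a follower flushing per append but not for a leader flushing only on commit index updates.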
> Crashing is not a valid recovery strategy - the page cache contains unknown data
I think crashing may be a valid recovery in most cases for us, specifically a standard k8s deployment via stateful set. As long as the volume is unmounted/mounted, even rescheduling the pod to the same node should be fine as the page cache will be cleared. Crashing is a problem though if the application restarts on the same node (which was not restarted) and real (i.e. not network) disks are used.
This particular case where a follower tried to append entries and then failed to flush might be solvable by simply retrying the write (or doing nothing and waiting for a new append request). That won't work for a leader though - unless we start flushing for every append too.
> I think crashing may be a valid recovery in most cases for us, specifically a standard k8s deployment via stateful set. As long as the volume is unmounted/mounted, even rescheduling the pod to the same node should be fine as the page cache will be cleared. Crashing is a problem though if the application restarts on the same node (which was not restarted) and real (i.e. not network) disks are used.
A simple crash will not create a new pod though. Are you sure that the volume will be unmounted in that case?
Ah, that's true. If it's not a new pod then no, I don't think the volume is unmounted :(
I would propose we handle it where possible - e.g. the follower as you mentioned - and crash everywhere else.
This behavior should be configurable however, as it would make Zeebe completely unusable on network storage.
I was hoping with newer kernels the page cache issue was not there anymore, but I guess that's not the case :( I'm not sure how we could verify this.
At any rate, we need a better solution here than the partition going inactive :smile: Let's brainstorm something
Maybe leaders that fail to flush should reset their log to the previously flushed position and then step down? This might require some adjustments on the raft side but sounds doable.
We also need to keep the delayed flush in mind - if we want the same durability with delayed flush we can't consider unflushed data as committed.
@oleschoenburg and I discussed it today, and we highlighted the following issues related to flushing:
We currently flush the meta store asynchronously, and as such, the operation is not necessarily retryable. To make it so, we need to keep the complete meta file contents (at most a few hundred bytes) in memory, and simply write them out and flush again on retry. We then need to update all call sites of the flush to handle retries.
I'll create an issue to handle this low hanging fruit
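As a rough sketch of that low-hanging fruit (names here are hypothetical; the real meta store is more involved): since the full contents fit in memory, every retry rewrites the whole file before forcing it, instead of fsync-ing possibly-clean pages:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Hypothetical sketch: keep the complete meta file contents in memory so a
 * failed flush can be retried by rewriting and forcing the whole file again.
 */
final class RetryableMetaStore {
  private final Path file;
  private byte[] lastContents = new byte[0]; // at most a few hundred bytes

  RetryableMetaStore(Path file) {
    this.file = file;
  }

  void store(byte[] contents, int maxAttempts) throws IOException {
    lastContents = contents.clone();
    IOException lastError = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try (FileChannel channel =
          FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
        channel.truncate(lastContents.length);
        channel.write(ByteBuffer.wrap(lastContents), 0);
        // Rewrite + fsync on every attempt; never fsync alone, as the pages
        // may already be marked clean after a previous failure.
        channel.force(true);
        return;
      } catch (IOException e) {
        lastError = e;
      }
    }
    throw lastError;
  }
}
```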
We also flush the snapshot checksum files. This is not retried right now, but our snapshot operations are resilient - on failure, the complete snapshot or replication will be retried. It's not the most efficient, but it doesn't cause any issues right now.
In some instances we flush directories. I think these can also be safely retried in most cases, but it would also be acceptable to simply log a warning, as in most cases this is simply a fail safe.
I'll create an issue to deal with this low hanging fruit
Asynchronous flushing is a bit of a headache right now, since the whole point of it is to not be in the hot path. As such, there is no easy quick fix right now, since we never buffer the data in between flushes. I would propose we postpone dealing with this, as asynchronous flush is already not recommended and can lead to issues by itself - so users are already accepting this risk.
I would postpone this for now
For direct flushing, we have a low-hanging fruit: in the passive role, when an append fails to flush, simply retry the complete append request. We can do this inline, or we can return an error to the leader and have it resend the complete request, to avoid endless retry loops.
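A sketch of how that could look in the passive role (hypothetical interfaces, not the actual `PassiveRole` code); it relies on Raft appends at a fixed index being idempotent, so re-appending the same entries after a failed flush is safe:

```java
import java.io.IOException;

/** Hypothetical sketch of retrying the complete append on flush error. */
final class PassiveAppendHandler {
  /** Made-up log interface; in Raft, appending at a fixed index is idempotent. */
  interface Log {
    void append(byte[][] entries) throws IOException;
    void flush() throws IOException;
  }

  private final Log log;

  PassiveAppendHandler(Log log) {
    this.log = log;
  }

  /**
   * Returns true on success; false asks the leader to resend the same request,
   * which bounds the local retries and avoids an endless loop.
   */
  boolean handleAppend(byte[][] entries) {
    for (int attempt = 0; attempt < 2; attempt++) {
      try {
        log.append(entries); // rewriting dirties the pages again after a failed fsync
        log.flush();
        return true;
      } catch (IOException e) {
        // retry the *complete* append, not just the flush
      }
    }
    return false;
  }
}
```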
For all other cases we would need to start buffering the appends. @oleschoenburg and I discussed a possible solution which we would like to prototype. It would serve as a first step towards moving us off of memory mapped files, and thus having more control over our I/O behavior, more observability, and better support for a wider range of file systems (incl. network file systems).
For any data that is guaranteed flushed, we keep the current approach - read from the memory-mapped buffer/page cache. But when you append something, we write it to the memory-mapped buffer AND to an in-memory buffer we own. If a reader wants to read past what has been flushed, it always reads from the in-memory buffer, never from the page cache. Once data is guaranteed flushed, it's evicted from the in-memory buffer, and readers read from the page cache again.
When flushing fails, we can simply write back what's in memory and flush again.
There's obviously some complexity involved here:
This is not a quick fix, but it would be good to prototype this as soon as possible to get an idea of value/effort ratio.
I'll create an issue to prototype this
RocksDB also does its own flushing in the background. If we fail to flush snapshot data properly, we could end up with a corrupted snapshot. Flushing errors are logged in the background and thus not very visible; we also can't react to them, and we don't know if RocksDB can recover from fsync failures.
They do have the option of using DirectIO, but I don't know how good that is in terms of performance/memory usage.
@oleschoenburg proposed raising an issue with the RocksDB team to understand that before doing anything ourselves.
I've split this into multiple issues, and would propose downgrading this from critical to high.
ZDP-Triage:
@megglos clarify with PM & Max whether we should invest here or push customers to not use NFS for now
Regarding RocksDB, this may be interesting: https://groups.google.com/g/rocksdb/c/hZ-H2o2j2CA/m/qfx-nOqSAQAJ
@deepthidevaki @oleschoenburg - could we implement minimal fault tolerance to flush errors with the following:

- The follower/passive role truncates the log and retries the full append on flush error
- The leader truncates the log to the commit index and steps down on flush error
In both cases, truncating should invalidate the buffers/page cache. Since we only update the commit index on successful flush, then truncating to the last known commit index should be safe.
This can cause a lot of disruption, but it will at least avoid inconsistencies. WDYT?
I would be happy to implement at least safety (even if performance/availability is terrible) first, and leave performance to the NFS feature request in the future.
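A sketch of the proposed handling, with made-up names; the key invariant is that the commit index only advances after a successful flush, so truncating back to it discards exactly the entries whose on-disk state is unknown:

```java
import java.io.IOException;

/**
 * Hypothetical sketch of the proposal above. The commit index is only updated
 * after a successful flush, so truncating back to it is safe.
 */
final class FlushErrorHandler {
  interface RaftLog {
    void truncate(long index) throws IOException; // must also invalidate buffers/page cache
  }

  enum Role { FOLLOWER, LEADER }
  enum Action { RETRY_APPEND, STEP_DOWN }

  private final RaftLog log;
  private final long commitIndex; // last index known to have reached the disk

  FlushErrorHandler(RaftLog log, long commitIndex) {
    this.log = log;
    this.commitIndex = commitIndex;
  }

  Action onFlushError(Role role) throws IOException {
    // Roll back to the last state we know reached the disk.
    log.truncate(commitIndex);
    // A follower can simply retry (or wait for) the append from the leader;
    // a leader must step down so a new leader can re-replicate the entries.
    return role == Role.FOLLOWER ? Action.RETRY_APPEND : Action.STEP_DOWN;
  }
}
```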
> - Follower/passive role truncates the log and retries the full append on flush error
> - Leader truncates the log to commit index and steps down on flush error
This should be safe, and makes sense to do it. But please verify how truncation works with concurrent readers, especially on the leader. On followers it should be fine because it is already expected in the normal execution path.
:+1: Then I would push to at least implement this as a critical bug fix, and we can postpone everything else as part of NFS (or more general network file storage) support.
@npepinpe Split away the last proposal into a bug fix, decoupled from the NFS feature request.
A docs issue was raised for this topic. Can you add this to the correct Zeebe board so it's not missed?
Is this a problem that comes inherently with NFS, or one due to NFS being (too) slow? That is, if a customer has an NFS volume that delivers our recommended* 1,000 - 3,000 IOPS, would this be okay, or still an issue? (Or is it maybe impossible for NFS to reach that many IOPS?)
cc: @xevien96
*Recommendations in the context of cloud providers:
- https://docs.camunda.io/docs/8.2/self-managed/platform-deployment/helm-kubernetes/platforms/google-gke/#volume-performance
- https://docs.camunda.io/docs/8.2/self-managed/platform-deployment/helm-kubernetes/platforms/microsoft-aks/#volume-performance
- https://docs.camunda.io/docs/8.2/self-managed/platform-deployment/helm-kubernetes/platforms/amazon-eks/#volume-performance
It's inherent to NFS. Since any operation has a higher likelihood of failing due to network hiccups, this translates to higher likelihood of flush errors in the application. It's not directly performance related.
Hi Zeebe team, we had support reach out on the related docs topic - https://github.com/camunda/camunda-docs/issues/2987#issuecomment-1921259157
Is there a quick win here for docs? Can we say NFS is not supported?
Yep, I think we can just say exactly that for Zeebe :+1:
I will bump the priority on https://github.com/camunda/camunda-docs/issues/2987, I think it's worth documenting in general the reasons not to use network file system at the moment. But we can do so in a follow up PR after if you're blocked waiting on us.
This docs PR will handle some, but not everything covered in the issue - https://github.com/camunda/camunda-docs/pull/3279
@npepinpe does your research have any implications for running Zeebe on XFS? One of your sources in particular discusses XFS in addition to the limitations of ext3/ext4.
Describe the bug
It seems that errors occurring during flush in Raft are currently uncaught, and end up causing the partition to go inactive. This mostly occurs when running Zeebe with an NFS file system.
This occurred in a support issue: https://jira.camunda.com/browse/SUPPORT-18929
Expected behavior
Flush errors are appropriately retried by retrying the entire append.
Log/Stacktrace
Full Stacktrace
```
2023-10-20 16:58:48.185 [] [raft-server-0-raft-partition-partition-1] [] ERROR io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-1} - An uncaught exception occurred, transition to inactive role
java.io.UncheckedIOException: java.io.IOException: Resource temporarily unavailable
	at java.nio.MappedMemoryUtils.force(Unknown Source) ~[?:?]
	at java.nio.Buffer$1.force(Unknown Source) ~[?:?]
	at jdk.internal.misc.ScopedMemoryAccess.forceInternal(Unknown Source) ~[?:?]
	at jdk.internal.misc.ScopedMemoryAccess.force(Unknown Source) ~[?:?]
	at java.nio.MappedByteBuffer.force(Unknown Source) ~[?:?]
	at java.nio.MappedByteBuffer.force(Unknown Source) ~[?:?]
	at io.camunda.zeebe.journal.file.Segment.flush(Segment.java:125) ~[zeebe-journal-8.2.12.jar:8.2.12]
	at io.camunda.zeebe.journal.file.SegmentsFlusher.flush(SegmentsFlusher.java:58) ~[zeebe-journal-8.2.12.jar:8.2.12]
	at io.camunda.zeebe.journal.file.SegmentedJournalWriter.flush(SegmentedJournalWriter.java:113) ~[zeebe-journal-8.2.12.jar:8.2.12]
	at io.camunda.zeebe.journal.file.SegmentedJournal.flush(SegmentedJournal.java:159) ~[zeebe-journal-8.2.12.jar:8.2.12]
	at io.atomix.raft.storage.log.RaftLogFlusher$DirectFlusher.flush(RaftLogFlusher.java:73) ~[zeebe-atomix-cluster-8.2.12.jar:8.2.12]
	at io.atomix.raft.storage.log.RaftLog.flush(RaftLog.java:184) ~[zeebe-atomix-cluster-8.2.12.jar:8.2.12]
	at io.atomix.raft.roles.PassiveRole.flush(PassiveRole.java:543) ~[zeebe-atomix-cluster-8.2.12.jar:8.2.12]
	at io.atomix.raft.roles.PassiveRole.appendEntries(PassiveRole.java:535) ~[zeebe-atomix-cluster-8.2.12.jar:8.2.12]
	at io.atomix.raft.roles.PassiveRole.handleAppend(PassiveRole.java:367) ~[zeebe-atomix-cluster-8.2.12.jar:8.2.12]
	at io.atomix.raft.roles.ActiveRole.onAppend(ActiveRole.java:50) ~[zeebe-atomix-cluster-8.2.12.jar:8.2.12]
	at io.atomix.raft.roles.FollowerRole.onAppend(FollowerRole.java:187) ~[zeebe-atomix-cluster-8.2.12.jar:8.2.12]
	at io.atomix.raft.impl.RaftContext.lambda$registerHandlers$14(RaftContext.java:293) ~[zeebe-atomix-cluster-8.2.12.jar:8.2.12]
	at io.atomix.raft.impl.RaftContext.lambda$runOnContext$21(RaftContext.java:304) ~[zeebe-atomix-cluster-8.2.12.jar:8.2.12]
	at io.atomix.utils.concurrent.SingleThreadContext$WrappedRunnable.run(SingleThreadContext.java:178) [zeebe-atomix-utils-8.2.12.jar:8.2.12]
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
	at java.util.concurrent.FutureTask.run(Unknown Source) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
	at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: java.io.IOException: Resource temporarily unavailable
	at java.nio.MappedMemoryUtils.force0(Native Method) ~[?:?]
	... 26 more
```
Environment: