Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0

Raft journal system pre-apply statemachine may cause inconsistency #14806

Open adol001 opened 2 years ago

adol001 commented 2 years ago

Alluxio Version: 2.7.2

Describe the bug: A file is deleted and the entry is pre-applied to the state machine, but the log entry is then rejected by a majority of nodes in the high-availability cluster (a new leader is elected). Should this file have been deleted?

In the comments of RaftJournalSystem, it is mentioned that

But how can the applied state changes be undone? For example, changes in the UFS?

LuQQiu commented 2 years ago

@jenoudet @yuzhu PTAL, thanks

adol001 commented 2 years ago

In the comments of RaftJournalSystem, it is mentioned that

A network partition may cause multiple leaders. If the old leader is continuously unable to connect with the new leader and the new leader's followers in the cluster, the old leader still believes it is the leader. In that case, will a client connected to the old leader and a client connected to the new leader both complete their writes, because of pre-apply?

jenoudet commented 2 years ago

But how can the applied state changes be undone? For example, changes in the UFS?

The Raft algorithm has built in mechanisms to revert changes on standby masters that have diverged from the quorum. See section 5.3 of the Raft paper.
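For reference, the repair rule in section 5.3 works roughly like this: when a diverged follower receives entries from the leader and finds an existing entry with the same index but a different term, it deletes that entry and everything after it, then appends the leader's entries. A minimal illustrative sketch (not Alluxio or Apache Ratis code; the log is modeled as a plain list of term numbers):

import java.util.ArrayList;
import java.util.List;

// Sketch of the Raft log-repair rule (section 5.3 of the Raft paper): a follower
// that diverged from the quorum truncates conflicting entries before appending
// the leader's entries. The log is modeled as a list of term numbers indexed by
// log position, which is a simplification.
public class RaftLogRepairSketch {
  static void appendEntries(List<Long> localTerms, int prevLogIndex, List<Long> leaderTerms) {
    for (int i = 0; i < leaderTerms.size(); i++) {
      int index = prevLogIndex + 1 + i;
      if (index < localTerms.size() && !localTerms.get(index).equals(leaderTerms.get(i))) {
        // Conflict: delete the existing entry and all entries that follow it.
        localTerms.subList(index, localTerms.size()).clear();
      }
      if (index >= localTerms.size()) {
        localTerms.add(leaderTerms.get(i));
      }
    }
  }

  public static void main(String[] args) {
    List<Long> follower = new ArrayList<>(List.of(1L, 1L, 2L, 2L)); // diverged from index 2 on
    appendEntries(follower, 1, List.of(3L, 3L));                    // leader's entries for indexes 2 and 3
    System.out.println(follower);                                   // prints [1, 1, 3, 3]
  }
}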

A network partition may cause multiple leaders. If the old leader is continuously unable to connect with the new leader and the new leader's followers in the cluster, the old leader still believes it is the leader. In that case, will a client connected to the old leader and a client connected to the new leader both complete their writes, because of pre-apply?

We recommend using an odd number of masters when running HA to mitigate some partition problems. With an odd number of masters you cannot get an even 50/50 partition of the network. This ensures the old master will lose the majority heartbeats in the quorum and quickly transition to standby state. The old leader does not need to connect to a new leader to step down, losing the majority heartbeats is enough for it to step down on its own. A single leading master will emerge that the client will connect to.
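To make the majority argument concrete, here is the arithmetic only (not Alluxio code): with 3 masters, any partition is at worst 2 vs 1, so exactly one side can still reach the majority of 2, while the isolated old leader misses majority heartbeats and steps down.

// Majority size for an HA quorum. With an odd cluster size there is never a
// split where both sides reach a majority, so at most one side can keep a leader.
public class QuorumMath {
  static int majority(int clusterSize) {
    return clusterSize / 2 + 1;
  }

  public static void main(String[] args) {
    System.out.println(majority(3)); // 2: a 2/1 split leaves exactly one functioning quorum
    System.out.println(majority(4)); // 3: a 2/2 split leaves no side with a quorum
  }
}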

P.S.

the previous primary will go through at least two election cycles

Some of these comments are out of date. The waiting logic was updated in #14612 but the comments were not updated alongside it. Thank you for pointing that out, I will fix them now.

adol001 commented 2 years ago
"Safety: the key safety property for Raft is the State
Machine Safety Property in Figure 3: if any server
has applied a particular log entry to its state machine,
then no other server may apply a different command
for the same log index. "

Yes, pre-apply violates the above principle.

The normal practice is to commit the log first and then apply it to the state machine. But in Alluxio it becomes apply first and then commit. For example:

  public void deleteInode(RpcContext rpcContext, LockedInodePath inodePath, long opTimeMs)
      throws FileDoesNotExistException {
    Preconditions.checkState(inodePath.getLockPattern() == LockPattern.WRITE_EDGE);
    Inode inode = inodePath.getInode();
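    // Pre-apply: the delete is applied to the in-memory state machine here; the
    // journal entry is only committed to the Raft quorum later, when the journal
    // context is flushed.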

    mState.applyAndJournal(rpcContext, DeleteFileEntry.newBuilder()
        .setId(inode.getId())
        .setRecursive(false)
        .setOpTimeMs(opTimeMs)
        .setPath(inodePath.getUri().getPath())
        .build());

    if (inode.isFile()) {
      rpcContext.getBlockDeletionContext().registerBlocksForDeletion(inode.asFile().getBlockIds());
    }
  }

mState.applyAndJournal

(Flow chart)

If there is an error in the AsyncJournalWriter or the journal system, the state machine change has been applied but the log is not committed (because the JVM crashes or the HA cluster elects a new master), and there will be inconsistencies.
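A simplified sketch of the ordering that creates this window (the interfaces and method names below are hypothetical, not the actual Alluxio classes):

// Simplified sketch of the pre-apply ordering discussed above. The interfaces are
// hypothetical; only the order of the effects and the crash window matter here.
public class PreApplySketch {
  interface Ufs { void delete(String path); }
  interface InodeTree { void applyDelete(String path); }
  interface Journal { void appendAsync(String entry); void flush(); }

  static void deleteWithPreApply(Ufs ufs, InodeTree tree, Journal journal, String path) {
    ufs.delete(path);                      // 1. side effect on persistent under storage
    tree.applyDelete(path);                // 2. pre-apply: in-memory state machine updated
    journal.appendAsync("delete " + path); // 3. entry queued for the Raft log
    // If the JVM crashes or a new leader is elected before the entry is committed
    // by a quorum, steps 1 and 2 already happened but the committed log has no
    // record of them: replaying the journal resurrects the metadata while the
    // UFS data is already gone.
    journal.flush();                       // 4. commit: waits for quorum acknowledgement
  }
}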

Steps to reproduce (the Alluxio write type is CACHE_THROUGH):

In doFlush() of AsyncJournalWriter, insert the following before mJournalWriter.write(entry):


// Block the journal flush when a delete_file entry is about to be written, to
// simulate a crash between pre-apply and commit.
String cc = entry.toString();
LOG.info("aaaaaaaaaaaaaaaaaaaa" + cc);
if (cc.contains("delete_file")) {
  try {
    LOG.info("eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee2");
    Thread.sleep(1000 * 100000);
  } catch (Exception e) {
    // ignore
  }
}
> Then compile the code, like:

mvn clean install -Phadoop-2 -Dhadoop.version=2.7.3 -e -Dmaven.test.skip=true -Dmaven.javadoc.skip -Dlicense.skip -pl 'core/server/master' -pl 'core/server/common' -pl 'assembly/server' -Dcheckstyle.skip

> Start the Alluxio service, then create a file and delete it, like:

FileSystem fs = FileSystem.Factory.get();
AlluxioURI path = new AlluxioURI("/h11/yyds21");
CreateFilePOptions options = CreateFilePOptions.newBuilder()
  .setRecursive(true)
  //.setWriteType(WritePType.ASYNC_THROUGH)
  .build();
FileOutStream outStream = fs.createFile(path, options);
outStream.write("right12345".getBytes(StandardCharsets.UTF_8));
outStream.close();

Thread.sleep(1000 * 10);
fs.delete(path);
You will see /h11/yyds21 appear and then disappear; the delete operation is blocked by the forced journal flush at the end of the service action.
> Now stop the Alluxio services; this simulates the JVM crash that occurs while AsyncJournalWriter is flushing.
The delete operation from the client will fail:

[alluxio.exception.status.UnavailableException: Failed to connect to master (10.91.128.211:19998) after 45 attempts.Please check if Alluxio master is currently running on "10.91.128.211:19998". Service="FileSystemMasterClient"] in 159355 ms (>=10000ms)

> Then start the Alluxio services. You will see /h11/yyds21 appear again. Free its cache:

alluxio fs free /h11/yyds21


Now the metadata of /h11/yyds21 exists, but the data in the UFS (the local filesystem in this example) is lost, and the loss is irrecoverable.

Whether the journal system fails or the async journal writer fails (#14813), in other words, whenever the state machine has been modified but the log has not been committed, pre-apply may cause data inconsistency in the UFS.
@jenoudet @LuQQiu 
qian0817 commented 2 years ago

I think it's the same problem as #14626. But do you have any possible ideas to solve this problem? @adol001

yuzhu commented 2 years ago

@ggezer FYI

yuzhu commented 2 years ago

@adol001 thanks for the detailed report. We used AsyncJournalWriter to write batched journal entries because an HDFS-based journal would have unbearable performance if entries were written synchronously. For Raft-based implementations that requirement might not hold anymore, and we may be sacrificing safety unnecessarily here.
Do you think a synchronous journal writer is an option if some users prefer the guarantees it offers?

@ggezer @jenoudet any thoughts on this?

adol001 commented 2 years ago

The core of the problem is that applying the log to the state machine precedes committing the log (pre-apply).

This will cause: 1 The client to get inconsistent results after the master is restarted or a new master is created. For example, /h11/yyds21 is deleted in statemachine, but the delete log is not committed successfully. In this time window, the client requests /h11/yyds21, and /h11/yyds21 does not exist. If a restart occurs later or a new master is selected, then /h11/yyds21 exists. Such results are problematic.

2. In the case of Alluxio, the state machine mainly includes the inode tree and the UFS: the inode tree is responsible for metadata and the UFS is responsible for remote data. The UFS is persistent storage. If you use pre-apply, you need to consider how to undo operations on the UFS. @qian0817

Pre-apply is relevant to any journal system; that is, the HDFS + ZooKeeper journal system should also encounter this problem.

We hope Alluxio provides a new option: apply the log only after it is committed, instead of pre-apply. @yuzhu
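For concreteness, a minimal sketch of what such a commit-then-apply ordering could look like (hypothetical interfaces, not a proposal for the actual Alluxio API):

// Commit-then-apply ordering: the entry is durable in the Raft log (acknowledged
// by a quorum) before the state machine or the UFS is touched. Hypothetical
// interfaces, only intended to contrast with the pre-apply order shown earlier.
public class CommitThenApplySketch {
  interface Journal { void append(String entry); void commit() throws Exception; }
  interface InodeTree { void applyDelete(String path); }
  interface Ufs { void delete(String path); }

  static void deleteCommitFirst(Journal journal, InodeTree tree, Ufs ufs, String path)
      throws Exception {
    journal.append("delete " + path);
    journal.commit();        // 1. wait for quorum acknowledgement; on failure, stop here
    tree.applyDelete(path);  // 2. apply to the state machine only after the commit
    ufs.delete(path);        // 3. persistent side effect last, so it can be retried on replay
  }
}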

jenoudet commented 2 years ago

A synchronous journal writer has some pros and cons. Here are some that spring to mind. Pros:

Cons:

adol001 commented 2 years ago

Code from HDFS:

  boolean delete(String src, boolean recursive, boolean logRetryCache)
      throws IOException {
    final String operationName = "delete";
    BlocksMapUpdateInfo toRemovedBlocks = null;
    checkOperation(OperationCategory.WRITE);
    final FSPermissionChecker pc = getPermissionChecker();
    FSPermissionChecker.setOperationType(operationName);
    boolean ret = false;
    try {
      writeLock();
      try {
        checkOperation(OperationCategory.WRITE);
        checkNameNodeSafeMode("Cannot delete " + src);
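        // Step 1: remove from the namespace (inode tree) and buffer the edit log entry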
        toRemovedBlocks = FSDirDeleteOp.delete(
            this, pc, src, recursive, logRetryCache);
        ret = toRemovedBlocks != null;
      } finally {
        writeUnlock(operationName);
      }
    } catch (AccessControlException e) {
      logAuditEvent(false, operationName, src);
      throw e;
    }
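    // Step 2: sync the edit log so the delete is durably logged before any blocks are removed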
    getEditLog().logSync();
    logAuditEvent(ret, operationName, src);
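    // Step 3: only now physically remove the blocks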
    if (toRemovedBlocks != null) {
      removeBlocks(toRemovedBlocks); // Incremental deletion of blocks
    }
    return ret;
  }

This can be roughly divided into three steps:

  1. delete in inode tree
  2. log
  3. delete blocks

Because the blocks are deleted at the end, even if the NameNode fails to commit the deletion log, the state can eventually be recovered.

1. The client gets inconsistent results after the master is restarted or a new master is elected.
For example, /h11/yyds21 is deleted in the state machine, but the delete log is not committed successfully. In this time window, a client that requests /h11/yyds21 finds that it does not exist. If a restart occurs later or a new master is elected, then /h11/yyds21 exists again. Such results are problematic.

Because of the widespread use of HDFS, I think customers should be able to accept case 1.

The current Alluxio delete steps are:

  1. delete in ufs
  2. delete in inode tree
  3. log

In order to avoid UFS inconsistency, the only change needed is to switch from pre-apply to commit-then-apply for the UFS, while keeping pre-apply for everything else.

The new Alluxio delete steps could be (sketched in code after the list):

  1. operate in inode tree
  2. log
  3. operate in ufs
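A rough sketch of this reordering for delete (hypothetical interfaces, not the actual Alluxio code):

// Sketch of the proposed ordering: keep pre-apply for the inode tree, but make
// the UFS side effect commit-then-apply, so a lost journal entry can at worst
// resurrect metadata while the UFS data stays intact and recoverable.
public class UfsCommitThenApplySketch {
  interface InodeTree { void applyDelete(String path); }
  interface Journal { void appendAndCommit(String entry) throws Exception; }
  interface Ufs { void delete(String path); }

  static void delete(InodeTree tree, Journal journal, Ufs ufs, String path) throws Exception {
    tree.applyDelete(path);                    // 1. operate in the inode tree (still pre-applied)
    journal.appendAndCommit("delete " + path); // 2. commit the log; if this fails, the UFS is untouched
    ufs.delete(path);                          // 3. operate in the UFS only after the commit
  }
}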

@yuzhu @jenoudet

yuzhu commented 2 years ago

@adol001 Thanks for the detailed explanation.

Because of the widespread use of HDFS, I think customers should be able to accept case 1.

The current Alluxio delete steps are:

  1. delete in ufs
  2. delete in inode tree
  3. log

In order to avoid UFS inconsistency, the only change needed is to switch from pre-apply to commit-then-apply for the UFS, while keeping pre-apply for everything else.

The new Alluxio delete steps could be:

  1. operate in inode tree
  2. log
  3. operate in ufs

The new order has a similar issue.
You mentioned that

1. The client gets inconsistent results after the master is restarted or a new master is elected. For example, /h11/yyds21 is deleted in the state machine, but the delete log is not committed successfully. In this time window, a client that requests /h11/yyds21 finds that it does not exist. If a restart occurs later or a new master is elected, then /h11/yyds21 exists again. Such results are problematic.

But this situation could just as easily happen in the new order. As soon as step 1 is complete, other clients can observe the file deletion; steps 2 and 3 can fail and the file will reappear again. What is perhaps better in this case is that, since we still have the file in the UFS, we can still read it.

However, if a failure happens during step 3, then after a restart we will not have the file in the Alluxio space, but a re-sync with the UFS will make the file appear again. Alluxio and the UFS are out of sync again even though operations were done only through Alluxio. This order is actually quite similar to today's async-persist event order, so I do see value in making it consistent across all operations.

Fundamentally, because an operation in the Alluxio namespace and the corresponding operation in the UFS do not have transactional semantics around them, we will never have a guarantee that the UFS is consistent with the Alluxio space, even if all operations happen through Alluxio. This is a topic that will see more discussion in the coming weeks, because stronger consistency can be prohibitively expensive.

I think it is a great topic for a technical blog to let the community understand the semantics and tradeoffs we are making. @apc999

adol001 commented 2 years ago

If all operations go through Alluxio, the metadata should be based on Alluxio. It should not load the missing metadata from the UFS when Alluxio restarts or re-elects the master. Synchronization with the UFS should only happen at mount time (when all operations go through Alluxio), or afterwards when the user manually synchronizes and handles or accepts possible conflicts.

If all operations go through Alluxio, the user will not accept unrecoverable UFS inconsistencies.

ggezer commented 2 years ago

@adol001

I think the HDFS code can't be directly translated to Alluxio, because in Alluxio's architecture there are two levels of storage, namely the Alluxio cache and the UFS.

For the failure scenario you brought up, we simply can't make an atomic write across two storage systems. Under failures, inconsistencies are inevitable; we are just making decisions to ensure those inconsistencies are safe. IMO, having ghost metadata in Alluxio is acceptable. As for block-deletion order, block deletion can always be interrupted by another failure and we could end up with an unrecoverable state.

I believe the only decision that could be discussed is the order of executing the operation against the UFS. I don't think there is a best answer to this question, but we can lead the discussion in that direction if you believe one way is better than the other.

adol001 commented 2 years ago

@ggezer

Inconsistency is inevitable between two separate but coupled systems, but the data in them needs to stay recoverable: inconsistent but recoverable.

"Having ghost metadata in Alluxio is acceptable": this may make big-data jobs fail. We can't ask customers to add redundant logic for this.

What do you think about changing the order of executing the operation against the UFS?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

QueenOfBees commented 5 months ago

Any subsequent discussion or plan for this issue?

BTW, I noticed that Doris also uses pre-apply; it is eventually consistent, as they claim (https://doris.apache.org/zh-CN/community/design/metadata-design/).

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

ilixiaocui commented 3 weeks ago

For the failure scenario you brought up, we simply can't make an atomic write across two storage systems. Under failures, inconsistencies are inevitable; we are just making decisions to ensure those inconsistencies are safe. IMO, having ghost metadata in Alluxio is acceptable. As for block-deletion order, block deletion can always be interrupted by another failure and we could end up with an unrecoverable state.

But this current approach, in addition to causing inconsistencies between Alluxio and UFS, can also lead to inconsistencies among the Raft replica groups within Alluxio itself.

Do we need to ensure consistency among multiple replicas of the Master?