Integrate Apache Ratis to help manage Raft status

apache / iotdb

Apache IoTDB

https://iotdb.apache.org/

Apache License 2.0

5.6k stars 1.02k forks source link

Integrate Apache Ratis to help manage Raft status #3954

Open chengjianyun opened 3 years ago

chengjianyun commented 3 years ago

We have some discussions about the integration of Apache Ratis in RP: https://github.com/apache/iotdb/pull/3939.

Create a new issue here to let more people join.

https://github.com/apache/iotdb/pull/3939#issuecomment-917955815

https://github.com/apache/iotdb/pull/3939#issuecomment-917961701

https://github.com/apache/iotdb/pull/3939#issuecomment-917983405

Update the reasons why I'd like to import a 3rd party Raft library.

First and the important one, raft implementation in IoTDB can't guarantee the correctness (I mean consensus part, even only for write). We have found and fixed some bugs about Raft and we are not sure the end of this. Just like @LebronAl said, the devil lies in the corner cases. It's very hard for us to construct a systematic UT/IT/Manual tests to ensure the safe properties of Raft. As for performance affect, my understanding is even the fastest vehicle should be saft. And raft is only a small part of system, I think if we can use the 3rd party library properly, performance loss should be very small.
Second, current implementation of Raft mixed together with application logic which is hard to maintain and understand.
Last but not least, apply 3rd party library could release us from detail of Raft and focus on core engine improvement, building IoTDB ecosystem, etc.

chengjianyun commented 3 years ago

Reply https://github.com/apache/iotdb/pull/3939#issuecomment-917983405, @LebronAl

Production-level raft algorithms generally guarantee linearizability, which is the strictest consistency in distributed systems. Our current Raft algorithm may not be complete, so strictly speaking it does not guarantee linearizability, but at the same time it's performance may be better. What would we say if we migrated raft implementation and performance dropped? Furthermore, is linearizability really necessary for OLAP databases like time-series databases?

What do you mean by "strictly speaking it does not guarantee linearizability"? In which case, IoTDB doesn't guarantee the linearizability? In my opinion, it doesn't matter IoTDB or other products, if you are using Raft, that means you need the linearizability which is what raft provided. If IoTDB doesn't guarantee the linearizability, I think it's a bug. Right now, I didn't find any of the case.

I agree that we can simply some aspects of Raft implementation but which won't affect performance much. The main performance improvement relays on the multi-raft support.

At the moment we have Raft implementation and business logic mixed together, but there is a certain amount of performance improvement. For example, we currently put data from multiple storage groups into one Raft group to be executed synchronously. In fact, at the bottom, they can be executed in parallel. So we implemented parallel asynchronous apply optimization so that plans from the same Raft group but different storage groups can be applied in parallel. The performance gains from this optimization are significant. How should we handle this case after we migrate raft implementation? Add parallel apply feature to raft library Or do we change our partitioning pattern so that a Raft group serves only one storage group? Or just let them execute inefficiently?

"For example, we currently put data from multiple storage groups into one Raft group to be executed synchronously." Could you explain this one more detail?

Add parallel apply feature to raft library Or do we change our partitioning pattern so that a Raft group serves only one storage group?

Application(Raft user) takes care of log apply, so it doesn't matter how Raft implemented.

Finally, in my opinion, current high throughput mainly gains from multi-raft support. So it won't affect performance too much after integrate Ratis according to my understanding.

neuyilan commented 3 years ago

is linearizability really necessary for OLAP databases like time-series databases?

As I see the raft is one consensus algorithm, which may do not have many relations with the linearizability, just put the post[1] for reference.

we currently put data from multiple storage groups into one Raft group to be executed synchronously.

As far as I know, the apply function is user-defined, we can still implement a parallel apply function according to different storage groups in the one raft log.

What is certain is that using a raft library will definitely limit our optimization work compared with the current implementation (mixing the raft framework and business logic), but I think the availability and correctness are far greater than the performance for now. [1] https://zhuanlan.zhihu.com/p/47117804

OneSizeFitsQuorum commented 3 years ago

As I see the raft is one consensus algorithm, which may do not have many relations with the linearizability, just put the post[1] for reference.

In academic terms, whether it is Raft, Multi-Paxos or Zab, these consensus algorithms exist only for multi-node consensus, and have nothing to do with consistency or storage systems. I admit that.

In engineering, what we call Raft is often combined with replication state machines and storage systems. In this context, the relationship between Raft and linearizability is very large.

Take the TiKV you quoted in the blog for example. For writing, Raft needs to ensure that the log which received the majority of acks can be committed. For reading, Raft needs to use Read-Index or Lease-Read to ensure linearizability. These designs are actually decisive factors affecting performance, and the key lies in whether we need such safety. However, such core logic is likely to be hard-coded in the Raft library and cannot be changed (if it can be changed, you can ignore me). But in fact, for OLAP scenarios, do we really need such a high level of consistency? For example, tdengine's default synchronization strategy is asynchronous replication rather than majority.

For another example, Zookeeper's ZAB algorithm guarantees sequential consistency, which is a slightly weaker than linearizable consistency but still a strong consistency level (refer to the jepsen official website). Then in theory, it has higher performance than the Raft algorithm that guarantees linearizability.

I'm not talking about which solution is better. I just think that since we are talking about refactoring, we should think clearly about the guarantee that the entire data model can provide to the outside world. Because they may often determine the upper bound of performance, and in most cases, these are trade-offs between safety and performance. If we don't need some level of safety, then we can definitely go for better performance.

As far as I know, the apply function is user-defined, we can still implement a parallel apply function according to different storage groups in the one raft log.

In the etcd example, we do have the freedom to handle the commit logs, thanks to their excellent abstraction. In ratis's example, it looks like they only exposed one applyTransaction interface for us to override, and I doubt whether we can implement our parallel asynchronous apply optimizations. Of course, I've just had a glance. This area needs further investigation. I just hope some of our optimizations don't go away after the migration.

What is certain is that using a raft library will definitely limit our optimization work compared with the current implementation (mixing the raft framework and business logic), but I think the availability and correctness are far greater than the performance for now.

If performance drops a little bit after migration, I support it. But if it's a big drop, I think it still needs to be considered very carefully. Of course, a preliminary conclusion can be made after further investigation.

OneSizeFitsQuorum commented 3 years ago

What do you mean by "strictly speaking it does not guarantee linearizability"? In which case, IoTDB doesn't guarantee the linearizability? In my opinion, it doesn't matter IoTDB or other products, if you are using Raft, that means you need the linearizability which is what raft provided. If IoTDB doesn't guarantee the linearizability, I think it's a bug. Right now, I didn't find any of the case.

The devil of consensus lies in the corner case. So let me give you three examples that I know so far that may violate linearizability.

The uncommitted Raft logs are not persisted. Consider a scenario where node A is the leader, nodes B and C are followers, node A synchronizes the log to B and C and gets all acks, then submits and applies the log, and then returns success to the client before the next heartbeat to followers. Then there was a momentary power outage, nodes B and C restarted immediately, but node A restarted slowly. Node B and C's log are empty after restarting and recovering, but they are already a majority, so they can serve the client's read request, then this may violate linearizability. Of course, this is strictly an implementation bug and can be fixed. But even if the bug is fixed, you can see that our raft log's serialization buffer refers to the implementation of the stand-alone WAL, it does not write to the disk and call async every time a log is written, which ensures that the performance will not be limited by the IOPS of the disk. But this may also cause the corner case mentioned above. This is the trade-off between performance and safety.
In fact, to ensure linearizability Raft uses read-index or lease-read. In our current implementation, direct-read is used for the leader and read-index is used for the follower. This can violate linearizability when a node outage causes a replacement node to execute the same read request. For more specific examples, you can refer to my blog. Of course, we could also use read-index on the leader, but this would undoubtedly degrade performance.
In fact, Raft's naive implementation guarantees at-least-once semantics, and to ensure linearizability semantics, uuid is generated on the client side and a map is logged on the server side to ensure that each command is executed only once. You can refer to section 6.3 of Raft's PhD thesis and dragonboat's discussion on Zhihu.There is no doubt that such an implementation will also affect performance.

Application(Raft user) takes care of log apply, so it doesn't matter how Raft implemented.

I hope so~Maybe we need to make a detailed investigation on Ratis

chengjianyun commented 3 years ago

The uncommitted Raft logs are not persisted. Consider a scenario where node A is the leader, nodes B and C are followers, node A synchronizes the log to B and C and gets all acks, then submits and applies the log, and then returns success to the client before the next heartbeat to followers. Then there was a momentary power outage, nodes B and C restarted immediately, but node A restarted slowly. Node B and C's log are empty after restarting and recovering, but they are already a majority, so they can serve the client's read request, then this may violate linearizability.

Agree that the case is a bug for IoTDB.

In fact, to ensure linearizability Raft uses read-index or lease-read. In our current implementation, direct-read is used for the leader and read-index is used for the follower. This can violate linearizability when a node outage causes a replacement node to execute the same read request. For more specific examples, you can refer to my blog. Of course, we could also use read-index on the leader, but this would undoubtedly degrade performance.

Sorry that I didn't make the assumption clear. I think we are talking things under IoTDB environment. So I didn't take Read operation into consideration. As you have said, linearization the read is not a regular operation in OLAP system.

And let's go back to Raft library topic. Import the Raft library doesn't mean application needs to sync all operation via the library, it depends on requirement. I don't think IoTDB needs to change current read action process flow after import the library.

And learned a lot from your blog and references, thanks.

In fact, Raft's naive implementation guarantees at-least-once semantics, and to ensure linearizability semantics, uuid is generated on the client side and a map is logged on the server side to ensure that each command is executed only once.

Agree the concept but which is unnecessary under IoTDB or time-order data environment. I think the exactly-only-once requirement here as the actions are not idempotent for all system. But operations in time-order-data or ``most of actions in IoTDB (in my understanding all of actions right now) are idempotent naturally. And again, all of these are in scope of application instead of Raft.

OneSizeFitsQuorum commented 3 years ago

Agree that the case is a bug for IoTDB.

Yes

Of course, this is strictly an implementation bug and can be fixed. But even if the bug is fixed, you can see that our raft log's serialization buffer refers to the implementation of the stand-alone WAL, it does not write to the disk and call async every time a log is written, which ensures that the performance will not be limited by the IOPS of the disk. But this may also cause the corner case mentioned above. This is the trade-off between performance and safety.

But the key sentence is this. Even if the bug is fixed. Should we also call async every time when we flush log? The current implementation is a performance-oriented compromise for us, but it certainly sacrifices security. Strictly speaking, we currently buffer data through user-space buffers, whether standalone wal or raft logs. However, with Ratis, they seem to be caching the data in the operating system's Page cache, and async is determined based on the parameters and load. I think our approach’s performance will be better (because it sacrifices more security than theirs), but I don't know how much of performance difference there is. This is worth exploring.

And let's go back to Raft library topic. Import the Raft library doesn't mean application needs to sync all operation via the library, it depends on requirement. I don't think IoTDB needs to change current read action process flow after import the library.

Hope so. IMO, the more additional code we have to write using these Raft libraries, the more optimization possibilities we have for our scenario, just like with etcd. But with Ratis, I'm not sure yet.

Agree the concept but which is unnecessary under IoTDB or time-order data environment. I think the exactly-only-once requirement here as the actions are not idempotent for all system.

In fact, even if every command is idempotent, as long as the client adds retry logic, it will violate linearizability without filtering as described above... This is a fairly advanced feature, as far as I know, only the Raft library for Dragonboat supports this logic, and I don't think Ratis has done it yet. So we can forget about this.

chengjianyun commented 3 years ago

Agree that the case is a bug for IoTDB.

Yes

Of course, this is strictly an implementation bug and can be fixed. But even if the bug is fixed, you can see that our raft log's serialization buffer refers to the implementation of the stand-alone WAL, it does not write to the disk and call async every time a log is written, which ensures that the performance will not be limited by the IOPS of the disk. But this may also cause the corner case mentioned above. This is the trade-off between performance and safety.

But the key sentence is this. Even if the bug is fixed. Should we also call async every time when we flush log? The current implementation is a performance-oriented compromise for us, but it certainly sacrifices security. Strictly speaking, we currently buffer data through user-space buffers, whether standalone wal or raft logs. However, with Ratis, they seem to be caching the data in the operating system's Page cache, and async is determined based on the parameters and load. I think our approach’s performance will be better (because it sacrifices more security than theirs), but I don't know how much of performance difference there is. This is worth exploring.

IMO, this depends on what kinds of properties IoTDB'd like to guarantee for user. If IoTDB says, I won't lost any data in any situation, then persist each log every time is required and relative performance lose should be acceptable. But if it's doesn't do such guarantee, then it's OK to do some optimization. Of course, we can make it configurable just like standalone WAL in IoTDB. As I didn't look into Ratis, I'm not sure if it makes this configurable.

And let's go back to Raft library topic. Import the Raft library doesn't mean application needs to sync all operation via the library, it depends on requirement. I don't think IoTDB needs to change current read action process flow after import the library.

Hope so. IMO, the more additional code we have to write using these Raft libraries, the more optimization possibilities we have for our scenario, just like with etcd. But with Ratis, I'm not sure yet.

Agree the concept but which is unnecessary under IoTDB or time-order data environment. I think the exactly-only-once requirement here as the actions are not idempotent for all system.

In fact, even if every command is idempotent, as long as the client adds retry logic, it will violate linearizability without filtering as described above... This is a fairly advanced feature, as far as I know, only the Raft library for Dragonboat supports this logic, and I don't think Ratis has done it yet. So we can forget about this.

I don't understand, on zhihu, it says "proposal id" is generated by client, and filter when do apply(often implemented by State Machine). So I think the retry logic is transparent for raft, the retry operation is new log for raft. Not sure my understanding is right or not.

cigarl commented 3 years ago

But the key sentence is this. Even if the bug is fixed. Should we also call async every time when we flush log?

Agree with you, in the IOT world, data may not be as sensitive,maybe we can sacrifice security (in some cases) to improve performance (we can even provide configurations to meet different needs of users).But for now, this approach seems to pose a greater danger (eg. cluster can not be restarted successfully, or enter an unstable state, returning confusing results to the user, etc) in corner cases.

BTW,maybe we should comb through some optimizations to compare.I'm so sorry that I seem to see only that the strategy of flushing to disk has an effect, the impact at the Raft level doesn't seem to be as large as expected.

We have a lot of questions about Ratis, let me spend some time on it.

OneSizeFitsQuorum commented 3 years ago

IMO, this depends on what kinds of properties IoTDB'd like to guarantee for user. If IoTDB says, I won't lost any data in any situation, then persist each log every time is required and relative performance lose should be acceptable. But if it's doesn't do such guarantee, then it's OK to do some optimization. Of course, we can make it configurable just like standalone WAL in IoTDB. As I didn't look into Ratis, I'm not sure if it makes this configurable.

Seems can be configurable. Anyway, Let's speed some time exploring it.

I don't understand, on zhihu, it says "proposal id" is generated by client, and filter when do apply(often implemented by State Machine). So I think the retry logic is transparent for raft, the retry operation is new log for raft. Not sure my understanding is right or not.

maybe you could read my doc of MIT 6.824 lab3, which is slightly more understandable.

OneSizeFitsQuorum commented 3 years ago

To conclude, let's start by investigating Java implementations of some raft libraries, such as Ratis, Sofajraft, etc. In the process of exploring, I have come up with five aspects that need to be paid attention to, and you can add more:

How Raft libraries integrate with the state machine: is there a way like etcd to send committed logs back to the upper level for free processing (best so we can do our own optimizations including consistency model considerations)? Or just abstracting the apply interface and letting us override (which may not be very friendly).
Can policies for Raft log persistence be configured: forceSync, batch sync, buffer, etc.
Are some common Raft optimizations done: batching/ Pipeline /Async apply, which are very common and effective Raft optimizations.
What was done to support multi-raft: was there any pooling done, etc.? This may be of great help in massive raft groups.
Correctness: Is there a lot of UT/IT? Are there any TLA+ test/chaos test validation results? Whether there is a large number of production cases.

jt2594838 commented 3 years ago

Linearizability consists of two aspects: consensus and query implementation, we may have spent some efforts on consensus, but little on query implementation, and you should not blame one for another. Meanwhile, queries in IoTDB may be very different from those in systems like etcd, and the concern of query implementation could be also different.

Relying on another library for core functionalities is risky. Other projects keep bug fixing and refactoring too, it is hard to say if we can catch up with them after some time point. And it would become very difficult to implement any improvements on it, which may eventually make IoTDB lack of competitiveness, as the most important advantage of current IoTDB lies on its superior performance, we would completely lose control if the consensus part becomes a bottleneck. Ans should any major bug occur, we will have to wait another team to fix it, because fixing it ourselves will create forks.

Also from my personal point of view, I would not hope IoTDB to just become an integration of libraries, without owning its core techniques, which will make us easily overtaken by those who are more willing to challenge.

Above all, most importantly, before you turn to another method, you should answer the following questions:

What is the problem of the current method, do I have any proof?
Is the new method really better than the current one, how do I show it?
Is switching to the new method a better choice than improving the current one, is it really that easy? Being doubtful is good, but solid evidence is more convincing.

chengjianyun commented 3 years ago

Linearizability consists of two aspects: consensus and query implementation, we may have spent some efforts on consensus, but little on query implementation, and you should not blame one for another. Meanwhile, queries in IoTDB may be very different from those in systems like etcd, and the concern of query implementation could be also different.

Can't agree more.

Relying on another library for core functionalities is risky. Other projects keep bug fixing and refactoring too, it is hard to say if we can catch up with them after some time point. And it would become very difficult to implement any improvements on it, which may eventually make IoTDB lack of competitiveness, as the most important advantage of current IoTDB lies on its superior performance, we would completely lose control if the consensus part becomes a bottleneck. Ans should any major bug occur, we will have to wait another team to fix it, because fixing it ourselves will create forks.

The raft libraries often only provides the ability of consensus on Raft log, it doesn't affect business logic much and doesn't force everything is linearization. With proper 3rd party library, we could have enough flexibility to optimize and improvement performance. The production 3rd party libraries should have been verified by some great productions so that we can believe no major bugs happen.

Also from my personal point of view, I would not hope IoTDB to just become an integration of libraries, without owning its core techniques, which will make us easily overtaken by those who are more willing to challenge.

Above all, most importantly, before you turn to another method, you should answer the following questions:

What is the problem of the current method, do I have any proof?

Updated in description of the issue.

Is the new method really better than the current one, how do I show it?

From view of availability, maintainability and code structure, I think yes, the new method would be better than current one. For performance, I'm not sure as it needs experiment. But I think performance lose will be small or no performance lose. Because current implementation of Raft doesn't apply most of optimizations which has been made in other system such as async append, async apply, etc.

Is switching to the new method a better choice than improving the current one, is it really that easy?

The safety properties is too agile to guarantee without enough tests for Raft. Building up a systematic test cases is really hard and looks is impossible under currently implementation(Raft and Business logic mix together). Of course, integrate the new library means we may change current structure, the effort can be evaluated after investigation.

Being doubtful is good, but solid evidence is more convincing.

jt2594838 commented 3 years ago

No bug for now does not mean no bug forever. And different situations and applications expose different problems. The important thing is building our own verification instead of trusting others just because they have tested on their own cases.

I am more than sure that we will have to face some bugs whatever we use, and if just several bugs would make you lose your heart, I think it is probably not likely you will build any confidence in your own creation, and you may never get near the core of some technologies.

I am not sure about what you mean by async append, but we have async apply. And we are very serious about performance problems as performance is one of the few advantages of IoTDB and why we would want to build our own consensus. Otherwise, why do you think we would bother so much. We want to continue this advantage instead of handing it out to others.

I do not understand why you insist that Raft logic is mixed with business logic, as far as I know, only the implementations of partition table and applier concern how stand-alone IoTDB works, and they are already made interface. Most parts of the cluster module can work without any interference with the standalone part, and can also be tested as any raft implementation.

chengjianyun commented 3 years ago

No bug for now does not mean no bug forever. And different situations and applications expose different problems. The important thing is building our own verification instead of trusting others just because they have tested on their own cases.

I am more than sure that we will have to face some bugs whatever we use, and if just several bugs would make you lose your heart, I think it is probably not likely you will build any confidence in your own creation, and you may never get near the core of some technologies.

I think the discussion out of the scope, let's narrow down the scope to the topic. IMO, IoTDB is a production, when we deliver the production to user, we need to tell user how good/bad are we. It's nothing to do with technologies or confidence. Many production are based on others implementation such as TiKV depends on RocksDB which opensourced by Facebook. Raft library is only a tool, import it could reduce unnecessary works and provide a relative high confidence module to sync data to different node.

I am not sure about what you mean by async append, but we have async apply. And we are very serious about performance problems as performance is one of the few advantages of IoTDB and why we would want to build our own consensus. Otherwise, why do you think we would bother so much. We want to continue this advantage instead of handing it out to others.

Agree that the performance is one of the advantages. But even the fastest vehicle should be safe in my understanding. Do you agree this?

Async append means leader can append log locally and broadcast log parallelly.

For async apply, not sure my understanding is right or not, but according code, I think it's sync in current IoTDB implemenation. The appendEntries acquires lock of logManager until get response which means apply complete. And applyEntries in RaftLogManager is also a sync operation. Correct me if my understanding is wrong.

appendEntries in RaftManager:
private long appendEntries(long prevLogIndex, long prevLogTerm, long leaderCommit, List<Log> logs) {
// ... ignore lines
    synchronized (logManager) {
      long startTime = Timer.Statistic.RAFT_RECEIVER_APPEND_ENTRY.getOperationStartTime();
      resp = logManager.maybeAppend(prevLogIndex, prevLogTerm, leaderCommit, logs);
      Timer.Statistic.RAFT_RECEIVER_APPEND_ENTRY.calOperationCostTimeFromStart(startTime);
      if (resp != -1) {
        if (logger.isDebugEnabled()) {
          logger.debug("{} append a new log list {}, commit to {}", name, logs, leaderCommit);
        }
        resp = Response.RESPONSE_AGREE;
      } else {
        // the incoming log points to an illegal position, reject it
        resp = Response.RESPONSE_LOG_MISMATCH;
      }
    }
    return resp;
  }

// applyEntries in RaftLogManager
  void applyEntries(List<Log> entries) {
    for (Log entry : entries) {
      applyEntry(entry);
    }
  }

  public void applyEntry(Log entry) {
    // For add/remove logs in data groups, this log will be applied immediately when it is
    // appended to the raft log.
    // In this case, it will apply a log that has been applied.
    if (entry.isApplied()) {
      return;
    }
    if (blockAppliedCommitIndex > 0 && entry.getCurrLogIndex() > blockAppliedCommitIndex) {
      blockedUnappliedLogList.add(entry);
      return;
    }
    try {
      logApplier.apply(entry);
    } catch (Exception e) {
      entry.setException(e);
      entry.setApplied(true);
    }
  }

I do not understand why you insist that Raft logic is mixed with business logic, as far as I know, only the implementations of partition table and applier concern how stand-alone IoTDB works, and they are already made interface. Most parts of the cluster module can work without any interference with the standalone part, and can also be tested as any raft implementation.

I think we could have a much clear interface definitions and responsibility boundary if we can refine the abstract of Raft implementation. E.g. DataGroupMember and MetaGroupMember are extend from RaftMember, but RaftMember mainly handles Raft related logic such as appendEntry. But DataGroupMember and MetaGroupMember are handlers of application logic such as query or non-query plans.

OneSizeFitsQuorum commented 3 years ago

And applyEntries in RaftLogManager is also a sync operation. Correct me if my understanding is wrong.

You will see an AsyncDataLogApplier if you go further. And we did a lot more rather than just async apply. And I call it Parallel async apply before, which means data from different storage groups can be applied in parallel. You can check the code again.

jt2594838 commented 3 years ago

Most reliable safety is built on the top of deep knowledge of the techniques you use, not how others say it is safe. Sure a lib it is, but what are "unnecessary works"? Our initial purpose was to create a fully controllable cluster module that will not become a bottleneck in the future. Consensus may be the most important thing above all, and it deserves our primary efforts. It is not JUST a tool, it could be the foundation of the whole cluster module.

Of course safety matters, but how can you assert we can not make it safe? Are we really less capable than those who have written those libs, especially we can integrate the experiences of all of them as they are all open-source. I think this is what open-source is all about: we learn from others instead of we copy from others. After all, open-source or not, you can still use compiled libs from others, while being open-source, you learn from them what are the good practices and bad practices so that you can improve your own.

The sync is used because the client wants a result so we must wait until the end of execution, if the result is not desired, the block can be easily removed to achieve a fully synchronized method.

I am not sure what you mean by a more clear interface. RaftMember provides basic Raft implementation, based on it, MetaGroupMember forms a meta group to manage metadata, DataGroupMember forms data groups to manage data. Raft logic and business logic are obviously separated, and I think it is clear enough. Please specify your problems if any.

neuyilan commented 3 years ago

I think the current focus is on the following points:

Can we implement a good and correct raft framework.

The answer must be YES, but WHEN? One year or two? With the development of other time-series databases, such as influxdb_ iox、tdengine、matrixdb. Their market share is increasing. As an industrial product, if they want to have their own competitiveness, they must have the characteristics of rapid iteration and stability. However, over the past year, there is no industrial version of cluster version of IoTDB. From this perspective, I prefer to build a usable and secure distributed IoTDB version first. No matter who uses the raft framework, and I think it is a faster way to integrate one that has been widely used in the industry than to implement it yourself.

How good is the performance of the raft framework implemented by yourself? As far as we know, the performance improvement of our cluster version is as follows:

2.1 async apply； 2.2 The query does not follow the raft, and the consistency of the query can be determined according to the configuration. But as far as I know, the apply function of Apache Ratisframework can be decided by users. Moreover, the read request of Ratis does not follow the raft, so I don't think the performance will degrade much.
What is a competitive product? A safe, stable and widely used product is a competitive product. If there are no available products for users, what about competitiveness?

jt2594838 commented 3 years ago

You may think it is easier to switch to a new frame than to improve the current one, but it is totally not the case. Implementing a distributed version involves much more than calling some methods in a library. Interface adaption, schema conversion, exception handling, cluster organization, and data distribution... there are so many to do. On the other hand, the problems you face now are just worries, doubts, and lack of tests, instead of major issues that will take months to fix.

The paralleled log execution, controllable log replication dispatching, batching, and some ongoing improvements. You may have missed some of them. And since many of us come from a university lab, we will be able to put more state-of-the-art improvements into it.

SQL Server and Oracle are very safe and stable, though they are nothing about competitiveness in this field. One may speed up to take some share at the early stage, but without its core, it will be soon driven out of the market.

jt2594838 commented 3 years ago

By the way, I checked the recent Jira issues that are related to the cluster module, only to find very, very few of them are related to the consensus implementation. So I sincerely doubt merely changing to another consensus implementation can solve any problem.

OneSizeFitsQuorum commented 3 years ago

You may think it is easier to switch to a new frame than to improve the current one, but it is totally not the case. Implementing a distributed version involves much more than calling some methods in a library. Interface adaption, schema conversion, exception handling, cluster organization, and data distribution... there are so many to do.

The goal for all of us was to make the cluster module more stable: some poeple felt it was better to use a mature Raft library because Raft was hard to be implemented correctly, but I also observed Kafka writing their own Raft instead of using other Raft libraries. Other people don't feel the need to make big changes right now because they don't seem to have many problems at the moment, but using a great Raft library like etcd frees up the consensus bottleneck of the entire cloud native.

Being pragmatic, we have to admit that even if we decided to use another raft library, it wouldn't have been possible in a month or two. Also, most of the bugs we've fixed so far have nothing to do with consensus. Therefore, the decision to move precipitously requires a great deal of risk.

So I suggest we go in three directions in parallel: research + refactoring + testing.

Research: Follow my list of 5 concerns to see how other libraries are doing; Understanding these things will help us understand the Raft algorithm more deeply. Whether the Raft library is replaced or not, this is better for cluster IoTDB because the people who developed it know more about Raft.
Refactoring: Since we've recently started refactoring cluster code, I thought we could refactor Raft code as well. Ideally, it should be a single module, like Kafka. For anyone who wants to change raft library, this is basically the process of changing library, and we must to do some abstraction in order to replace current Raft implementation. For those who don't want to change raft library, doing so can improve code readability and make it easier to add more complex tests. After modularization, we can enumerate some performance comparisons and pros and cons before we discuss whether raft libraries need to be replaced. At that point, I don't think we would be as divergent as we are now.
Testing: After nearly a year of maintaining cluster modules, I believe that most of the bugs fixed so far have nothing to do with consensus and will show up even replacing raft library. Of course, I'm not saying that the Raft we implemented currently had no bugs, it was probably due to a lack of testing and a lack of production cases. Therefore, I suggest that we can fully test the cluster module from now on, and according to the test results we can make the next step of judgment. In addition, I am currently investigating and designing cluster chaos-test framework. If everything goes well, I will have a chaos testing framework that is easy to deploy at the end of September, and we can also use this framework to test cluster‘s stability. Welcome to join me

chengjianyun commented 3 years ago

Summary different opinions and reorg the content to make things clear.

Current situation of cluster availability

https://github.com/apache/iotdb/pull/3939 The bug is found by reviewing code.
https://github.com/apache/iotdb/pull/3930 The bug is found by error msg in a restart fail case. Actually the error msg has been ignored for a long time because server can start normally in most cases.
https://github.com/apache/iotdb/pull/3848 This is a defensive change after reviewing code.
https://github.com/apache/iotdb/pull/3832 The bug is found in a sever can't start case.

The bugs are find/fixed by us in recent month. Some of the bugs is found by reviewing code and look like obvious and the other are reported in Raft unrelated issue. I have reasons to believe that there are much more bugs we don't know. Some of the bug may cause data lose, as IoTDB is most used in big data scenario, it's hard to feel when a small part of data lost. That's the worst situation, we don't know when that will happen and what result it will cause.

Goal of discussion

Find a way to improve availability of cluster of IoTDB and keep outstanding performance in a short time(2 months around in my opinion).

Assumption

The assumptions are the base of the discussion. We don't need to discussion on these any more.

Chosen 3rd party library of Raft has been widely used in industry, has high quality and the algorithm is correct in known cases. As has been widely used, I think we could believe known cases are all cases we could ever have.
An application scenario may cause many scenarios in Raft.

How to achieve the goal

Import 3rd party Raft framework

Plan

Abstract proper interfaces for Raft, make it a module gradually .
At same time, we can investigate some 3rd party libraries and find if we can have a proper one.
Import the chosen library to replace our raft module if needed.

Of course, things are not as easy as said but it's clear. After this, the tests can be focused on application scenarios which are much less than raft scenarios.

Advantages

Doable
Raft correctness is guaranteed
Simplify tests, no UT for Raft

Disadvantages

Maybe lose some performance
Lost control of Raft, update need a relative long term
Can't customize raft to improve performance

Improvement base on current implementation

Plan

Advantages

Could customize raft to optimize performance.
Raft update could be done in a short term.

Disadvantages

Correctness of Raft is hard to guarantee.
It's hard to make raft tests complete.

My thought

I don't support the solution that improve current implementation because:

I don't know the current tests for raft are complete or not
I don't know how far we are from to make raft test cases complete

As TiKV implemented all tests of raft in etcd(after it has been stabilized for years) with RUST, if we could do something similar in a short time, I think its OK improve base on current implementation by ourselves. And optimize raft at this stage is not a good idea, please let's make it correct first.

Supplement if I miss anything!

jt2594838 commented 3 years ago

The listed issues concentrate on raft log management, which is already a known issue that needs refactoring, and it only contributes to a small part of the whole raft framework. While there are about 100 issues concerning the usage over raft. It is much more reasonable to think switching to a new framework will cause significantly more issues than fixing the current one.

The tests of raft is merely a small part of the whole system, as the algorithm itself is quite simple and is well explained in the paper. What really matters is how you integrate raft with your own business, which, unfortunately, cannot be helped no matter what framework you use. So if you think using an existing framework could save a lot of tests, that is not the case. However, what is sure is that you will have to follow someone else's design pattern to redesign your interfaces and make adaptions, accepting their system and application assumption, which may not be easy and pleasent.

jt2594838 commented 3 years ago

My suggestion is, please give us a list of things your preferred frameworks used to qualify their raft implementation (tests to be specific, and please restrict to raft only). Then we will discuss whether it is easier to add the missing tests or directly using their implementation. I think this is much more convincing than saying "I do not know", "I worry", "I doubt".

jixuan1989 commented 3 years ago

Hi folks,

Very glad to see the deep discussion about this topic.

Actually we had several similar discussions in the last 2 years. Even last year, I did not give up trying to use Ratis [1]. However, we finally give up the solution.

Some guys may read the paper "Apache IoTDB 分布式架构设计" [2]. The paper is written in 2019, and at that time, we used sofa-jraft[3], and we should release the cluster version at that time but finally, we gave up. The experience told us using an existing implementation is not as beautiful as we thought. In that version, I asked many times "why you design classes and codes like this", the answer is always "I have to because of the underlying implementation".

We finally blame the performance problem and architecture scalability problem on the third-part library. I know the conclusion may be not correct, but it is hard to guarantee whether the history will replay again...

What is more important, Raft and multi-raft are not our destinations. We believe in the time series management scenario, Raft is over-heavy. Using an existing implementation will limit us finally. and I do not want to re-implement the cluster module once again at that time...

I know our implementation may have many bugs and we omit many corner cases now. But it may not be a bad thing. The progress will let us understand the system deeply.

IMO, we really need to:

if we can decouple the implementation, it is better. (But it seems there is few things to do according to Tian Jiang's above opinion)
If we consider others have been fully tested, we can learn their test cases to make our implementation more robost.

The answer must be YES, but WHEN? One year or two?

When writing here, I remember another discussion: when we want to implement the monitoring module for IoTDB, which 3rd library should we use? micrometer, dropwizard or others? Some of them have rich features but some of them have better performance. So, how to choose? Finally, we decide to extract a framework and let users choose what they want to use.

So, if we can do some refactor and then allow let users to switch the implementation, I can accept that.

But it is hard to convince me to give up our implementation. In my view, it is a kind of drinking the poison to kill the thirsty.

[1] https://lists.apache.org/thread.html/3c49bc234415417511226c7032009868f9d3475026adf07e2e41f99f%40%3Cdev.ratis.apache.org%3E [2] http://scis.scichina.com/cn/2020/SSI-2019-0189.pdf [3] https://github.com/sofastack/sofa-jraft

jixuan1989 commented 3 years ago

BTW, I always agree to provide a no-bug or no critical bug version to industrial users. It depends on all of your contributions. I believe we can achieve that if all of us contribute together.

jt2594838 commented 3 years ago

There is also another more practical problem. Both Raft and IoTDB have their own log management, and we aim to combine them or replace one with another. I think state machine safety is more important than consensus safety since breaking consensus requires more than one node to malfunction while a single node failure may break state machine safety, so IoTDB WAL will probably be the core of the final log.

However, the log management is enclosed in other frameworks as it normally does not interfere with the business logic, so it will be extremely hard to achieve the purpose of unifying logs with one of the implementation is hidden behind.

chengjianyun commented 3 years ago

Hi, all

Thanks for all your opinions and discussions.

I'd like to close the discussion in a few days. Based on current implementation I don't think we can integrate 3rd party Raft library and the debate here meaningless right now. But no matter we finally decide to apply 3rd party library or not, I think all of us agree that we should decouple the raft and state machine implementation and improve quality at same time. Although this is very hard as have discussed, but I think we can set the aim and achieve it by small step refactors, new design consideration, etc. New discussion should be arise once there is some plan about the decouple.

Any concerns, please let me know.