AutoMQ / automq

AutoMQ is a cloud-first alternative to Kafka by decoupling durability to S3 and EBS. 10x cost-effective. Autoscale in seconds. Single-digit ms latency.
https://www.automq.com/docs

[Enhancement] WriteAheadLog with sequentially callback #1718

Open superhx opened 1 month ago

superhx commented 1 month ago

Problem

Currently, BlockWALService persists data blocks in parallel, responding directly to the upper layer with success as soon as any data block is persisted, even if the previous data block has not completed persistence. Although this method provides "better" write latency, it shifts the responsibility of ensuring sequentiality to the upper layer, making the upper layer logic more complex.
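The problem can be illustrated with a small, self-contained sketch (all names here are illustrative, not AutoMQ's actual API): two blocks are persisted in parallel, and the later block's ack arrives first, so the caller cannot treat any single success as proof that all preceding data is durable.

```java
import java.util.Queue;
import java.util.concurrent.*;

public class OutOfOrderDemo {
    public static void main(String[] args) throws Exception {
        Queue<String> ackOrder = new ConcurrentLinkedQueue<>();
        CountDownLatch block2Done = new CountDownLatch(1);
        ExecutorService ioPool = Executors.newFixedThreadPool(2);

        // block1's device write is slow: here it deliberately waits until block2 finishes
        ioPool.execute(() -> {
            try { block2Done.await(); } catch (InterruptedException ignored) { }
            ackOrder.add("block1");
        });
        // block2's device write is fast
        ioPool.execute(() -> {
            ackOrder.add("block2");
            block2Done.countDown();
        });

        ioPool.shutdown();
        ioPool.awaitTermination(5, TimeUnit.SECONDS);
        // acks arrive as [block2, block1] even though block1 precedes block2
        // in the WAL: the upper layer must re-order them itself.
        System.out.println(ackOrder);   // prints: [block2, block1]
    }
}
```

This re-ordering burden is exactly what S3Storage currently shoulders.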

Expectation

Implement a SequentialBlockWALService:

CLFutureX commented 1 month ago

@superhx Hey, after carefully considering the current write model, I find that to retain the parallel write model while adding sequential callback semantics on top of it, it may be more appropriate to handle this at the stage after the block has been written.

After the block is written to the WAL, consider serializing the process of writing the record to the logCache to reduce the complexity at the upper layer.

  1. Create a single-threaded callbackExecutor.
  2. When an I/O thread completes a write, it triggers the Request's callback by handing handleAppendCallback over to the callbackExecutor, which performs the callbacks sequentially.

For example: com.automq.stream.s3.S3Storage#append0
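The two steps above could be sketched as follows. This is a minimal illustration, not AutoMQ's actual code: the key property is that `Executors.newSingleThreadExecutor()` runs tasks one at a time, in FIFO submission order, so callbacks submitted in WAL order are also executed in WAL order without extra locking.

```java
import java.util.*;
import java.util.concurrent.*;

public class SequentialCallbackDemo {
    public static void main(String[] args) throws Exception {
        // step 1: a single-threaded callbackExecutor
        ExecutorService callbackExecutor = Executors.newSingleThreadExecutor();
        List<Integer> callbackOrder = Collections.synchronizedList(new ArrayList<>());
        int n = 4;
        CountDownLatch done = new CountDownLatch(n);

        // step 2: hand each request's callback to the callbackExecutor;
        // a single-threaded executor executes them strictly in submission order
        for (int i = 0; i < n; i++) {
            final int seq = i;
            callbackExecutor.execute(() -> {    // stands in for handleAppendCallback
                callbackOrder.add(seq);
                done.countDown();
            });
        }

        done.await(5, TimeUnit.SECONDS);
        callbackExecutor.shutdown();
        System.out.println(callbackOrder);      // prints: [0, 1, 2, 3]
    }
}
```

Note that this only serializes callback *execution*; if I/O completions can arrive out of WAL order, submission itself still has to happen in WAL order for the acks to be sequential.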

This approach avoids the concurrency control inside the callbackSequencer, while also decoupling business-related operations from I/O operations: today the I/O thread is responsible not only for writing the block but also for running the request callback chain.

Potential issues: under a large number of requests, processing callbacks on one thread may create significant pressure, though for now these are all pure in-memory operations. If delayed ack responses become a concern, the callbackExecutor could instead be a multi-threaded pool that routes tasks to threads by streamId, preserving per-stream ordering. Alternatively, a dedicated thread pool could handle ack responses to the upper layer.

Chillax-0v0 commented 1 month ago

@CLFutureX Hi, I think you might have misunderstood @superhx's idea. In his description,

it shifts the responsibility of ensuring sequentiality to the upper layer, making the upper layer logic more complex

the "upper layer" mentioned here refers to S3Storage rather than the caller of S3Storage.

Specifically, you can see that due to the current non-sequential callbacks, we have added a lot of complex logic to S3Storage, such as WALCallbackSequencer and WALConfirmOffsetCalculator. This has made the S3Storage code overly complex and difficult to maintain.

What we want to do now is to thoroughly refactor BlockWALService so that it calls back sequentially, thereby reducing the complexity of S3Storage.

CLFutureX commented 1 month ago


Yes, this will be a big project.

CLFutureX commented 3 weeks ago

@superhx @Chillax-0v0 hey, please take a look

Design for Ordered Return with Concurrent Writing

Expectations:

  1. Maintain the model of concurrent writing
  2. Ensure ordered return. The order here can be defined at the stream dimension, i.e. guaranteeing ordered writes within each partition.


Plan Design:

  1. Based on the existing block, abstract the internal logical partition;
  2. Utilize the multi-thread model EventLoopGroup to replace the existing pollBlockScheduler + ioExecutor for data writing;
  3. Add a single ioThread, responsible for adhering to IOPS limits to complete data force operations.

Block

Continuing the current block design: externally the block remains unchanged, but internally it is divided into logical partitions, with records routed to different blockPartitions by streamId, so the change is invisible to the outside.

Partition-based distribution: after a block is written, it is routed to the different event-loop threads according to its block partitions.

Calculating a blockPartition's startOffset: following the current write logic, writes are based on the block's startOffset, so during the distribution phase each partition's startOffset must be computed.

For example: blockPartition1.startOffset = block.startOffset; blockPartition2.startOffset = block.startOffset + alignLargeByBlockSize(blockPartition1.size)
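The calculation above can be worked through concretely. This sketch assumes `alignLargeByBlockSize` rounds a size up to the next 4 KiB block boundary, mirroring the alignment BlockWALService uses for blocks; the numbers are made up for illustration.

```java
public class PartitionOffsetDemo {
    static final long BLOCK_SIZE = 4096;

    // assumed behavior: round size up to the next multiple of BLOCK_SIZE
    static long alignLargeByBlockSize(long size) {
        return (size + BLOCK_SIZE - 1) / BLOCK_SIZE * BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long blockStartOffset = 8192;   // block.startOffset
        long partition1Size = 5000;     // bytes held by blockPartition1

        long partition1Start = blockStartOffset;
        long partition2Start = blockStartOffset + alignLargeByBlockSize(partition1Size);

        // alignLargeByBlockSize(5000) rounds up to 8192,
        // so partition2 starts at 8192 + 8192 = 16384
        System.out.println(partition1Start);    // prints: 8192
        System.out.println(partition2Start);    // prints: 16384
    }
}
```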

EventLoopGroup

Parallel writing: each event loop can write in parallel, following the existing write model.

Completion notification: a finished write does not mean the data is actually durable; it must wait for the IO thread to execute the force method. The CompletableFuture of each write is therefore transferred to an internal queue of completed-but-pending notification items. In this way, ordered parallel writing per stream is achieved for the upper layer.

IO thread

IOPS limit: to keep limiting IOPS, the IO thread is responsible for executing a force once every 1/3000 of a second by default. (Further optimization is possible here: when an event loop writes, it updates a globally visible flag, and the IO thread checks that flag before forcing, to avoid empty force executions.)

Notifying the upper layer: after the force, how are the event loops told to complete the corresponding CompletableFutures?

Plan 1: lock control. While the IO thread executes force, event loops are barred from continuing to write. When the force finishes, the IO thread notifies the event loop group, and each event loop traverses its completed-response collection.

Plan 2: special task flag. When the IO thread starts a force, it inserts a special sentinel object into each event loop's completed-notification collection. After the force completes, it notifies the event loops; on notification, each event loop traverses its completed-notification collection until it encounters the sentinel. The notification is thus bounded by the task flag. With this, parallel writing with ordered return in the WAL is complete.
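Plan 2 (the special task flag) could be sketched roughly as below. All names are illustrative: a sentinel future marks the point in each event loop's queue up to which a force has made data durable, and the event loop drains only up to that marker.

```java
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

public class SentinelNotifyDemo {
    // sentinel meaning "everything enqueued before this is now durable"
    static final CompletableFuture<Void> FORCE_MARKER = new CompletableFuture<>();

    public static void main(String[] args) {
        Queue<CompletableFuture<Void>> pending = new ConcurrentLinkedQueue<>();

        // event loop wrote two records; their acks are pending a force
        CompletableFuture<Void> f1 = new CompletableFuture<>();
        CompletableFuture<Void> f2 = new CompletableFuture<>();
        pending.add(f1);
        pending.add(f2);

        // IO thread: insert the marker, then force() the device (omitted here)
        pending.add(FORCE_MARKER);

        // a record written after the force started must NOT be acked yet
        CompletableFuture<Void> f3 = new CompletableFuture<>();
        pending.add(f3);

        // event loop, on the "force finished" notification:
        // drain the queue up to (but not past) the marker, acking in write order
        CompletableFuture<Void> cf;
        while ((cf = pending.poll()) != null && cf != FORCE_MARKER) {
            cf.complete(null);
        }

        System.out.println(f1.isDone() + " " + f2.isDone() + " " + f3.isDone());
        // prints: true true false
    }
}
```

Because each event loop only reads its own queue and the IO thread only appends to it, this keeps the notification path lock-free, matching the plan's goals.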

The benefits of the plan include:

● Ordered writing achieved on a per-partition basis.
● The entire process is lock-free.
● Clear division of responsibilities with well-defined boundaries: write operations, event loops, and the I/O thread each take on independent responsibilities, with no interference between them (unlike the current interference between write operations and polling operations).
● Simple and easy to maintain. With this in place, the current WALCallbackSequencer class can be eliminated.