cigarl opened 3 years ago
Hi, this is your first issue in the IoTDB project. Thanks for your report. Welcome to the community!
For the first point, the framing mechanism is actually mentioned in the Raft paper in its discussion of the snapshot implementation, and we should approach our implementation the same way. Can you file an issue, and we will evaluate the priority of this rework then?
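For reference, a minimal sketch of the chunked transfer described by the Raft paper's InstallSnapshot RPC (a byte offset plus a done flag); the class and method names below are illustrative, not IoTDB's actual thrift definitions:

```java
// Sketch only: follows the chunking idea from the Raft paper's InstallSnapshot RPC,
// not IoTDB's existing code. Each chunk stays well below the thrift frame limit.
public class SnapshotChunkSender {

  private static final int MAX_CHUNK_BYTES = 64 * 1024 * 1024; // hypothetical chunk budget

  /** Splits a serialized snapshot into chunks and sends them in order. */
  public void sendSnapshot(byte[] snapshot, Follower follower) {
    int offset = 0;
    while (offset < snapshot.length) {
      int len = Math.min(MAX_CHUNK_BYTES, snapshot.length - offset);
      byte[] chunk = new byte[len];
      System.arraycopy(snapshot, offset, chunk, 0, len);
      boolean done = offset + len == snapshot.length;
      // The follower appends the chunk at the given offset and only installs
      // the snapshot once the final chunk (done == true) has arrived.
      follower.installSnapshotChunk(offset, chunk, done);
      offset += len;
    }
  }

  /** Placeholder for the follower-side RPC; illustrative only. */
  interface Follower {
    void installSnapshotChunk(long offset, byte[] data, boolean done);
  }
}
```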
On the second point, the work on mTree Snapshot seems important in this scenario. I think the judgment about Peer Recovery is also correct.
I suggest you ask these questions in the community group or on the mailing list after filing the issue, so that more people will notice and join the discussion. That way it is more likely to be scheduled and resolved sooner.
Thanks for the reminder. I will send an email to the community later and try to address some of these problems next week (e.g., we might have duplicate requests in the `CatchupTask`).
Description
When my environment (3 nodes, 3 replicas) has network fluctuations, or a node is overloaded and responds slowly, I find that after the node rejoins the cluster, `CatchupTask` may corrupt my cluster. So I did some analysis and found the following. Please correct me if there is anything wrong.

Question
`CatchupTask` does not limit the amount of data handled for a single slot. In other words, a slot can end up with too many schemas or files (like slot[981] and slot[911]), which makes catch-up a heavy operation. Besides, since we limit the maximum thrift frame size to 512 MB, such a request cannot be sent to the other node successfully. When a slot is blocked on such a request, the request fails and is retried repeatedly. At the same time, as operations accumulate, the request grows even larger and will never be executed successfully. These threads keep occupying resources, and their number keeps increasing. A sketch of the size-bounded batching I have in mind follows.
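A minimal sketch of that batching, assuming a hypothetical helper class (not existing IoTDB code):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: batches the serialized entries of one slot so that no
// single thrift request can approach the 512 MB frame limit.
public class SlotBatcher {

  // Keep well below the 512 MB thrift frame limit to leave room for headers.
  private static final long MAX_REQUEST_BYTES = 256L * 1024 * 1024;

  /** Splits the serialized entries of a slot into requests bounded by MAX_REQUEST_BYTES. */
  public static List<List<byte[]>> splitIntoRequests(List<byte[]> slotEntries) {
    List<List<byte[]>> requests = new ArrayList<>();
    List<byte[]> current = new ArrayList<>();
    long currentBytes = 0;
    for (byte[] entry : slotEntries) {
      if (!current.isEmpty() && currentBytes + entry.length > MAX_REQUEST_BYTES) {
        requests.add(current); // close the current request and start a new one
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(entry);
      currentBytes += entry.length;
    }
    if (!current.isEmpty()) {
      requests.add(current);
    }
    return requests;
  }
}
```

The exact budget matters less than the invariant: no single catch-up request can come close to the frame limit, so a slot can never get permanently stuck retrying an unsendable request.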
The ordering of `local recovery` and `peer recovery` is not controlled. `local recovery` can be slow because of a large `mlog.bin`, but `peer recovery` has already begun. Although I haven't found exactly what went wrong, it is obvious that the CPU load climbs and the log files report a lot of errors (because `CatchupTask` has already restored the schema, and `local recovery` then repeats those operations). See the sketch below for the ordering I have in mind.
Some thinking
Maybe, in `CatchupTask`, we need to control both the size of a single request and the amount of data handled per slot. Assuming a single slot holds one million schemas, we might need to split the catch-up into 10 or more operations. We also need to bound the size of the entire request so that it never exceeds the thrift frame limit (512 MB).

In addition, when a node restarts, `local recovery` should be prioritized, and `peer recovery` should start only after `local recovery` is complete. We should also consider whether the trigger conditions for `mtree-snapshot` are too strict: if no snapshot is taken for a long time, the `mlog.bin` file becomes very large and the recovery speed of nodes in the cluster becomes inconsistent, which may cause other problems (for example, slowly recovering nodes cannot connect to the others, operations are repeated during recovery, and so on). WDYT?