Closed by freddyrios 11 months ago
Confirmed.
State of the leader immediately after stopping the follower:
[System.Runtime]
% Time in GC since last GC (%) 0
Allocation Rate (B / 1 sec) 272,072
CPU Usage (%) 1
Exception Count (Count / 1 sec) 129
GC Committed Bytes (MB) 38
GC Fragmentation (%) 0.193
GC Heap Size (MB) 18
Gen 0 GC Count (Count / 1 sec) 0
Gen 0 Size (B) 192
Gen 1 GC Count (Count / 1 sec) 0
Gen 1 Size (B) 2,057,480
Gen 2 GC Count (Count / 1 sec) 0
Gen 2 Size (B) 600,912
IL Bytes Jitted (B) 303,868
LOH Size (B) 4,130,024
Monitor Lock Contention Count (Count / 1 sec) 0
Number of Active Timers 2
Number of Assemblies Loaded 54
Number of Methods Jitted 4,390
POH (Pinned Object Heap) Size (B) 48,360
ThreadPool Completed Work Item Count (Count / 1 sec) 188
ThreadPool Queue Length 0
ThreadPool Thread Count 9
Time spent in JIT (ms / 1 sec) 0
Working Set (MB) 102
30 mins later:
[System.Runtime]
% Time in GC since last GC (%) 0
Allocation Rate (B / 1 sec) 399,528
CPU Usage (%) 1
Exception Count (Count / 1 sec) 127
GC Committed Bytes (MB) 50
GC Fragmentation (%) 0.189
GC Heap Size (MB) 28
Gen 0 GC Count (Count / 1 sec) 0
Gen 0 Size (B) 192
Gen 1 GC Count (Count / 1 sec) 0
Gen 1 Size (B) 3,379,200
Gen 2 GC Count (Count / 1 sec) 0
Gen 2 Size (B) 600,912
IL Bytes Jitted (B) 308,623
LOH Size (B) 14,747,920
Monitor Lock Contention Count (Count / 1 sec) 0
Number of Active Timers 2
Number of Assemblies Loaded 54
Number of Methods Jitted 4,437
POH (Pinned Object Heap) Size (B) 48,360
ThreadPool Completed Work Item Count (Count / 1 sec) 193
ThreadPool Queue Length 0
ThreadPool Thread Count 9
Time spent in JIT (ms / 1 sec) 0
Working Set (MB) 115
The LOH growth suggests that some large buffers are leaking.
Root cause analysis: currently, the WAL uses memory pooling to rent a LogEntry[] array internally. When all members are healthy, the array has a fixed length in every heartbeat iteration. When one of them becomes unreachable, the leader cannot make any progress on it, so the number of entries grows over time: at the next heartbeat iteration the unavailable member is 2 log entries behind, then 3, and so on. As a result, the LogEntry[] array keeps growing. At the same time, ArrayPool keeps strong references to all rented arrays, and every newly requested array is longer than the arrays already in the pool. Therefore, the GC cannot collect those arrays until a full blocking GC occurs (ArrayPool releases all of its internal arrays on a full blocking GC).
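The leak pattern is easy to demonstrate with a language-agnostic sketch. The pool below mimics the size-bucketed, strong-reference behavior of ArrayPool<T> (names and the rounding rule are illustrative, not the .NET implementation): when every rented array is one entry longer than the last, the pool ends up pinning one array per size bucket, and none of them can be reclaimed while the pool holds them.

```python
# Sketch (not the dotNext/.NET implementation): a size-bucketed array pool
# that, like ArrayPool<T>.Shared, keeps strong references to returned arrays.
class BucketedPool:
    def __init__(self):
        self.buckets = {}  # bucket size -> list of retained arrays

    @staticmethod
    def bucket_size(n):
        # Round the request up to the next power of two.
        size = 1
        while size < n:
            size *= 2
        return size

    def rent(self, n):
        size = self.bucket_size(n)
        bucket = self.buckets.setdefault(size, [])
        return bucket.pop() if bucket else [None] * size

    def return_(self, array):
        self.buckets.setdefault(len(array), []).append(array)

    def retained_slots(self):
        return sum(len(a) for b in self.buckets.values() for a in b)

pool = BucketedPool()
# Each heartbeat the unreachable follower is one more entry behind,
# so every rented LogEntry[] is larger than the last ...
for entries_behind in range(1, 1025):
    array = pool.rent(entries_behind)
    pool.return_(array)
# ... and the pool now retains one array per power-of-two bucket,
# none of which can be collected while the pool holds them.
print(len(pool.buckets))      # 11 buckets: 1, 2, 4, ..., 1024
print(pool.retained_slots())  # 2047 slots pinned in memory
```

The sketch only illustrates why a monotonically growing request size defeats pooling: almost every rent misses the existing buckets, yet everything returned stays strongly referenced.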
Mitigation: my first attempt was to fix the calculation of the expected array length (we have snapshotting, so in practice the length cannot grow indefinitely) and to change the pooling mechanism so that it avoids keeping strong references to unused arrays. This approach is implemented by the commits referenced above. I called this a naive approach.
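The "avoid keeping strong references to unused arrays" direction can be sketched with weak references: the pool remembers returned buffers only weakly, so the GC is free to reclaim them. This is an illustrative Python stand-in, not the dotNext implementation; the `Buffer` wrapper exists only because plain Python lists cannot be weakly referenced.

```python
import gc
import weakref

class Buffer:
    """Wrapper so the pooled object is weakly referenceable."""
    def __init__(self, size):
        self.slots = [None] * size

class WeakPool:
    # Holds only weak references, so an unused pooled buffer
    # does not survive a garbage collection.
    def __init__(self):
        self._refs = []

    def rent(self, size):
        for ref in self._refs:
            buf = ref()
            if buf is not None and len(buf.slots) >= size:
                self._refs.remove(ref)
                return buf
        return Buffer(size)  # nothing usable survived; allocate fresh

    def return_(self, buf):
        self._refs.append(weakref.ref(buf))

pool = WeakPool()
pool.return_(Buffer(1024))
gc.collect()  # the pooled buffer is only weakly held, so it is collected
print(pool.rent(8).slots == [None] * 8)  # a fresh, right-sized buffer: True
```

The trade-off is visible even in the sketch: reclaimed buffers mean re-allocation on the hot path, which is one reason this counts as naive compared to eliminating the materialized collection entirely.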
To fix the issue completely and keep the hot path optimized, I decided to change the whole approach. The reader expects a generic TList that implements IReadOnlyList<LogEntry>, so we are free to switch from a materialized collection (such as an array) to a lazy, ephemeral collection. In that case, the memory needed to represent the collection does not depend on the number of log entries inside it; such a collection has O(1) space complexity, in contrast to the O(n) of regular arrays or lists. Of course, there is no free lunch: reducing space complexity usually increases time complexity, and with a lazy collection, log entry lookup is now O(n). Luckily, the internal design of the WAL keeps all partitions in sorted order, and the replication process executed on each heartbeat enumerates log entries sequentially, from 0 to the end of the list. In combination with the sorted partitions, this usage pattern makes it possible to achieve O(1) lookup time in the optimistic case. Voilà: memory allocations are now eliminated completely.
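The lazy-collection idea can be sketched as follows (an illustrative Python stand-in for the IReadOnlyList<LogEntry> view; all names are invented, not the dotNext API). The view holds only a memoized cursor into the sorted partitions, so it is O(1) in space; sequential index access advances the cursor in O(1), while random access behind the cursor falls back to an O(n) scan.

```python
# Illustrative sketch of a lazy, O(1)-space log-entry view over sorted
# partitions (names invented; not the dotNext API).
class LazyLogView:
    def __init__(self, partitions):
        # partitions: list of lists, globally sorted by log index
        self._partitions = partitions
        self._count = sum(len(p) for p in partitions)
        # Memoized cursor: (partition number, global index of its first entry).
        self._cursor = (0, 0)

    def __len__(self):
        return self._count

    def __getitem__(self, index):
        part, first = self._cursor
        if index < first:            # random access behind the cursor:
            part, first = 0, 0       # fall back to a scan from the start, O(n)
        while index >= first + len(self._partitions[part]):
            first += len(self._partitions[part])
            part += 1
        self._cursor = (part, first)  # sequential reads stay O(1)
        return self._partitions[part][index - first]

view = LazyLogView([[10, 11], [12, 13, 14], [15]])
print([view[i] for i in range(len(view))])  # [10, 11, 12, 13, 14, 15]
```

Because replication enumerates entries strictly forward, the expensive reset branch is never taken on the hot path, which is the optimistic O(1) case described above.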
Most of the changes are in the internal API, which is not visible to library consumers, except for some small modifications to the IRaftClusterMember interface. However, that interface is typically not used by consumers directly (unless someone has decided to write their own network transport for Raft), so I think this is an acceptable risk.
You can try the new version in the develop branch. I tested it using the existing tests and got 26 hours of uptime without any memory leaks.
I noticed that with the latest version, before killing any nodes, the working set in the Linux run keeps growing, and not only for the leader but for all nodes. Additionally, while the leader usually consumes more memory than the followers, it now seems to consume a bit more than usual. In the last 30 minutes the leader grew from 106 MB to 264 MB, and the followers from 77 MB to 173 MB.
I need to stop the run now; there are no signs of the working set stabilizing, and it looks a lot like it would keep growing. I will do more runs tomorrow.
Unfortunately, a growing working set is weak evidence of a memory leak: the GC may commit and decommit memory pages at its own discretion. Gen 2, LOH, and POH sizes are better indicators. For instance, https://github.com/dotnet/dotNext/issues/184#issuecomment-1645743653 demonstrates obvious LOH growth over time. Currently, at least the POH and LOH are not growing.
I get it, especially in the context of .NET. However, the difference in behavior hints that it will show a leak pattern on a longer run. I will do one and report back. If it only reaches 1 GB, that is normal, but not so much if it grows beyond that.
The allocation rate of the leader is higher when it tries to reach an unavailable member. This happens because of exceptions, and I'm not sure I can do much here: it's by design, as .NET prefers exceptions for exceptional cases. No connection with the remote peer? Throw SocketException. Operation canceled? Throw OperationCanceledException. A whole chain of exceptions is required to determine that the member is unavailable, and all these exceptions require memory allocation in every heartbeat iteration. "Do not build logic around exceptions" is good advice in theory, but sometimes it's not practical (especially when there is no choice).
Startup:
[System.Runtime]
% Time in GC since last GC (%) 0
Allocation Rate (B / 1 sec) 162,784
CPU Usage (%) 0
Exception Count (Count / 1 sec) 62
GC Committed Bytes (MB) 25
GC Fragmentation (%) 1.28
GC Heap Size (MB) 8
Gen 0 GC Count (Count / 1 sec) 0
Gen 0 Size (B) 192
Gen 1 GC Count (Count / 1 sec) 0
Gen 1 Size (B) 860,624
Gen 2 GC Count (Count / 1 sec) 0
Gen 2 Size (B) 192
IL Bytes Jitted (B) 298,445
LOH Size (B) 192
Monitor Lock Contention Count (Count / 1 sec) 0
Number of Active Timers 2
Number of Assemblies Loaded 54
Number of Methods Jitted 4,326
POH (Pinned Object Heap) Size (B) 48,360
ThreadPool Completed Work Item Count (Count / 1 sec) 102
ThreadPool Queue Length 0
ThreadPool Thread Count 9
Time spent in JIT (ms / 1 sec) 0
Working Set (MB) 89
2 hours uptime:
[System.Runtime]
% Time in GC since last GC (%) 0
Allocation Rate (B / 1 sec) 172,768
CPU Usage (%) 0
Exception Count (Count / 1 sec) 69
GC Committed Bytes (MB) 34
GC Fragmentation (%) 7.567
GC Heap Size (MB) 8
Gen 0 GC Count (Count / 1 sec) 0
Gen 0 Size (B) 192
Gen 1 GC Count (Count / 1 sec) 0
Gen 1 Size (B) 540,048
Gen 2 GC Count (Count / 1 sec) 0
Gen 2 Size (B) 1,069,736
IL Bytes Jitted (B) 324,805
LOH Size (B) 192
Monitor Lock Contention Count (Count / 1 sec) 0
Number of Active Timers 2
Number of Assemblies Loaded 54
Number of Methods Jitted 4,602
POH (Pinned Object Heap) Size (B) 48,360
ThreadPool Completed Work Item Count (Count / 1 sec) 101
ThreadPool Queue Length 0
ThreadPool Thread Count 11
Time spent in JIT (ms / 1 sec) 0
Working Set (MB) 102
The LOH and POH are not growing. The GC heap size is not growing either; during the run it was larger than 8 MB at times, with a maximum of 25 MB during observation. This shows that the GC works correctly, collecting garbage and returning the heap to its initial startup size. The working set is growing (slowly), but its growth rate correlates with GC heap fragmentation (holes in the heap). Explanation: the GC sees no reason to trigger a full blocking GC, which would defragment the heap; instead, it prefers to request more memory pages from the OS, which grows the working set.
Sorry for the noise; the leak is definitely fixed in the target environment, which is 32-bit ARM Raspberry Pi OS (Debian based). In this environment the memory stays around 50 MB regardless of a follower going down, and the leader's working set is less than 1 MB extra.
The growing working set was observed in x64 WSL Debian (453 MB and growing after about 3 hours). However, that environment was only used to share a reproduction of the original issue that could run on a regular desktop, so from our side it can be ignored. I can let it run for longer in case you are interested in whether the working set eventually stops growing; after all, the working set is supposed to be pages resident in physical memory (unlike virtual memory).
I do notice plenty of elections while running the example with the 10-saves-per-second workload (even when all 3 nodes are running). This happens both when running the nodes on 3 different Pis and when running them all on the same Pi. After one hour the term is 228 so far. I'll run the earlier version to confirm this is not a regression.
The run with the latest version eventually stopped like this (192.168.100.8 was the leader before and after that last leader change):
follower:
Accepting value 20115500
Building snapshot
Accepting value 20116000
Consensus cannot be reached
Term of local cluster member is 255. Election timeout 00:00:00.2360000
New cluster leader is elected. Leader address is 192.168.100.8:3262
Term of local cluster member is 256. Election timeout 00:00:00.2360000
Accepting value 20116500
leader:
Accepting value 20116000
Saving value 20116500 generated by the leader node
Unexpected error System.OperationCanceledException: The operation was canceled.
 ---> System.OperationCanceledException: The operation was canceled.
   at DotNext.Threading.Tasks.ValueTaskCompletionSource`1.GetResult(Int16 token) in /_/src/DotNext.Threading/Threading/Tasks/ValueTaskCompletionSource.T.cs:line 230
   at DotNext.Threading.Tasks.ValueTaskCompletionSource`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) in /_/src/DotNext.Threading/Threading/Tasks/ValueTaskCompletionSource.T.cs:line 237
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.ReplicateAsync[TEntry](TEntry entry, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 1239
   --- End of inner exception stack trace ---
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.ReplicateAsync[TEntry](TEntry entry, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 1239
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at RaftNode.DataModifier.ExecuteAsync(CancellationToken stoppingToken) in C:\Users\fredd\source\repos\dotNext\src\Examples\RaftNode\DataModifier.cs:line 32
Consensus cannot be reached
Term of local cluster member is 255. Election timeout 00:00:00.1770000
New cluster leader is elected. Leader address is 192.168.100.8:3262
Term of local cluster member is 256. Election timeout 00:00:00.1770000
Saving value 20116500 generated by the leader node
Accepting value 20116500
Building snapshot
This is the second time it has stopped making progress with the new version. Earlier I saw it when running all nodes on a single Pi, and now when running the nodes on different Pis (a setup I used to match the original run precisely because I got the hang).
I am starting the run with the previous version now.
By the way, restarting the follower resumed progress.
The frequent elections mentioned above are not a regression, as they reproduce with the earlier version too. To avoid the leader changes, the target environment needed other setting changes that were not required for the minimal reproduction I shared in the original post.
I will keep it running to validate that this other behavior is not a regression: https://github.com/dotnet/dotNext/issues/184#issuecomment-1660235577.
I reproduced https://github.com/dotnet/dotNext/issues/184#issuecomment-1660235577 with the old version (14.3), so none of the issues I hit alongside the leak fixes seem to be regressions. The scope of this issue is therefore done.
The stop of progress also reproduced with the latest fixes in the longer WSL Debian working-set run (it reached 650 MB before progress stopped), but I assume the old version would reproduce it too.
That said, the latest version I have deployed on real systems is 4.12.3, and it runs with settings where leader changes are rare, so it is hard for me to say whether these stops were introduced by recent changes. I will probably do a run with 4.12.3 to see whether it reproduces there too, in case something was introduced in more recent versions.
Alright, I'm preparing a new release with the memory leak fix. Feel free to open a new issue.
Possibly, I have an explanation of this behavior, but let's deal with it in a separate task.
super, thanks!
Fixed in 4.13.1
I also reproduced it with the latest in main (14.3).
It also reproduces on Linux (Debian WSL).
To reproduce:
dotnet publish Examples\RaftNode -p:PublishProfile=FolderProfile -c Release -r linux-x64 --self-contained -f net6.0