dotnet / dotNext

Next generation API for .NET
https://dotnet.github.io/dotNext/

leader memory leak when a node is down in linux - raspberry or debian #184

Closed: freddyrios closed this issue 11 months ago

freddyrios commented 11 months ago

Reproduced it also with the latest in main (14.3).

It also reproduces on Linux (Debian under WSL).

To reproduce:

sakno commented 11 months ago

Confirmed.

State of the leader immediately after stopping the follower:

[System.Runtime]
    % Time in GC since last GC (%)                                         0    
    Allocation Rate (B / 1 sec)                                      272,072    
    CPU Usage (%)                                                          1    
    Exception Count (Count / 1 sec)                                      129    
    GC Committed Bytes (MB)                                               38    
    GC Fragmentation (%)                                                   0.193
    GC Heap Size (MB)                                                     18    
    Gen 0 GC Count (Count / 1 sec)                                         0    
    Gen 0 Size (B)                                                       192    
    Gen 1 GC Count (Count / 1 sec)                                         0    
    Gen 1 Size (B)                                                 2,057,480    
    Gen 2 GC Count (Count / 1 sec)                                         0    
    Gen 2 Size (B)                                                   600,912    
    IL Bytes Jitted (B)                                              303,868    
    LOH Size (B)                                                   4,130,024    
    Monitor Lock Contention Count (Count / 1 sec)                          0    
    Number of Active Timers                                                2    
    Number of Assemblies Loaded                                           54    
    Number of Methods Jitted                                           4,390    
    POH (Pinned Object Heap) Size (B)                                 48,360    
    ThreadPool Completed Work Item Count (Count / 1 sec)                 188    
    ThreadPool Queue Length                                                0    
    ThreadPool Thread Count                                                9    
    Time spent in JIT (ms / 1 sec)                                         0    
    Working Set (MB)                                                     102

30 mins later:

[System.Runtime]
    % Time in GC since last GC (%)                                         0    
    Allocation Rate (B / 1 sec)                                      399,528    
    CPU Usage (%)                                                          1    
    Exception Count (Count / 1 sec)                                      127    
    GC Committed Bytes (MB)                                               50    
    GC Fragmentation (%)                                                   0.189
    GC Heap Size (MB)                                                     28    
    Gen 0 GC Count (Count / 1 sec)                                         0    
    Gen 0 Size (B)                                                       192    
    Gen 1 GC Count (Count / 1 sec)                                         0    
    Gen 1 Size (B)                                                 3,379,200    
    Gen 2 GC Count (Count / 1 sec)                                         0    
    Gen 2 Size (B)                                                   600,912    
    IL Bytes Jitted (B)                                              308,623    
    LOH Size (B)                                                  14,747,920    
    Monitor Lock Contention Count (Count / 1 sec)                          0    
    Number of Active Timers                                                2    
    Number of Assemblies Loaded                                           54    
    Number of Methods Jitted                                           4,437    
    POH (Pinned Object Heap) Size (B)                                 48,360    
    ThreadPool Completed Work Item Count (Count / 1 sec)                 193    
    ThreadPool Queue Length                                                0    
    ThreadPool Thread Count                                                9    
    Time spent in JIT (ms / 1 sec)                                         0   
    Working Set (MB)                                                     115

Judging by the LOH growth, it seems some large buffers are leaking.

sakno commented 11 months ago

Root cause analysis: currently, the WAL uses memory pooling to rent a LogEntry[] array internally. When all members are healthy, the array has a fixed length in every heartbeat iteration. When one of them becomes unreachable, the leader cannot make any progress on it, so the number of entries to replicate grows over time: at the next heartbeat iteration the unavailable member is 2 log entries behind, then 3, etc. As a result, the LogEntry[] array keeps growing. At the same time, ArrayPool keeps strong references to all requested arrays, and every newly requested array is larger than the arrays already in the pool. Therefore, the GC cannot collect those arrays until a full blocking GC occurs (ArrayPool releases all of its internal arrays in case of a full blocking GC).
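
Here is a minimal sketch of that failure mode; the LogEntry type and the method name are illustrative stand-ins, not the actual WAL code:

using System.Buffers;

// Each heartbeat rents a slightly larger LogEntry[] than the previous one, so the
// shared ArrayPool accumulates strong references to ever-larger arrays that are
// released only by a full blocking GC.
public readonly struct LogEntry { /* payload omitted */ }

public static class HeartbeatSketch
{
    public static void ReplicateToLaggingMember(int heartbeatIteration)
    {
        // The unreachable member falls one more entry behind on every iteration,
        // so the requested buffer keeps growing.
        int entriesBehind = heartbeatIteration + 1;

        LogEntry[] buffer = ArrayPool<LogEntry>.Shared.Rent(entriesBehind);
        try
        {
            // ... fill the buffer with log entries and attempt replication ...
        }
        finally
        {
            // Returning the array does not free it: the pool keeps a strong
            // reference to it, while the next Rent() asks for an even larger
            // bucket, so pooled arrays pile up until a full blocking GC.
            ArrayPool<LogEntry>.Shared.Return(buffer);
        }
    }
}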

Mitigation: the first attempt I made was to fix the calculation of the expected length of the array (we have snapshotting, so the length cannot grow indefinitely in practice) and to change the pooling mechanism so that it avoids keeping strong references to unused arrays (a sketch of one such pooling scheme follows the list below). This approach is implemented by the commits referenced above. I call it a naive approach for the following reasons:

  1. Allocation of the array is still required for each read (which is a hot path)
  2. Due to their lifetime, unused arrays from the pool will always end up in Gen 2, so they can only be reclaimed by a full GC
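
For illustration only, a hypothetical pooling scheme (not the implementation from the commits above): the pool caches its buffer through a WeakReference, so an unused array is not rooted by the pool and stays collectible. Thread safety is omitted for brevity.

using System;

public sealed class WeakBufferPool<T>
{
    private readonly WeakReference<T[]> cached = new(Array.Empty<T>());

    public T[] Rent(int minimumLength)
    {
        // Reuse the cached array only if it is still alive and large enough.
        if (cached.TryGetTarget(out var buffer) && buffer.Length >= minimumLength)
        {
            cached.SetTarget(Array.Empty<T>());
            return buffer;
        }

        // Otherwise allocate; this is why reads still allocate on the hot path.
        return new T[minimumLength];
    }

    // The pool remembers the returned array only weakly, so it does not keep
    // the array alive across collections if nobody else needs it.
    public void Return(T[] array) => cached.SetTarget(array);
}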

To completely fix the issue and keep the hot path optimized, I decided to change the whole approach. The reader expects a generic TList that implements IReadOnlyList<LogEntry>, so we are free to switch from a materialized collection (such as an array) to a lazy, ephemeral collection. In that case, the memory needed to represent the collection of log entries doesn't depend on the number of log entries inside of it. In other words, such a collection has O(1) space complexity, in contrast to the O(n) of regular arrays or lists. Of course, there is no free lunch: usually, reducing space complexity increases time complexity, e.g. with the lazy collection the log entry lookup time is now O(n). Luckily, the internal design of the WAL keeps all partitions in sorted order, and the replication process executed on each heartbeat enumerates log entries sequentially, from 0 to the end of the list. In combination with sorted partitions and that access pattern, it is possible to achieve O(1) lookup time in the optimistic case. Voila, now memory allocations are eliminated completely.
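
A simplified sketch of the idea, assuming partitions sorted by index (the type and member names here are illustrative, not dotNext APIs; the real implementation is linked in the next comment): the list exposes a range of log entries without materializing them and remembers which partition served the previous lookup, so sequential enumeration stays O(1).

using System;
using System.Collections;
using System.Collections.Generic;

public readonly struct LogEntry { public long Index { get; init; } }

public sealed class Partition
{
    public long FirstIndex { get; init; }
    public long LastIndex { get; init; }

    // Stand-in for reading an entry from the partition file.
    public LogEntry Read(long absoluteIndex) => new() { Index = absoluteIndex };
}

public sealed class LazyLogEntryList : IReadOnlyList<LogEntry>
{
    private readonly Partition[] partitions; // sorted by FirstIndex
    private readonly long startIndex;
    private int lastPartition; // cursor over the partition that served the last lookup

    public LazyLogEntryList(Partition[] sortedPartitions, long startIndex, int count)
    {
        partitions = sortedPartitions;
        this.startIndex = startIndex;
        Count = count;
    }

    public int Count { get; }

    public LogEntry this[int index]
    {
        get
        {
            if ((uint)index >= (uint)Count)
                throw new ArgumentOutOfRangeException(nameof(index));

            long absoluteIndex = startIndex + index;

            // Optimistic O(1) path: the replication loop reads sequentially, so the
            // target is usually in the same partition as the last call, or the next one.
            for (int i = lastPartition; i < partitions.Length; i++)
            {
                Partition p = partitions[i];
                if (absoluteIndex >= p.FirstIndex && absoluteIndex <= p.LastIndex)
                {
                    lastPartition = i;
                    return p.Read(absoluteIndex);
                }
            }

            // Non-sequential access falls back to a scan from the first partition (O(n)).
            for (int i = 0; i < lastPartition; i++)
            {
                Partition p = partitions[i];
                if (absoluteIndex >= p.FirstIndex && absoluteIndex <= p.LastIndex)
                {
                    lastPartition = i;
                    return p.Read(absoluteIndex);
                }
            }

            throw new InvalidOperationException("The requested log entry is not in any partition.");
        }
    }

    public IEnumerator<LogEntry> GetEnumerator()
    {
        for (int i = 0; i < Count; i++)
            yield return this[i];
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}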

Most of the changes are in internal APIs that are not visible to library consumers, except for some small modifications to the IRaftClusterMember interface. However, its public API surface is typically not used by consumers (except when someone decides to write their own network transport for Raft), so I think this is an acceptable risk.

You can try the new version in the develop branch. I tested it using the existing tests and got 26 hours of uptime without any memory leaks.

sakno commented 11 months ago

There is an implementation of lazy list: https://github.com/dotnet/dotNext/blob/34d0c1075a51b4039bd79bd23c6326a99271675b/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.Internal.cs#L361-L452

freddyrios commented 11 months ago

I noticed that with the latest version, before killing any nodes, the working set in the Linux run keeps growing, not only for the leader but for all nodes. Additionally, while the leader usually consumes more memory, it now seems to use a bit more than usual. In the last 30 minutes the leader grew from 106 MB to 264 MB, and the followers from 77 MB to 173 MB.

I need to stop the run now; there are no signs of the working set stabilizing, and it looks a lot like it would keep growing. I will do more runs tomorrow.

sakno commented 11 months ago

A growing working set is weak evidence of a memory leak, unfortunately: the GC may commit and decommit memory pages at its own discretion. Gen 2, LOH, and POH sizes are better indicators of a memory leak. For instance, https://github.com/dotnet/dotNext/issues/184#issuecomment-1645743653 demonstrates obvious LOH growth over time. Currently, at least the POH and LOH are not growing.
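
For reference, one way (assuming .NET 6 or later) to read the same indicators from inside the process instead of via counters is GC.GetGCMemoryInfo, which exposes per-generation sizes, with the LOH and POH reported as generations 3 and 4:

using System;

static class HeapIndicators
{
    public static void Print()
    {
        GCMemoryInfo info = GC.GetGCMemoryInfo();
        ReadOnlySpan<GCGenerationInfo> generations = info.GenerationInfo;

        Console.WriteLine($"GC heap size:     {info.HeapSizeBytes:N0} B");
        Console.WriteLine($"Fragmented bytes: {info.FragmentedBytes:N0} B");
        Console.WriteLine($"Gen 2 size:       {generations[2].SizeAfterBytes:N0} B");
        Console.WriteLine($"LOH size:         {generations[3].SizeAfterBytes:N0} B");
        Console.WriteLine($"POH size:         {generations[4].SizeAfterBytes:N0} B");
    }
}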

freddyrios commented 11 months ago

I get it, especially in the context of .NET. However, the difference in behavior hints that it will show a leak pattern over a longer run. I will do one and report back. If it only reaches 1 GB that is normal, but not so much if it grows beyond that.

sakno commented 11 months ago

The allocation rate of the leader is higher when it tries to reach an unavailable member. That happens because of exceptions, and I'm not sure I can do much here. It's by design: in exceptional cases, .NET prefers exceptions. No connection to the remote peer? Throw SocketException. Operation canceled? Throw OperationCanceledException. A whole chain of exceptions is required to determine that the member is unavailable, and all of these exceptions require memory allocations in every heartbeat iteration. "Don't build logic around exceptions" is good advice in theory, but sometimes it's not practical (especially when there is no choice).
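
An illustrative sketch of that pattern (the delegate and method names are hypothetical, not the dotNext internals): the only way to learn that a peer is unreachable is to let the attempt throw, so every heartbeat toward a dead member allocates at least one exception.

using System;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

static class HeartbeatProbe
{
    public static async Task<bool> TryReplicateAsync(
        Func<CancellationToken, Task> sendAppendEntries, // stand-in for the real network call
        CancellationToken token)
    {
        try
        {
            await sendAppendEntries(token);
            return true; // the member responded
        }
        catch (SocketException)
        {
            // No connection with the remote peer.
            return false;
        }
        catch (OperationCanceledException)
        {
            // The per-request timeout elapsed before the peer answered.
            return false;
        }
    }
}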

sakno commented 11 months ago

Startup:

[System.Runtime]
    % Time in GC since last GC (%)                                         0    
    Allocation Rate (B / 1 sec)                                      162,784    
    CPU Usage (%)                                                          0    
    Exception Count (Count / 1 sec)                                       62    
    GC Committed Bytes (MB)                                               25    
    GC Fragmentation (%)                                                   1.28 
    GC Heap Size (MB)                                                      8    
    Gen 0 GC Count (Count / 1 sec)                                         0    
    Gen 0 Size (B)                                                       192    
    Gen 1 GC Count (Count / 1 sec)                                         0    
    Gen 1 Size (B)                                                   860,624    
    Gen 2 GC Count (Count / 1 sec)                                         0    
    Gen 2 Size (B)                                                       192    
    IL Bytes Jitted (B)                                              298,445    
    LOH Size (B)                                                         192    
    Monitor Lock Contention Count (Count / 1 sec)                          0    
    Number of Active Timers                                                2    
    Number of Assemblies Loaded                                           54    
    Number of Methods Jitted                                           4,326    
    POH (Pinned Object Heap) Size (B)                                 48,360    
    ThreadPool Completed Work Item Count (Count / 1 sec)                 102    
    ThreadPool Queue Length                                                0    
    ThreadPool Thread Count                                                9    
    Time spent in JIT (ms / 1 sec)                                         0    
    Working Set (MB)                                                      89

2 hours uptime:

[System.Runtime]
    % Time in GC since last GC (%)                                         0    
    Allocation Rate (B / 1 sec)                                      172,768    
    CPU Usage (%)                                                          0    
    Exception Count (Count / 1 sec)                                       69    
    GC Committed Bytes (MB)                                               34    
    GC Fragmentation (%)                                                   7.567
    GC Heap Size (MB)                                                      8    
    Gen 0 GC Count (Count / 1 sec)                                         0    
    Gen 0 Size (B)                                                       192    
    Gen 1 GC Count (Count / 1 sec)                                         0    
    Gen 1 Size (B)                                                   540,048    
    Gen 2 GC Count (Count / 1 sec)                                         0    
    Gen 2 Size (B)                                                 1,069,736    
    IL Bytes Jitted (B)                                              324,805    
    LOH Size (B)                                                         192    
    Monitor Lock Contention Count (Count / 1 sec)                          0    
    Number of Active Timers                                                2    
    Number of Assemblies Loaded                                           54    
    Number of Methods Jitted                                           4,602    
    POH (Pinned Object Heap) Size (B)                                 48,360    
    ThreadPool Completed Work Item Count (Count / 1 sec)                 101    
    ThreadPool Queue Length                                                0    
    ThreadPool Thread Count                                               11    
    Time spent in JIT (ms / 1 sec)                                         0
    Working Set (MB)                                                     102

LOH and POH are not growing. GC Heap Size is not growing either; during the run it was larger than 8 MB (its maximum was 25 MB during observation), but that only shows the GC is working correctly, collecting garbage and bringing the heap back to its size at startup. The working set is growing (slowly), but its growth correlates with GC heap fragmentation (holes in the heap). Explanation: the GC sees no reason to trigger a full blocking GC, which would defragment the heap; instead, it prefers to request more memory pages from the OS (which grows the working set).
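
If you want to confirm that interpretation on your side, a diagnostic you could run manually (not something the library does on its own) is to force a full, blocking, compacting collection, including LOH compaction, and compare the fragmented bytes before and after; if the working set growth is fragmentation rather than a leak, the fragmented bytes should drop significantly.

using System;
using System.Runtime;

static class FragmentationCheck
{
    public static void CompactAndReport()
    {
        long fragmentedBefore = GC.GetGCMemoryInfo().FragmentedBytes;

        // Request LOH compaction on the next full collection, then force one.
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect(2, GCCollectionMode.Forced, blocking: true, compacting: true);

        long fragmentedAfter = GC.GetGCMemoryInfo().FragmentedBytes;
        Console.WriteLine($"Fragmented bytes: {fragmentedBefore:N0} -> {fragmentedAfter:N0}");
    }
}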

freddyrios commented 11 months ago

Sorry for the noise, the leak is definitely fixed in the target environment, which is ARM32 Raspberry Pi OS (Debian based). In this environment the memory stays around 50 MB regardless of a follower going down, and the leader's working set is less than 1 MB larger.

The growing working set was observed on x64 Debian under WSL (453 MB and growing after about 3 hours). However, that environment was only used to share a reproduction of the original issue that could run on a regular desktop, so from our side it can be ignored. I can let it run for longer in case you are interested in whether the working set eventually stops growing, since after all the working set is supposed to be pages resident in physical memory (unlike virtual memory).

I do notice plenty of elections while running the example with 10 saves per second (even when all 3 nodes are running). This happens both when running the nodes on 3 different Pis and when running them all on the same Pi. After one hour the term is 228 so far. I'll run the earlier version to confirm this is not a regression.

freddyrios commented 11 months ago

The run with the latest version eventually stopped like this (192.168.100.8 was the leader before and after that last leader change):

follower:

Accepting value 20115500
Building snapshot
Accepting value 20116000
Consensus cannot be reached
Term of local cluster member is 255. Election timeout 00:00:00.2360000
New cluster leader is elected. Leader address is 192.168.100.8:3262
Term of local cluster member is 256. Election timeout 00:00:00.2360000
Accepting value 20116500

leader:

Accepting value 20116000                                                                       
Saving value 20116500 generated by the leader node
Unexpected error System.OperationCanceledException: The operation was canceled.
 ---> System.OperationCanceledException: The operation was canceled.
   at DotNext.Threading.Tasks.ValueTaskCompletionSource`1.GetResult(Int16 token) in /_/src/DotNext.Threading/Threading/Tasks/ValueTaskCompletionSource.T.cs:line 230
   at DotNext.Threading.Tasks.ValueTaskCompletionSource`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token) in /_/src/DotNext.Threading/Threading/Tasks/ValueTaskCompletionSource.T.cs:line 237
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.ReplicateAsync[TEntry](TEntry entry, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 1239
   --- End of inner exception stack trace ---
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.ReplicateAsync[TEntry](TEntry entry, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 1239
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at RaftNode.DataModifier.ExecuteAsync(CancellationToken stoppingToken) in C:\Users\fredd\source\repos\dotNext\src\Examples\RaftNode\DataModifier.cs:line 32
Consensus cannot be reached
Term of local cluster member is 255. Election timeout 00:00:00.1770000
New cluster leader is elected. Leader address is 192.168.100.8:3262
Term of local cluster member is 256. Election timeout 00:00:00.1770000
Saving value 20116500 generated by the leader node
Accepting value 20116500
Building snapshot

This is the second time it stopped making progress with the new version. Earlier I saw it when running all nodes on a single Pi, and now when running all nodes on different Pis (which I did to match the original run, precisely because I got the hang).

I am starting the run with the previous version now.

freddyrios commented 11 months ago

By the way, restarting the follower resumed progress.

freddyrios commented 11 months ago

I do notice plenty of elections while running the example with 10 saves per second (even when all 3 nodes are running). This happens both when running the nodes on 3 different Pis and when running them all on the same Pi. After one hour the term is 228 so far. I'll run the earlier version to confirm this is not a regression.

This part is not a regression, as it reproduces with the earlier version. To avoid the leader changes, the target environment needed other settings changes that were not needed for the minimal reproduction I shared in the original post.

I will keep it running to validate this other behavior is not a regression https://github.com/dotnet/dotNext/issues/184#issuecomment-1660235577.

freddyrios commented 11 months ago

I reproduced https://github.com/dotnet/dotNext/issues/184#issuecomment-1660235577 with the old version (14.3), so none of the issues I hit alongside the leak fixes seem to be regressions. So the scope of this issue is done.

The stall also reproduced with the latest fixes in the longer working-set-growth run I had going on WSL Debian (it reached 650 MB before it stopped making progress), but I assume the old version would reproduce it too.

That said, the latest version I have deployed on our systems is 4.12.3, and it runs with settings where leader changes are rare, so it is hard for me to say whether a regression in the recent changes is causing these stalls. I will probably do a run with 4.12.3 to see if it also reproduces there, in case something was introduced in more recent versions.

sakno commented 11 months ago

Alright, I'm preparing a new release with the memory leak fix. Feel free to open a new issue.

sakno commented 11 months ago

I reproduced #184 (comment) with the old version (14.3), so none of the issues I hit alongside the leak fixes seem to be regressions. So the scope of this issue is done.

The stall also reproduced with the latest fixes in the longer working-set-growth run I had going on WSL Debian (it reached 650 MB before it stopped making progress), but I assume the old version would reproduce it too.

That said, the latest version I have deployed on our systems is 4.12.3, and it runs with settings where leader changes are rare, so it is hard for me to say whether a regression in the recent changes is causing these stalls. I will probably do a run with 4.12.3 to see if it also reproduces there, in case something was introduced in more recent versions.

I possibly have an explanation for this behavior, but let's deal with that in a separate task.

freddyrios commented 11 months ago

super, thanks!

sakno commented 11 months ago

Fixed in 4.13.1