basho / riak_kv

Riak Key/Value Store
Apache License 2.0

Memory fragmentation #1820

Open martinsumner opened 2 years ago

martinsumner commented 2 years ago

There have been reports from users of Riak suffering from memory fragmentation on large clusters with the leveled backend.

This appears to be an issue with both OTP 22.3 and the Basho fork of OTP R16.

The primary indicator is:

recon_alloc:memory(usage).

which on affected nodes can be between 50% and 80%.

This is not universal; some leveled/Riak clusters do not see it. However, on clusters that do see it there is limited mitigation other than restarting nodes. The excess memory is taken from the page cache, and so can have a direct impact on performance. In extreme cases unexpected and sudden memory exhaustion has been observed.
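For reference, fragmentation can be quantified from a remote shell attached to the node, assuming the recon application is available (as it is in recent Riak builds). A minimal sketch:

Used      = recon_alloc:memory(used),       %% bytes in use by the VM's allocators
Allocated = recon_alloc:memory(allocated),  %% bytes the allocators have mapped from the OS
Used / Allocated.                           %% the same ratio recon_alloc:memory(usage) reports
%% A result of 0.5-0.8 means 20-50% of the memory mapped by the BEAM is
%% allocator overhead/fragmentation rather than live data.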

Horizontal scaling currently appears to offer better mitigation than vertical scaling.

The fragmentation is primarily in the eheap_alloc and binary_alloc allocators and is associated with inefficient use of multiblock carriers: the carriers never seem to transition to empty, and so are never reclaimed.

martinsumner commented 2 years ago

Changing the single-block carrier threshold (sbct) on the eheap_alloc to a lower value has a very negative impact. This change does not help:

erlang.eheap_memory.sbct = 128
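(For reference, this riak.conf setting presumably maps to the emulator flag +MHsbct, which is expressed in kilobytes. The threshold actually in effect can be confirmed with something like the sketch below; the exact layout of the system_info/1 result differs slightly between OTP releases, and the reported value may be in bytes rather than KB.)

%% Report the sbct value each eheap allocator instance is running with.
[{instance, N, proplists:lookup(sbct, proplists:get_value(options, Info, []))}
 || {instance, N, Info} <- erlang:system_info({allocator, eheap_alloc})].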

recon_alloc:memory(usage).
0.34210991984944206
(dev11@192.168.6.11)5> erlang:memory().          
[{total,7731430560},
 {processes,6054481520},
 {processes_used,6054116632},
 {system,1676949040},
 {atom,1017017},
 {atom_used,1009277},
 {binary,1307030592},
 {code,15398935},
 {ets,300506192}]
(dev11@192.168.6.11)6> recon_alloc:memory(allocated).
22463975424
(dev11@192.168.6.11)7>  rp(recon_alloc:fragmentation(current)).
[{{eheap_alloc,7},
  [{sbcs_usage,0.21770690720393554},
   {mbcs_usage,0.6661469330658784},
   {sbcs_block_size,384285504},
   {sbcs_carriers_size,1765150720},
   {mbcs_block_size,16152944},
   {mbcs_carriers_size,24248320}]},
 {{eheap_alloc,23},
  [{sbcs_usage,0.23321187683249425},
   {mbcs_usage,0.5346889303188131},
   {sbcs_block_size,249886832},
   {sbcs_carriers_size,1071501312},
   {mbcs_block_size,6938192},
   {mbcs_carriers_size,12976128}]},
 {{eheap_alloc,1},
...
martinsumner commented 2 years ago

Testing shows that in some cases either reducing lmbcs (in this case to 1280KB), or setting ramv to true for both the binary and eheap allocators, can reduce the memory footprint of Riak by better controlling fragmentation:

(chart: MemoryFragment_Memory)

However, the change to enable ramv comes at a CPU cost (as expected):

(chart: MemoryFragment_CPU)
martinsumner commented 2 years ago

The memory manipulation had an impact, but a relatively minor one. Most of the memory being used by the BEAM in this test is a result of leveled consuming memory for caches. Testing a change to leveled_sst to periodically release the blockindex_cache when not busy (and to add more GC triggers within the code) shows that changing SST caching makes a bigger difference:

(chart: MemoryCompare_HostMemFootprint)

GET requests are unevenly spread around the cluster with leveled: the GET request is directed to the node which responded fastest to the preceding HEAD request. So by examining how GET requests were distributed, we can see which nodes were performing better during the test:

(chart: MemoryCompare_GetShare)

Now looking at CPU by node, there is very little difference, so the "sst+ramv" node was doing more work and doing that work with relative efficiency:

(chart: MemoryCompare_CPULoad)

Following this point in the test, one node had issues, and scanning of the full store for either intra- or inter-cluster AAE increased. The nodes with the sst change then showed higher disk utilisation (as block index caches were fetched from disk), but maintained their memory advantage. The hypothesis is that releasing and regaining the cache has disk costs, but reduces fragmentation.

hmmr commented 2 years ago

Wouldn't it be appropriate, in addition to exposing various parameters affecting various allocators (https://github.com/basho/riak/tree/mas-i1820-controls), to expose flags to enable/disable specific allocators as such? My reasoning is that, since

When an allocator is disabled, sys_alloc is used instead of the disabled allocator.

we can hope to minimize any adversarial effects of having multiple allocators in active use simultaneously. I can imagine there exist allocation patterns (say, of binaries vs whatever objects the VM thinks are 'short-lived') that will tend to induce fragmentation more easily/reliably. Disabling those special-purpose allocators would then make it easier for the default allocator (sys_alloc) to deal with fragmentation, as it would see a more complete map of allocated blocks and free space. The default allocator maps to libc malloc, which I think has seen sufficient polish and fragmentation-proofing (but don't hold me to that).
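For what it's worth, which alloc_util allocators are present and enabled on a node can be checked directly; erlang:system_info({allocator, A}) returns false for a disabled allocator, in which case those allocations fall through to sys_alloc/malloc. A small sketch:

%% List the alloc_util allocators and whether each is currently enabled.
[{A, erlang:system_info({allocator, A}) =/= false}
 || A <- erlang:system_info(alloc_util_allocators)].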

martinsumner commented 2 years ago

As well as experimenting with memory settings, I've also done some experimentation with scheduler settings. It was noticeable that with changes to the scheduler settings I was getting variances in memory usage ... and sometimes improvements when compared to the different memory settings.

The scheduler changes have shown that, by default, too many schedulers are started, and this causes a huge number of context switches and consequently higher CPU usage (and presumably inefficient use of memory caches).
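For context, the scheduler configuration a node is actually running with can be snapshotted with standard system_info/1 calls, e.g.:

%% Schedulers configured vs. online, against the logical CPUs visible to the VM.
[{schedulers, erlang:system_info(schedulers)},
 {schedulers_online, erlang:system_info(schedulers_online)},
 {dirty_cpu_schedulers_online, erlang:system_info(dirty_cpu_schedulers_online)},
 {dirty_io_schedulers, erlang:system_info(dirty_io_schedulers)},
 {logical_processors_available, erlang:system_info(logical_processors_available)}].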

What is interesting here is that the non-default settings Basho introduced to counter scheduler-collapse problems in R16 now seem to be causing issues, and that we get better behaviour with the current stock Erlang settings.

I have some more tests to run before I have clear answers, but currently I'm considering whether supporting changes here may cause operational overheads ... and whether overall we're better off sticking to the well-tested BEAM defaults for whatever version we're running. When I look at all factors combined (memory, CPU use, latency, CPU efficiency), the default settings start to look more desirable.

Will post-up results as I get them.

@hmmr - with regards to exposing the disabling of allocators, I'm not sure. I might not expose/test that at this stage ... and in general, if moving towards rather than away from the BEAM defaults looks like the right way to go, I would be cautious about doing it at all.

martinsumner commented 2 years ago

One interesting thing that has been seen: in this test, 1 of the 4 nodes is put under greater pressure than the others. In this case we get a strange situation where msacc shows all schedulers as very busy, but the OS does not show the equivalent CPU busyness.

So on that node, top reports the equivalent of about 14 busy CPUs:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                         
 8589 root      20   0 24.686g 0.015t   5476 S  1374 16.5  12725:00 beam.smp   

But looking at microstate accounting at this time:

(dev11@192.168.6.11)1> msacc:start(10000).
true
(dev11@192.168.6.11)2> msacc:print().
Average thread real-time    :  10005371 us
Accumulated system run-time : 282190754 us
Average scheduler run-time  :   6997919 us

        Thread      aux check_io emulator       gc    other     port    sleep

Stats per thread:
     async( 0)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
     async( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
     async( 2)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
     async( 3)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
       aux( 1)   20.89%    0.51%    0.00%    0.00%    0.16%    0.00%   78.45%
dirty_cpu_( 1)    0.00%    0.00%    0.29%   26.42%    0.87%    0.00%   72.42%
dirty_cpu_( 2)    0.00%    0.00%    0.29%   34.74%    1.16%    0.00%   63.80%
dirty_cpu_( 3)    0.00%    0.00%    0.14%   22.66%    0.69%    0.00%   76.52%
dirty_cpu_( 4)    0.00%    0.00%    0.18%   27.57%    0.87%    0.00%   71.38%
dirty_cpu_( 5)    0.00%    0.00%    0.31%   29.68%    0.91%    0.00%   69.10%
dirty_cpu_( 6)    0.00%    0.00%    0.45%   27.61%    0.95%    0.00%   70.99%
dirty_cpu_( 7)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 8)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 9)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(10)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(11)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(12)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_io_s( 1)    0.00%    0.00%   69.90%    0.00%    9.17%    0.00%   20.93%
dirty_io_s( 2)    0.00%    0.00%   69.71%    0.00%    8.58%    0.00%   21.72%
dirty_io_s( 3)    0.00%    0.00%   69.07%    0.00%    9.41%    0.00%   21.52%
dirty_io_s( 4)    0.00%    0.00%   69.03%    0.00%    9.05%    0.00%   21.92%
dirty_io_s( 5)    0.00%    0.00%   70.62%    0.00%    8.19%    0.00%   21.19%
dirty_io_s( 6)    0.00%    0.00%   70.78%    0.00%    8.91%    0.00%   20.31%
dirty_io_s( 7)    0.00%    0.00%   70.06%    0.00%    8.72%    0.00%   21.22%
dirty_io_s( 8)    0.00%    0.00%   69.54%    0.00%    8.63%    0.00%   21.84%
dirty_io_s( 9)    0.00%    0.00%   69.41%    0.00%    8.77%    0.00%   21.83%
dirty_io_s(10)    0.00%    0.00%   69.69%    0.00%    8.96%    0.00%   21.35%
dirty_io_s(11)    0.00%    0.00%   69.89%    0.00%    8.32%    0.00%   21.79%
dirty_io_s(12)    0.00%    0.00%   70.52%    0.00%    8.87%    0.00%   20.62%
      poll( 0)    0.00%    0.66%    0.00%    0.00%    0.00%    0.00%   99.34%
 scheduler( 1)    4.48%    0.19%   62.99%   21.05%    3.40%    0.80%    7.08%
 scheduler( 2)    4.86%    0.19%   62.64%   21.34%    3.53%    0.81%    6.62%
 scheduler( 3)    4.38%    0.20%   63.29%   21.45%    3.36%    0.81%    6.52%
 scheduler( 4)    4.23%    0.19%   62.81%   21.40%    3.55%    0.86%    6.96%
 scheduler( 5)    4.30%    0.21%   63.01%   21.36%    3.53%    0.76%    6.83%
 scheduler( 6)    4.32%    0.22%   64.29%   20.29%    3.36%    0.66%    6.86%
 scheduler( 7)    4.48%    0.23%   63.57%   20.68%    3.45%    0.89%    6.70%
 scheduler( 8)    4.31%    0.18%   63.17%   21.52%    3.36%    0.79%    6.68%
 scheduler( 9)    4.14%    0.22%   63.84%   20.83%    3.42%    0.88%    6.67%
 scheduler(10)    4.14%    0.19%   63.47%   21.39%    3.48%    0.78%    6.54%
 scheduler(11)    4.30%    0.19%   63.75%   20.51%    3.43%    0.80%    7.03%
 scheduler(12)    4.35%    0.22%   62.91%   21.62%    3.52%    0.77%    6.61%
 scheduler(13)    4.19%    0.24%   61.97%   21.96%    3.63%    0.97%    7.03%
 scheduler(14)    4.74%    0.23%   61.25%   22.48%    3.62%    0.82%    6.87%
 scheduler(15)    4.27%    0.18%   63.32%   21.58%    3.32%    0.71%    6.63%
 scheduler(16)    4.24%    0.19%   62.98%   21.84%    3.33%    0.82%    6.59%
 scheduler(17)    4.39%    0.20%   63.12%   21.42%    3.60%    0.77%    6.50%
 scheduler(18)    4.49%    0.20%   63.63%   20.76%    3.54%    0.75%    6.62%
 scheduler(19)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(20)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(21)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.99%
 scheduler(22)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(23)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(24)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%

Stats per type:
         async    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
           aux   20.89%    0.51%    0.00%    0.00%    0.16%    0.00%   78.45%
dirty_cpu_sche    0.00%    0.00%    0.14%   14.06%    0.45%    0.00%   85.35%
dirty_io_sched    0.00%    0.00%   69.85%    0.00%    8.80%    0.00%   21.35%
          poll    0.00%    0.66%    0.00%    0.00%    0.00%    0.00%   99.34%
     scheduler    3.28%    0.15%   47.33%   15.98%    2.60%    0.60%   30.06%

There are 18 schedulers that are only sleeping around 6% of the time, and 12 dirty_io schedulers that are only sleeping 20% of the time ... which amounts to more than 2600% CPU!!!
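Back-of-envelope, using the sleep percentages from the table above:

%% 18 normal schedulers sleeping ~6.7% and 12 dirty_io schedulers sleeping ~21%
%% of the wall clock over the 10s msacc sample.
NormalBusy  = 18 * (1 - 0.067) * 100,  %% ~1679 "%CPU"
DirtyIoBusy = 12 * (1 - 0.21) * 100,   %% ~948 "%CPU"
NormalBusy + DirtyIoBusy.              %% ~2627, against the ~1374% reported by top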

When the node is in this state, although it is reporting a lot of CPU time spent on GC, it isn't clear that GC is actually occurring, as the largest processes by memory usage have large amounts of collectable garbage. It is in this state that we can see the memory usage of the beam on the node increasing.
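A rough way to check the collectable-garbage point, using recon (a sketch only: it forces a GC on the largest processes, and ignores the race where a process exits mid-check):

%% Compare the memory of the largest processes before and after a forced GC.
[begin
     {memory, Before} = erlang:process_info(Pid, memory),
     erlang:garbage_collect(Pid),
     {memory, After} = erlang:process_info(Pid, memory),
     {Pid, {before_gc, Before}, {after_gc, After}}
 end || {Pid, _Mem, _Info} <- recon:proc_count(memory, 5)].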

martinsumner commented 2 years ago

During the same test, at the same time, on a "healthy" node in the cluster:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                         
36026 root      20   0 26.183g 0.017t   5480 S  1145 18.1  11528:10 beam.smp 
(dev12@192.168.6.12)1> msacc:start(10000).
true
(dev12@192.168.6.12)2> msacc:print().
Average thread real-time    :  10002026 us
Accumulated system run-time : 178505482 us
Average scheduler run-time  :   4410064 us

        Thread      aux check_io emulator       gc    other     port    sleep

Stats per thread:
     async( 0)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
     async( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
     async( 2)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
     async( 3)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
       aux( 1)    4.08%    0.34%    0.00%    0.00%    0.11%    0.00%   95.47%
dirty_cpu_( 1)    0.00%    0.00%    0.07%    4.70%    0.20%    0.00%   95.02%
dirty_cpu_( 2)    0.00%    0.00%    0.05%    2.38%    0.13%    0.00%   97.44%
dirty_cpu_( 3)    0.00%    0.00%    0.02%    3.63%    0.18%    0.00%   96.17%
dirty_cpu_( 4)    0.00%    0.00%    0.07%    2.70%    0.16%    0.00%   97.07%
dirty_cpu_( 5)    0.00%    0.00%    0.00%    0.29%    0.02%    0.00%   99.69%
dirty_cpu_( 6)    0.00%    0.00%    0.05%    3.09%    0.14%    0.00%   96.72%
dirty_cpu_( 7)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 8)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 9)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(10)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(11)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(12)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_io_s( 1)    0.00%    0.00%   52.39%    0.00%    4.63%    0.00%   42.98%
dirty_io_s( 2)    0.00%    0.00%   55.88%    0.00%    4.71%    0.00%   39.41%
dirty_io_s( 3)    0.00%    0.00%   53.49%    0.00%    4.67%    0.00%   41.84%
dirty_io_s( 4)    0.00%    0.00%   59.09%    0.00%    5.33%    0.00%   35.59%
dirty_io_s( 5)    0.00%    0.00%   57.04%    0.00%    4.62%    0.00%   38.34%
dirty_io_s( 6)    0.00%    0.00%   51.64%    0.00%    4.71%    0.00%   43.65%
dirty_io_s( 7)    0.00%    0.00%   48.15%    0.00%    4.25%    0.00%   47.61%
dirty_io_s( 8)    0.00%    0.00%   47.58%    0.00%    4.91%    0.00%   47.51%
dirty_io_s( 9)    0.00%    0.00%   58.74%    0.00%    4.87%    0.00%   36.39%
dirty_io_s(10)    0.00%    0.00%   57.07%    0.00%    5.15%    0.00%   37.78%
dirty_io_s(11)    0.00%    0.00%   52.86%    0.00%    4.72%    0.00%   42.42%
dirty_io_s(12)    0.00%    0.00%   52.42%    0.00%    4.63%    0.00%   42.95%
      poll( 0)    0.00%    0.53%    0.00%    0.00%    0.00%    0.00%   99.47%
 scheduler( 1)    2.40%    0.19%   31.46%    7.20%   18.40%    0.86%   39.50%
 scheduler( 2)    2.23%    0.18%   29.31%    6.87%   18.94%    0.83%   41.64%
 scheduler( 3)    2.12%    0.22%   30.32%    7.40%   19.26%    0.73%   39.96%
 scheduler( 4)    2.27%    0.19%   30.15%    7.50%   19.07%    0.78%   40.04%
 scheduler( 5)    2.34%    0.19%   31.03%    7.28%   19.27%    0.84%   39.03%
 scheduler( 6)    2.30%    0.17%   29.93%    7.42%   18.81%    0.68%   40.68%
 scheduler( 7)    2.45%    0.17%   29.16%    7.50%   18.91%    0.70%   41.10%
 scheduler( 8)    2.22%    0.18%   30.32%    6.87%   19.04%    0.66%   40.72%
 scheduler( 9)    2.28%    0.18%   29.19%    7.28%   18.82%    0.79%   41.47%
 scheduler(10)    2.28%    0.18%   28.56%    7.11%   19.17%    0.77%   41.94%
 scheduler(11)    2.24%    0.18%   30.12%    6.96%   18.89%    0.76%   40.85%
 scheduler(12)    2.31%    0.20%   29.76%    7.29%   18.94%    0.73%   40.77%
 scheduler(13)    2.38%    0.21%   30.18%    7.06%   19.19%    0.68%   40.30%
 scheduler(14)    2.25%    0.20%   30.32%    7.28%   19.09%    0.75%   40.11%
 scheduler(15)    2.34%    0.22%   30.58%    7.39%   18.81%    0.72%   39.93%
 scheduler(16)    2.28%    0.21%   29.44%    7.29%   19.11%    0.71%   40.97%
 scheduler(17)    2.30%    0.18%   29.20%    7.43%   18.52%    0.72%   41.65%
 scheduler(18)    2.04%    0.15%   23.48%    5.65%   16.95%    0.59%   51.14%
 scheduler(19)    0.00%    0.00%    0.00%    0.00%    0.02%    0.00%   99.98%
 scheduler(20)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(21)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(22)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(23)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(24)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%

Stats per type:
         async    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
           aux    4.08%    0.34%    0.00%    0.00%    0.11%    0.00%   95.47%
dirty_cpu_sche    0.00%    0.00%    0.02%    1.40%    0.07%    0.00%   98.51%
dirty_io_sched    0.00%    0.00%   53.86%    0.00%    4.77%    0.00%   41.37%
          poll    0.00%    0.53%    0.00%    0.00%    0.00%    0.00%   99.47%
     scheduler    1.71%    0.14%   22.19%    5.37%   14.13%    0.55%   55.91%

On the "healthy" node a lot of the msacc-reported CPU busyness is in other - which includes IO wait time - so accounting for that there seems to be a closer match between the OS-reported and beam-reported CPU (these are 24 vCPU nodes).

martinsumner commented 2 years ago

Prior to reducing the scheduler counts, the problem described above got much worse in some tests - i.e. schedulers busy, but reported CPU not as high, memory use increasing.

In some cases some long-running tasks appeared to almost stop altogether, and memory usage increased rapidly to 40GB.

martinsumner commented 2 years ago

A summary of the current understanding of memory issues.

The following improvements are therefore to be added to 3.0.10:

  1. When a SQN check is processed by a leveled_sst file, it will set a timeout; if there are no further requests before the timeout fires, the caches will be flushed and the process will hibernate. This means that the background journal compaction process will flush, as well as build, caches on infrequently accessed files, prompting churn of memory (see the sketch after this list).

  2. The frequency of "slow offers" (artificial delays on PUTs) will be increased during work backlogs. This frequency will be tuneable, as will the pause length (which is already tuneable).

  3. Garbage collection will now be prompted on both the penciller and the penciller's clerk processes immediately after infrequent events which touch a large number of binaries on the shared heap.

  4. The VM memory settings will be exposed via riak.conf, but no defaults will be changed. Currently it is considered preferable to resolve these issues by changing the code rather than by trying to tune the VM. This is not really an issue with the VM and fragmentation; the fragmentation is a symptom of poor practice within the leveled code.

  5. There MAY be a change to regularly prompt a flush of memory to disk when a backlog occurs, even if no new PUTs arrive.
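An illustrative sketch of what (1) and (3) look like in outline. This is not the actual leveled code: the module, record fields, timeout value and stubbed reply are hypothetical, but blockindex_cache is the cache referred to above.

-module(sst_cache_sketch).
-behaviour(gen_server).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(CACHE_FLUSH_TIMEOUT, 30000). %% hypothetical: flush after 30s of inactivity

-record(state, {blockindex_cache :: term() | undefined}).

init(_Opts) ->
    {ok, #state{blockindex_cache = #{}}}. %% placeholder for the real cache build

%% Answering a SQN check arms a timeout rather than leaving the cache resident.
handle_call({sqn_check, _SQN}, _From, State) ->
    {reply, not_present, State, ?CACHE_FLUSH_TIMEOUT}; %% real code checks its caches
handle_call(_Msg, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% No further request before the timeout: release the cache, prompt a GC
%% (the point of item 3), and hibernate to shrink the process heap.
handle_info(timeout, State) ->
    garbage_collect(),
    {noreply, State#state{blockindex_cache = undefined}, hibernate};
handle_info(_Msg, State) ->
    {noreply, State}.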

martinsumner commented 2 years ago

Items 1 - 4 are implemented in Riak 3.0.10.

Also in 3.0.10, another significant cause of memory issues in leveled was detected - https://github.com/martinsumner/leveled/pull/377.

With these fixes, fix (5) has been deferred - https://github.com/martinsumner/leveled/pull/376.

It does look very likely that the memory issues are not VM-related; they are all, in effect, leveled bugs.

Also in 3.0.10 are a series of fixes to use disk-backed queues rather than pure in-memory queues, to avoid another area where memory can run out of control.

martinsumner commented 1 year ago

Current issue on a production cluster running with 30GB in use according to erlang:memory() but > 130GB in use according to the OS.

The issue is focused on the single-block carrier threshold, with single-block carrier usage (sbcs_usage) running at < 20% across all (48 in this case) eheap allocator instances, e.g.

{{eheap_alloc,15},
  [{sbcs_usage,0.16793810855712812},
   {mbcs_usage,0.5080543553773401},
   {sbcs_block_size,588764160},
   {sbcs_carriers_size,3505840128},
   {mbcs_block_size,42685280},
   {mbcs_carriers_size,84017152}]},

Looking to experiment with https://www.erlang.org/doc/man/erts_alloc.html#M_rsbcst - and to reverse out the earlier recommendation about lowering sbct.
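Rough arithmetic from the instance above, to show the scale of dead space held in single-block carriers:

SbcsBlocks   = 588764160,    %% live block bytes in single-block carriers
SbcsCarriers = 3505840128,   %% bytes mapped for those carriers
SbcsCarriers - SbcsBlocks.   %% ~2.9GB dead in this one allocator instance;
%% if the other 47 instances look similar, this alone would account for most of
%% the ~100GB gap between erlang:memory() and the OS-reported footprint.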