basho / riak_kv

Riak Key/Value Store
Apache License 2.0

Memory fragmentation #1820

Open martinsumner opened 2 years ago

martinsumner commented 2 years ago

There have been reports from users of Riak suffering from memory fragmentation on large clusters with the leveled backend.

This appears to be an issue with both OTP 22.3 and the Basho fork of OTP R16.

The primary indicator is:

recon_alloc:memory(usage).

which on affected nodes can be between 50% and 80%.

This is not universal; some leveled/Riak clusters do not see it. However, on clusters that do see it there is limited mitigation other than restarting nodes. The excess memory is taken from the page cache, and so can have a direct impact on performance. In extreme cases unexpected and sudden memory exhaustion has been observed.
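For reference, fragmentation can be quantified from a remote shell attached to the node, assuming the recon application is available (as it is in recent Riak builds). A minimal sketch:

Used      = recon_alloc:memory(used),       %% bytes in use by the VM's allocators
Allocated = recon_alloc:memory(allocated),  %% bytes the allocators have mapped from the OS
Used / Allocated.                           %% the same ratio recon_alloc:memory(usage) reports
%% A result of 0.5-0.8 means 20-50% of the memory mapped by the BEAM is
%% allocator overhead/fragmentation rather than live data.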

Horizontal scaling currently appears to offer better mitigation than vertical scaling.

The fragmentation is primarily in the eheap_alloc and binary_alloc allocators and is associated with inefficient use of multiblock carriers: the carriers never seem to transition to empty, and so are never reclaimed.

martinsumner commented 2 years ago

Changing the single-block carrier threshold (sbct) on the eheap_alloc to a lower value has a very negative impact. This change does not help:

erlang.eheap_memory.sbct = 128
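(For reference, this riak.conf setting presumably maps to the emulator flag +MHsbct, which is expressed in kilobytes. The threshold actually in effect can be confirmed with something like the sketch below; the exact layout of the system_info/1 result differs slightly between OTP releases, and the reported value may be in bytes rather than KB.)

%% Report the sbct value each eheap allocator instance is running with.
[{instance, N, proplists:lookup(sbct, proplists:get_value(options, Info, []))}
 || {instance, N, Info} <- erlang:system_info({allocator, eheap_alloc})].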

recon_alloc:memory(usage).
0.34210991984944206
(dev11@192.168.6.11)5> erlang:memory().          
[{total,7731430560},
 {processes,6054481520},
 {processes_used,6054116632},
 {system,1676949040},
 {atom,1017017},
 {atom_used,1009277},
 {binary,1307030592},
 {code,15398935},
 {ets,300506192}]
(dev11@192.168.6.11)6> recon_alloc:memory(allocated).
22463975424
(dev11@192.168.6.11)7>  rp(recon_alloc:fragmentation(current)).
[{{eheap_alloc,7},
  [{sbcs_usage,0.21770690720393554},
   {mbcs_usage,0.6661469330658784},
   {sbcs_block_size,384285504},
   {sbcs_carriers_size,1765150720},
   {mbcs_block_size,16152944},
   {mbcs_carriers_size,24248320}]},
 {{eheap_alloc,23},
  [{sbcs_usage,0.23321187683249425},
   {mbcs_usage,0.5346889303188131},
   {sbcs_block_size,249886832},
   {sbcs_carriers_size,1071501312},
   {mbcs_block_size,6938192},
   {mbcs_carriers_size,12976128}]},
 {{eheap_alloc,1},
...
martinsumner commented 2 years ago

Testing shows that in some cases either reducing lmbcs (in this case to 1280KB), or setting ramv to true for both the binary and eheap allocators, can reduce the memory footprint of Riak by better controlling fragmentation:

(chart: MemoryFragment_Memory)

However, the change to enable ramv comes at a CPU cost (as expected):

(chart: MemoryFragment_CPU)
martinsumner commented 2 years ago

The memory manipulation had an impact, but a relatively minor one. Most of the memory being used by the BEAM in this test is a result of leveled consuming memory for caches. Testing a change to leveled_sst to periodically release the blockindex_cache when not busy (and to add more GC triggers within the code) shows that changing SST caching makes a bigger difference:

(chart: MemoryCompare_HostMemFootprint)

GET requests are unevenly spread around the cluster with leveled: the GET request is directed to the node which responded fastest to the preceding HEAD request. So by examining how GET requests were distributed, we can see which nodes were performing better during the test:

(chart: MemoryCompare_GetShare)

Now looking at CPU by node, there is very little difference, so the "sst+ramv" node was doing more work and doing that work with relative efficiency:

(chart: MemoryCompare_CPULoad)

Following this point in the test, one node had issues, and scanning of the full store for either intra- or inter-cluster AAE increased. The nodes with the sst change then showed higher disk utilisation (as block index caches were fetched from disk), but maintained their memory advantage. The hypothesis is that releasing and regaining the cache has disk costs, but reduces fragmentation.

hmmr commented 2 years ago

Wouldn't it be appropriate, in addition to exposing various parameters affecting various allocators (https://github.com/basho/riak/tree/mas-i1820-controls), to expose flags to enable/disable specific allocators as such? My reasoning is that, since

When an allocator is disabled, sys_alloc is used instead of the disabled allocator.

we can hope to minimize any adversarial effects of having multiple allocators in active use simultaneously. I can imagine there exist allocation patterns (say, of binaries vs whatever objects the VM thinks are 'short-lived') that will tend to induce fragmentation more easily/reliably. Disabling those special-purpose allocators would then make it easier for the default allocator (sys_alloc) to deal with fragmentation, as it would see a more complete map of allocated blocks and free space. The default allocator maps to libc malloc, which I think has seen sufficient polish and fragmentation-proofing (but don't hold me to that).
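For what it's worth, which alloc_util allocators are present and enabled on a node can be checked directly; erlang:system_info({allocator, A}) returns false for a disabled allocator, in which case those allocations fall through to sys_alloc/malloc. A small sketch:

%% List the alloc_util allocators and whether each is currently enabled.
[{A, erlang:system_info({allocator, A}) =/= false}
 || A <- erlang:system_info(alloc_util_allocators)].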

martinsumner commented 2 years ago

As well as experimenting with memory settings, I've also done some experimentation with scheduler settings. It was noticeable that with changes to the scheduler settings I was getting variances in memory usage ... and sometimes improvements when compared to the different memory settings.

The scheduler changes have shown that, by default, too many schedulers are started, and this causes a huge number of context switches and consequently higher CPU usage (and presumably inefficient use of memory caches).
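For context, the scheduler configuration a node is actually running with can be snapshotted with standard system_info/1 calls, e.g.:

%% Schedulers configured vs. online, against the logical CPUs visible to the VM.
[{schedulers, erlang:system_info(schedulers)},
 {schedulers_online, erlang:system_info(schedulers_online)},
 {dirty_cpu_schedulers_online, erlang:system_info(dirty_cpu_schedulers_online)},
 {dirty_io_schedulers, erlang:system_info(dirty_io_schedulers)},
 {logical_processors_available, erlang:system_info(logical_processors_available)}].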

What is interesting here is that the non-default settings Basho introduced to counter scheduler-collapse problems in R16 now seem to be causing issues, and that we get better behaviour with the current stock Erlang settings.

I have some more tests to run before I have clear answers, but currently I'm considering whether supporting changes here may cause operational overheads ... and whether overall we're better off sticking to the well-tested BEAM defaults for whatever version we're running. When I look at all factors combined (memory, CPU use, latency, CPU efficiency), the default settings start to look more desirable.

Will post-up results as I get them.

@hmmr - with regards to exposing the disabling of allocators, I'm not sure. I might not expose/test that at this stage ... and in general, if moving towards rather than away from the BEAM defaults looks like the right way to go, I would be cautious about doing it at all.

martinsumner commented 2 years ago

One interesting thing that has been seen: in this test, 1 of the 4 nodes is put under greater pressure than the others. In this case we get a strange situation where msacc shows all schedulers as very busy, but the OS does not show the equivalent CPU busyness.

So on that node, top reports the equivalent of about 14 busy CPUs:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                         
 8589 root      20   0 24.686g 0.015t   5476 S  1374 16.5  12725:00 beam.smp   

But looking at microstate accounting at this time:

(dev11@192.168.6.11)1> msacc:start(10000).
true
(dev11@192.168.6.11)2> msacc:print().
Average thread real-time    :  10005371 us
Accumulated system run-time : 282190754 us
Average scheduler run-time  :   6997919 us

        Thread      aux check_io emulator       gc    other     port    sleep

Stats per thread:
     async( 0)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
     async( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
     async( 2)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
     async( 3)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
       aux( 1)   20.89%    0.51%    0.00%    0.00%    0.16%    0.00%   78.45%
dirty_cpu_( 1)    0.00%    0.00%    0.29%   26.42%    0.87%    0.00%   72.42%
dirty_cpu_( 2)    0.00%    0.00%    0.29%   34.74%    1.16%    0.00%   63.80%
dirty_cpu_( 3)    0.00%    0.00%    0.14%   22.66%    0.69%    0.00%   76.52%
dirty_cpu_( 4)    0.00%    0.00%    0.18%   27.57%    0.87%    0.00%   71.38%
dirty_cpu_( 5)    0.00%    0.00%    0.31%   29.68%    0.91%    0.00%   69.10%
dirty_cpu_( 6)    0.00%    0.00%    0.45%   27.61%    0.95%    0.00%   70.99%
dirty_cpu_( 7)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 8)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 9)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(10)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(11)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(12)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_io_s( 1)    0.00%    0.00%   69.90%    0.00%    9.17%    0.00%   20.93%
dirty_io_s( 2)    0.00%    0.00%   69.71%    0.00%    8.58%    0.00%   21.72%
dirty_io_s( 3)    0.00%    0.00%   69.07%    0.00%    9.41%    0.00%   21.52%
dirty_io_s( 4)    0.00%    0.00%   69.03%    0.00%    9.05%    0.00%   21.92%
dirty_io_s( 5)    0.00%    0.00%   70.62%    0.00%    8.19%    0.00%   21.19%
dirty_io_s( 6)    0.00%    0.00%   70.78%    0.00%    8.91%    0.00%   20.31%
dirty_io_s( 7)    0.00%    0.00%   70.06%    0.00%    8.72%    0.00%   21.22%
dirty_io_s( 8)    0.00%    0.00%   69.54%    0.00%    8.63%    0.00%   21.84%
dirty_io_s( 9)    0.00%    0.00%   69.41%    0.00%    8.77%    0.00%   21.83%
dirty_io_s(10)    0.00%    0.00%   69.69%    0.00%    8.96%    0.00%   21.35%
dirty_io_s(11)    0.00%    0.00%   69.89%    0.00%    8.32%    0.00%   21.79%
dirty_io_s(12)    0.00%    0.00%   70.52%    0.00%    8.87%    0.00%   20.62%
      poll( 0)    0.00%    0.66%    0.00%    0.00%    0.00%    0.00%   99.34%
 scheduler( 1)    4.48%    0.19%   62.99%   21.05%    3.40%    0.80%    7.08%
 scheduler( 2)    4.86%    0.19%   62.64%   21.34%    3.53%    0.81%    6.62%
 scheduler( 3)    4.38%    0.20%   63.29%   21.45%    3.36%    0.81%    6.52%
 scheduler( 4)    4.23%    0.19%   62.81%   21.40%    3.55%    0.86%    6.96%
 scheduler( 5)    4.30%    0.21%   63.01%   21.36%    3.53%    0.76%    6.83%
 scheduler( 6)    4.32%    0.22%   64.29%   20.29%    3.36%    0.66%    6.86%
 scheduler( 7)    4.48%    0.23%   63.57%   20.68%    3.45%    0.89%    6.70%
 scheduler( 8)    4.31%    0.18%   63.17%   21.52%    3.36%    0.79%    6.68%
 scheduler( 9)    4.14%    0.22%   63.84%   20.83%    3.42%    0.88%    6.67%
 scheduler(10)    4.14%    0.19%   63.47%   21.39%    3.48%    0.78%    6.54%
 scheduler(11)    4.30%    0.19%   63.75%   20.51%    3.43%    0.80%    7.03%
 scheduler(12)    4.35%    0.22%   62.91%   21.62%    3.52%    0.77%    6.61%
 scheduler(13)    4.19%    0.24%   61.97%   21.96%    3.63%    0.97%    7.03%
 scheduler(14)    4.74%    0.23%   61.25%   22.48%    3.62%    0.82%    6.87%
 scheduler(15)    4.27%    0.18%   63.32%   21.58%    3.32%    0.71%    6.63%
 scheduler(16)    4.24%    0.19%   62.98%   21.84%    3.33%    0.82%    6.59%
 scheduler(17)    4.39%    0.20%   63.12%   21.42%    3.60%    0.77%    6.50%
 scheduler(18)    4.49%    0.20%   63.63%   20.76%    3.54%    0.75%    6.62%
 scheduler(19)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(20)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(21)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.99%
 scheduler(22)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(23)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(24)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%

Stats per type:
         async    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
           aux   20.89%    0.51%    0.00%    0.00%    0.16%    0.00%   78.45%
dirty_cpu_sche    0.00%    0.00%    0.14%   14.06%    0.45%    0.00%   85.35%
dirty_io_sched    0.00%    0.00%   69.85%    0.00%    8.80%    0.00%   21.35%
          poll    0.00%    0.66%    0.00%    0.00%    0.00%    0.00%   99.34%
     scheduler    3.28%    0.15%   47.33%   15.98%    2.60%    0.60%   30.06%

There are 18 schedulers that are only sleeping around 6% of the time, and 12 dirty_io schedulers that are only sleeping 20% of the time ... which amounts to more than 2600% CPU!!!
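Back-of-envelope, using the sleep percentages from the table above:

%% 18 normal schedulers sleeping ~6.7% and 12 dirty_io schedulers sleeping ~21%
%% of the wall clock over the 10s msacc sample.
NormalBusy  = 18 * (1 - 0.067) * 100,  %% ~1679 "%CPU"
DirtyIoBusy = 12 * (1 - 0.21) * 100,   %% ~948 "%CPU"
NormalBusy + DirtyIoBusy.              %% ~2627, against the ~1374% reported by top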

When the node is in this state, although it is reporting a lot of CPU time spent on GC, it isn't clear that GC is actually occurring, as the largest processes by memory usage have large amounts of collectable garbage. It is in this state that we can see the memory usage of the beam on the node increasing.
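A rough way to check the collectable-garbage point, using recon (a sketch only: it forces a GC on the largest processes, and ignores the race where a process exits mid-check):

%% Compare the memory of the largest processes before and after a forced GC.
[begin
     {memory, Before} = erlang:process_info(Pid, memory),
     erlang:garbage_collect(Pid),
     {memory, After} = erlang:process_info(Pid, memory),
     {Pid, {before_gc, Before}, {after_gc, After}}
 end || {Pid, _Mem, _Info} <- recon:proc_count(memory, 5)].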

martinsumner commented 2 years ago

During the same test, at the same time, on a "healthy" node in the cluster:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                         
36026 root      20   0 26.183g 0.017t   5480 S  1145 18.1  11528:10 beam.smp 
(dev12@192.168.6.12)1> msacc:start(10000).
true
(dev12@192.168.6.12)2> msacc:print().
Average thread real-time    :  10002026 us
Accumulated system run-time : 178505482 us
Average scheduler run-time  :   4410064 us

        Thread      aux check_io emulator       gc    other     port    sleep

Stats per thread:
     async( 0)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
     async( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
     async( 2)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
     async( 3)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
       aux( 1)    4.08%    0.34%    0.00%    0.00%    0.11%    0.00%   95.47%
dirty_cpu_( 1)    0.00%    0.00%    0.07%    4.70%    0.20%    0.00%   95.02%
dirty_cpu_( 2)    0.00%    0.00%    0.05%    2.38%    0.13%    0.00%   97.44%
dirty_cpu_( 3)    0.00%    0.00%    0.02%    3.63%    0.18%    0.00%   96.17%
dirty_cpu_( 4)    0.00%    0.00%    0.07%    2.70%    0.16%    0.00%   97.07%
dirty_cpu_( 5)    0.00%    0.00%    0.00%    0.29%    0.02%    0.00%   99.69%
dirty_cpu_( 6)    0.00%    0.00%    0.05%    3.09%    0.14%    0.00%   96.72%
dirty_cpu_( 7)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 8)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 9)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(10)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(11)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(12)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_io_s( 1)    0.00%    0.00%   52.39%    0.00%    4.63%    0.00%   42.98%
dirty_io_s( 2)    0.00%    0.00%   55.88%    0.00%    4.71%    0.00%   39.41%
dirty_io_s( 3)    0.00%    0.00%   53.49%    0.00%    4.67%    0.00%   41.84%
dirty_io_s( 4)    0.00%    0.00%   59.09%    0.00%    5.33%    0.00%   35.59%
dirty_io_s( 5)    0.00%    0.00%   57.04%    0.00%    4.62%    0.00%   38.34%
dirty_io_s( 6)    0.00%    0.00%   51.64%    0.00%    4.71%    0.00%   43.65%
dirty_io_s( 7)    0.00%    0.00%   48.15%    0.00%    4.25%    0.00%   47.61%
dirty_io_s( 8)    0.00%    0.00%   47.58%    0.00%    4.91%    0.00%   47.51%
dirty_io_s( 9)    0.00%    0.00%   58.74%    0.00%    4.87%    0.00%   36.39%
dirty_io_s(10)    0.00%    0.00%   57.07%    0.00%    5.15%    0.00%   37.78%
dirty_io_s(11)    0.00%    0.00%   52.86%    0.00%    4.72%    0.00%   42.42%
dirty_io_s(12)    0.00%    0.00%   52.42%    0.00%    4.63%    0.00%   42.95%
      poll( 0)    0.00%    0.53%    0.00%    0.00%    0.00%    0.00%   99.47%
 scheduler( 1)    2.40%    0.19%   31.46%    7.20%   18.40%    0.86%   39.50%
 scheduler( 2)    2.23%    0.18%   29.31%    6.87%   18.94%    0.83%   41.64%
 scheduler( 3)    2.12%    0.22%   30.32%    7.40%   19.26%    0.73%   39.96%
 scheduler( 4)    2.27%    0.19%   30.15%    7.50%   19.07%    0.78%   40.04%
 scheduler( 5)    2.34%    0.19%   31.03%    7.28%   19.27%    0.84%   39.03%
 scheduler( 6)    2.30%    0.17%   29.93%    7.42%   18.81%    0.68%   40.68%
 scheduler( 7)    2.45%    0.17%   29.16%    7.50%   18.91%    0.70%   41.10%
 scheduler( 8)    2.22%    0.18%   30.32%    6.87%   19.04%    0.66%   40.72%
 scheduler( 9)    2.28%    0.18%   29.19%    7.28%   18.82%    0.79%   41.47%
 scheduler(10)    2.28%    0.18%   28.56%    7.11%   19.17%    0.77%   41.94%
 scheduler(11)    2.24%    0.18%   30.12%    6.96%   18.89%    0.76%   40.85%
 scheduler(12)    2.31%    0.20%   29.76%    7.29%   18.94%    0.73%   40.77%
 scheduler(13)    2.38%    0.21%   30.18%    7.06%   19.19%    0.68%   40.30%
 scheduler(14)    2.25%    0.20%   30.32%    7.28%   19.09%    0.75%   40.11%
 scheduler(15)    2.34%    0.22%   30.58%    7.39%   18.81%    0.72%   39.93%
 scheduler(16)    2.28%    0.21%   29.44%    7.29%   19.11%    0.71%   40.97%
 scheduler(17)    2.30%    0.18%   29.20%    7.43%   18.52%    0.72%   41.65%
 scheduler(18)    2.04%    0.15%   23.48%    5.65%   16.95%    0.59%   51.14%
 scheduler(19)    0.00%    0.00%    0.00%    0.00%    0.02%    0.00%   99.98%
 scheduler(20)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(21)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(22)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(23)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler(24)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%

Stats per type:
         async    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
           aux    4.08%    0.34%    0.00%    0.00%    0.11%    0.00%   95.47%
dirty_cpu_sche    0.00%    0.00%    0.02%    1.40%    0.07%    0.00%   98.51%
dirty_io_sched    0.00%    0.00%   53.86%    0.00%    4.77%    0.00%   41.37%
          poll    0.00%    0.53%    0.00%    0.00%    0.00%    0.00%   99.47%
     scheduler    1.71%    0.14%   22.19%    5.37%   14.13%    0.55%   55.91%

On the "healthy" node a lot of the msacc-reported CPU busyness is in other - which includes IO wait time - so accounting for that there seems to be a closer match between the OS-reported and beam-reported CPU (these are 24 vCPU nodes).

martinsumner commented 2 years ago

Prior to reducing the scheduler counts, the problem described above got much worse in some tests - i.e. schedulers busy, but reported CPU not as high, memory use increasing.

In some cases some long-running tasks appeared to almost stop altogether, and memory usage increased rapidly to 40GB.

martinsumner commented 2 years ago

A summary of the current understanding of memory issues.

The following improvements are therefore to be added to 3.0.10:

  1. When a SQN check is processed by a leveled_sst file, it will set a timeout; if there are no further requests before the timeout fires, the caches will be flushed and the process will hibernate. This means that the background journal compaction process will flush, as well as build, caches on infrequently accessed files, prompting churn of memory (see the sketch after this list).

  2. The frequency of "slow offers" (artificial delays on PUTs) will be increased during work backlogs. This frequency will be tuneable, as will the pause length (which is already tuneable).

  3. Garbage collection will now be prompted on both the penciller and the penciller's clerk processes immediately after infrequent events which touch a large number of binaries on the shared heap.

  4. The VM memory settings will be exposed via riak.conf, but no defaults will be changed. Currently it is considered preferable to resolve these issues by changing the code rather than by trying to tune the VM. This is not really an issue with the VM and fragmentation; the fragmentation is a symptom of poor practice within the leveled code.

  5. There MAY be a change to regularly prompt a flush of memory to disk when a backlog occurs, even if no new PUTs arrive.
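An illustrative sketch of what (1) and (3) look like in outline. This is not the actual leveled code: the module, record fields, timeout value and stubbed reply are hypothetical, but blockindex_cache is the cache referred to above.

-module(sst_cache_sketch).
-behaviour(gen_server).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(CACHE_FLUSH_TIMEOUT, 30000). %% hypothetical: flush after 30s of inactivity

-record(state, {blockindex_cache :: term() | undefined}).

init(_Opts) ->
    {ok, #state{blockindex_cache = #{}}}. %% placeholder for the real cache build

%% Answering a SQN check arms a timeout rather than leaving the cache resident.
handle_call({sqn_check, _SQN}, _From, State) ->
    {reply, not_present, State, ?CACHE_FLUSH_TIMEOUT}; %% real code checks its caches
handle_call(_Msg, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% No further request before the timeout: release the cache, prompt a GC
%% (the point of item 3), and hibernate to shrink the process heap.
handle_info(timeout, State) ->
    garbage_collect(),
    {noreply, State#state{blockindex_cache = undefined}, hibernate};
handle_info(_Msg, State) ->
    {noreply, State}.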

martinsumner commented 2 years ago

Items 1 - 4 are implemented in Riak 3.0.10.

Also in 3.0.10, another significant cause of memory issues in leveled was detected - https://github.com/martinsumner/leveled/pull/377.

With these fixes, fix (5) has been deferred - https://github.com/martinsumner/leveled/pull/376.

It does look very likely that the memory issues are not VM-related; they are all, in effect, leveled bugs.

Also in 3.0.10 are a series of fixes to use disk-backed queues rather than pure in-memory queues, to avoid another area where memory can run out of control.

martinsumner commented 1 year ago

Current issue on a production cluster running with 30GB in use according to erlang:memory() but > 130GB in use according to the OS.

The issue is focused on the single-block carrier threshold, with single-block carrier usage (sbcs_usage) running at < 20% across all (48 in this case) eheap allocator instances, e.g.

{{eheap_alloc,15},
  [{sbcs_usage,0.16793810855712812},
   {mbcs_usage,0.5080543553773401},
   {sbcs_block_size,588764160},
   {sbcs_carriers_size,3505840128},
   {mbcs_block_size,42685280},
   {mbcs_carriers_size,84017152}]},

Looking to experiment with https://www.erlang.org/doc/man/erts_alloc.html#M_rsbcst - and to reverse out the earlier recommendation about lowering sbct.
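Rough arithmetic from the instance above, to show the scale of dead space held in single-block carriers:

SbcsBlocks   = 588764160,    %% live block bytes in single-block carriers
SbcsCarriers = 3505840128,   %% bytes mapped for those carriers
SbcsCarriers - SbcsBlocks.   %% ~2.9GB dead in this one allocator instance;
%% if the other 47 instances look similar, this alone would account for most of
%% the ~100GB gap between erlang:memory() and the OS-reported footprint.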