acts-project / vecmem

Vectorised data model base and helper classes.
https://acts-project.github.io/vecmem/
Mozilla Public License 2.0
19 stars 13 forks source link

Make several improvements to binary page allocator #299

Closed stephenswat closed 1 month ago

stephenswat commented 1 month ago

This commit makes two main changes to improve the performance of the binary page memory resource.

Firstly, it makes the allocator more eager to unsplit (i.e. merge) pages so that they can be reused for new allocation, which should reduce the memory footprint of when using very large allocations.

Secondly, it sets a minimum size of allocations per superpage, which depends on the size of the superpage. This prevents very small allocations from ending up in massive blocks (increasing fragmentation) and also decreases the total number of pages (increasing performance).

stephenswat commented 1 month ago

Performance drops a fair bit for low allocation sizes and increases for larger ones:

Before:

BenchmarkBinaryPage/1                         12.6 ns         12.5 ns     54232577
BenchmarkBinaryPage/2                         12.4 ns         12.4 ns     56375264
BenchmarkBinaryPage/4                         12.8 ns         12.8 ns     54276274
BenchmarkBinaryPage/8                         13.2 ns         13.2 ns     53022963
BenchmarkBinaryPage/16                        13.1 ns         13.1 ns     52296277
BenchmarkBinaryPage/32                        12.8 ns         12.8 ns     54326859
BenchmarkBinaryPage/64                        13.5 ns         13.5 ns     51564964
BenchmarkBinaryPage/128                       13.7 ns         13.7 ns     50968182
BenchmarkBinaryPage/256                       14.1 ns         14.1 ns     49778321
BenchmarkBinaryPage/512                       14.5 ns         14.5 ns     48230900
BenchmarkBinaryPage/1024                      14.9 ns         14.9 ns     46971943
BenchmarkBinaryPage/2048                      14.9 ns         14.9 ns     46973955
BenchmarkBinaryPage/4096                      15.0 ns         15.0 ns     46759439
BenchmarkBinaryPage/8192                      15.0 ns         15.0 ns     46865474
BenchmarkBinaryPage/16384                     14.9 ns         14.9 ns     47017818
BenchmarkBinaryPage/32768                     14.9 ns         14.9 ns     47075989
BenchmarkBinaryPage/65536                     13.4 ns         13.4 ns     46950947
BenchmarkBinaryPage/131072                    12.8 ns         12.8 ns     54544558
BenchmarkBinaryPage/262144                    12.8 ns         12.8 ns     54600546
BenchmarkBinaryPage/524288                    12.8 ns         12.8 ns     54622645
BenchmarkBinaryPage/1048576                   13.2 ns         13.2 ns     53119672
BenchmarkBinaryPage/2097152                   13.6 ns         13.6 ns     51571160
BenchmarkBinaryPage/4194304                   13.6 ns         13.6 ns     51437233
BenchmarkBinaryPage/8388608                   14.0 ns         14.0 ns     50019719
BenchmarkBinaryPage/16777216                  13.9 ns         13.9 ns     50611850
BenchmarkBinaryPage/33554432                  13.6 ns         13.6 ns     52128892
BenchmarkBinaryPage/67108864                  11.4 ns         11.4 ns     50334402
BenchmarkBinaryPage/134217728                 11.1 ns         11.1 ns     63165162
BenchmarkBinaryPage/268435456                 11.5 ns         11.5 ns     60714333
BenchmarkBinaryPage/536870912                 11.5 ns         11.5 ns     60770764
BenchmarkBinaryPage/1073741824                11.9 ns         11.9 ns     58579526
BenchmarkBinaryPage/2147483648                12.2 ns         12.2 ns     55850644
BenchmarkBinaryPage/4294967296                12.4 ns         12.4 ns     54779783

After:

-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
BenchmarkBinaryPage/1                 167 ns          167 ns      4104743
BenchmarkBinaryPage/2                 169 ns          169 ns      4009150
BenchmarkBinaryPage/4                 170 ns          170 ns      3971936
BenchmarkBinaryPage/8                 169 ns          169 ns      4094620
BenchmarkBinaryPage/16                171 ns          171 ns      3965085
BenchmarkBinaryPage/32                170 ns          170 ns      4066688
BenchmarkBinaryPage/64                170 ns          170 ns      3978674
BenchmarkBinaryPage/128               176 ns          176 ns      4061772
BenchmarkBinaryPage/256               178 ns          178 ns      4004071
BenchmarkBinaryPage/512               186 ns          186 ns      3645495
BenchmarkBinaryPage/1024              181 ns          181 ns      3873270
BenchmarkBinaryPage/2048              180 ns          180 ns      3828041
BenchmarkBinaryPage/4096              185 ns          185 ns      3845810
BenchmarkBinaryPage/8192              111 ns          110 ns      6328662
BenchmarkBinaryPage/16384            66.1 ns         66.1 ns     10167867
BenchmarkBinaryPage/32768            45.7 ns         45.7 ns     15481878
BenchmarkBinaryPage/65536            33.4 ns         33.4 ns     20748067
BenchmarkBinaryPage/131072           24.8 ns         24.8 ns     28547759
BenchmarkBinaryPage/262144           18.4 ns         18.4 ns     38280071
BenchmarkBinaryPage/524288           14.2 ns         14.2 ns     48774345
BenchmarkBinaryPage/1048576          10.2 ns         10.2 ns     69163315
BenchmarkBinaryPage/2097152          10.4 ns         10.4 ns     67143905
BenchmarkBinaryPage/4194304          10.5 ns         10.5 ns     66245786
BenchmarkBinaryPage/8388608          10.7 ns         10.7 ns     64700648
BenchmarkBinaryPage/16777216         11.1 ns         11.1 ns     64294237
BenchmarkBinaryPage/33554432         11.3 ns         11.3 ns     61304237
BenchmarkBinaryPage/67108864         12.2 ns         12.2 ns     49836666

Note that this benchmark is very antithetical to how normal allocations go, because it constantly allocates a piece of memory and then deallocates it; i.e. no more than 1 allocation ever lives at the same time. Real world performance should be much less impacted.