JigaoLuo / duckdb

DuckDB is an in-process SQL OLAP Database Management System
http://www.duckdb.org
MIT License

Benchmark: Prof. Leis's ART with different allocators. #29

Closed JigaoLuo closed 2 years ago

JigaoLuo commented 2 years ago

In this benchmark, I tested the ART with 10M insertions.

Prof. Leis's original ART

~/jigao/duckdb3/third_party/ART$ g++ -O3 -o orginalART orginalART.cpp 
~/jigao/duckdb3/third_party/ART$ ./orginalART 10000000 0
insert,10000000,44.629134
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock,    scale,  IPC, CPUs,  GHz 
 70.64,       196.10,      0.34,       0.01,             0.01,              0.01,          0.03,      21.80, 10000000, 2.78, 1.00, 3.24 

lookup,10000000,99.911006
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock,    scale,  IPC, CPUs,  GHz 
 32.74,       127.08,      0.25,       0.01,             0.00,              0.00,          0.00,      10.00, 10000000, 3.88, 1.00, 3.27 
JigaoLuo commented 2 years ago

ART with std::allocator

~/jigao/duckdb3/third_party/ART$ g++ -O3 -o ART ART.cpp 
~/jigao/duckdb3/third_party/ART$ ./ART 10000000 0
Node4 Size: 56
Node16 Size: 160
Node48 Size: 656
Node256 Size: 2064
insert,10000000,46.728975
cycles, instructions, L1-misses, LLC-misses, branch-misses, task-clock,    scale,  IPC, CPUs,  GHz 
 68.68,       200.81,      0.32,       0.15,          0.03,      19.12, 10000000, 2.92, 1.00, 3.59 

lookup,10000000,94.083193

ART with MallocAllocator

~/jigao/duckdb3/third_party/ART$ g++ -O3 -o ART ART.cpp 
~/jigao/duckdb3/third_party/ART$ ./ART 10000000 0
Node4 Size: 56
Node16 Size: 160
Node48 Size: 656
Node256 Size: 2064
insert,10000000,45.377185
cycles, instructions, L1-misses, LLC-misses, branch-misses, task-clock,    scale,  IPC, CPUs,  GHz 
 70.60,       201.07,      0.32,       0.15,          0.03,      19.66, 10000000, 2.85, 1.00, 3.59 

lookup,10000000,94.054711

ART with PoolAllocator

~/jigao/duckdb3/third_party/ART$ g++ -O3 -o ART ART.cpp 
~/jigao/duckdb3/third_party/ART$ ./ART 10000000 0
Node4 Size: 56
Node16 Size: 160
Node48 Size: 656
Node256 Size: 2064
insert,10000000,46.660719
cycles, instructions, L1-misses, LLC-misses, branch-misses, task-clock,    scale,  IPC, CPUs,  GHz 
 69.38,       196.73,      0.32,       0.15,          0.02,      19.32, 10000000, 2.84, 1.00, 3.59 

lookup,10000000,87.280597

ART with MemoryPool

~/jigao/duckdb3/third_party/ART$ g++ -O3 -o ART ART.cpp 
~/jigao/duckdb3/third_party/ART$ ./ART 10000000 0
Node4 Size: 56
Node16 Size: 160
Node48 Size: 656
Node256 Size: 2064
insert,10000000,45.493587
cycles, instructions, L1-misses, LLC-misses, branch-misses, task-clock,    scale,  IPC, CPUs,  GHz 
 71.29,       195.52,      0.37,       0.15,          0.03,      19.85, 10000000, 2.74, 1.00, 3.59 

lookup,10000000,91.222563
JigaoLuo commented 2 years ago

Valuable readings:

The success or failure of huge page allocation depends on the amount of physically contiguous memory that is present in system at the time of the allocation attempt. If the kernel is unable to allocate huge pages from some nodes in a NUMA system, it will attempt to make up the difference by allocating extra pages on other nodes with sufficient available contiguous memory, if any.
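Because a huge page request can fail when not enough contiguous memory is reserved, a caller may want a fallback to regular pages. The following is a minimal sketch, assuming Linux and 2 MB huge pages; `Mapping` and `map_prefer_huge` are hypothetical names for illustration and are not part of ART.cpp or the allocator used in this thread.

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>

// Hypothetical helper type: records whether the mapping got huge pages.
struct Mapping {
    void*       addr;   // nullptr if both attempts failed
    std::size_t bytes;  // length passed to mmap (needed again for munmap)
    bool        huge;   // true if MAP_HUGETLB succeeded
};

// Try an anonymous huge-page mapping first, then fall back to regular pages.
// Assumption: `bytes` is a multiple of the huge page size (2 MB here).
inline Mapping map_prefer_huge(std::size_t bytes) {
    const int base_flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_HUGETLB
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   base_flags | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) {
        return {p, bytes, true};
    }
#endif
    // No huge pages available (or not supported): use regular pages.
    void* q = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, base_flags, -1, 0);
    if (q == MAP_FAILED) {
        return {nullptr, 0, false};
    }
    return {q, bytes, false};
}
```

Checking the `huge` field after the call shows whether a benchmark run really used huge pages or silently fell back to regular ones.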

Other helpful posts:

Example Programs:

JigaoLuo commented 2 years ago

ART with mmap_allocator

2MB Page: mmap_allocator<uint8_t, page_type::huge_2mb, 0> allocator; with dummy bookkeeping

Dummy bookkeeping: allocate a new page only when the previously allocated page is used up or does not have enough space left for the new node.
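The bookkeeping described above amounts to a bump allocator over mmap'ed pages. Here is a minimal sketch of that idea, assuming Linux; `BumpPageAllocator` is a hypothetical name and this is not the actual `mmap_allocator` used for the numbers below. Huge pages are opt-in through a constructor flag, so the sketch also runs on machines without reserved huge pages.

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical illustration of "dummy bookkeeping": hand out memory from the
// current page and map a fresh page when the request no longer fits.
class BumpPageAllocator {
public:
    explicit BumpPageAllocator(std::size_t page_size = 2u << 20,
                               bool use_huge_pages = false)
        : page_size_(page_size), use_huge_pages_(use_huge_pages) {}

    BumpPageAllocator(const BumpPageAllocator&) = delete;
    BumpPageAllocator& operator=(const BumpPageAllocator&) = delete;

    ~BumpPageAllocator() {
        for (void* p : pages_) munmap(p, page_size_);
    }

    // `bytes` must be in (0, page_size]; `align` must be a power of two.
    void* allocate(std::size_t bytes, std::size_t align = 8) {
        assert(bytes > 0 && bytes <= page_size_);
        std::size_t start = (offset_ + align - 1) & ~(align - 1);
        if (current_ == nullptr || start + bytes > page_size_) {
            // Page used up or too small for this node: the rest is wasted.
            current_ = map_page();
            start = 0;
        }
        offset_ = start + bytes;
        return current_ + start;
    }

    std::size_t pages_mapped() const { return pages_.size(); }

private:
    std::uint8_t* map_page() {
        int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_HUGETLB
        if (use_huge_pages_) flags |= MAP_HUGETLB;  // needs reserved huge pages
#endif
        void* p = mmap(nullptr, page_size_, PROT_READ | PROT_WRITE, flags, -1, 0);
        if (p == MAP_FAILED) throw std::bad_alloc();
        pages_.push_back(p);
        return static_cast<std::uint8_t*>(p);
    }

    std::size_t        page_size_;
    bool               use_huge_pages_;
    std::uint8_t*      current_ = nullptr;  // page currently being filled
    std::size_t        offset_ = 0;         // next free byte in current_
    std::vector<void*> pages_;              // all mapped pages, for cleanup
};
```

Individual nodes are never freed in this sketch; all pages are released together in the destructor, which is what keeps the bookkeeping trivial.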

$ echo 100 | sudo tee /proc/sys/vm/nr_hugepages
100
$ grep Huge /proc/meminfo
AnonHugePages:      2048 kB
ShmemHugePages:        0 kB
HugePages_Total:   100000
HugePages_Free:    100000
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
$ g++ ART.cpp -o ARTnew -O3 -lnuma 
$ ./ARTnew 10000000 0
Node4 Size: 56
Node16 Size: 160
Node48 Size: 656
Node256 Size: 2064
insert,10000000,41.141602
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock,    scale,  IPC, CPUs,  GHz 
 64.51,       185.81,      0.48,       0.01,             0.00,              0.00,          0.03,      23.76, 10000000, 2.88, 1.00, 2.71 

lookup,10000000,102.682025
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock,    scale,  IPC, CPUs,  GHz 
 31.82,       124.97,      0.26,       0.01,             0.00,              0.00,          0.00,       9.73, 10000000, 3.93, 1.00, 3.27

Insertion improves with 2MB pages: 64.51 cycles per insert, versus 68.68 to 71.29 in the runs above. The overall speed-up is still less than 10%.