JigaoLuo opened this issue 2 years ago
Cited from [COADS]
In fact, the order of the recursive subtrees is not important; what is important is that each recursive subtree is laid out in a single segment of memory, and that these segments are stored together without gaps.
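The property in this quote can be sketched with a toy (hypothetical) tree type: each recursive subtree is written into one contiguous segment, and sibling segments follow each other without gaps. This is only an illustration of the layout rule, not the repo's actual allocator.

```cpp
#include <vector>

// Hypothetical minimal tree node, just for illustrating the layout rule.
struct Node {
    int key;
    std::vector<Node*> children;
};

// Write the subtree rooted at `n` into `out` so that every recursive
// subtree occupies one contiguous segment of `out`, and the segments of
// sibling subtrees are stored back to back without gaps.
void layout_subtree(const Node* n, std::vector<Node>& out) {
    out.push_back(*n);                  // a segment starts with its subtree's root
    for (const Node* c : n->children)   // each child's segment follows directly
        layout_subtree(c, out);
}
```

Note that the recursion order over children (and thus the order of the segments) is irrelevant for this property, exactly as the quote says.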
The reason I picked 0.08M (80K) keys is that the ART in this setting fits into one 2 MiB huge page.
$ ./ART 80000 0 z 0
Number of huge pages: 1
Number of nodes: 316
node4_num:1| [0]: 1 |
node16_num:0
node48_num:0
node256_num:315 | [1]: 2 | | [2]: 313 |
lookup,80000,48.623467
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
50.97, 134.96, 1.57, 0.00, 0.00, 0.00, 0.19, 20.56, 10000000, 2.65, 1.00, 2.48
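A back-of-the-envelope check confirms this fits: assuming the node sizes from the ART paper (Node256 = 16 B header + 256 child pointers) and pessimistically counting all 316 nodes above as Node256, the tree is well under 2 MiB.

```cpp
#include <cstddef>

// Node256 size per the ART paper: 16 B header + 256 * 8 B child pointers.
constexpr std::size_t kNode256Bytes  = 16 + 256 * 8;       // 2064 B
constexpr std::size_t kHugePageBytes = 2 * 1024 * 1024;    // 2 MiB

// Upper bound: treat all 316 nodes of the 80K-key tree as Node256.
constexpr std::size_t kTreeBytes = 316 * kNode256Bytes;    // 652224 B

static_assert(kTreeBytes < kHugePageBytes,
              "80K-key dense ART fits in one 2 MiB huge page");
```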
$ ./ART 1000000 0 z 0
Number of huge pages: 6
Number of nodes: 3924
node4_num:0
node16_num:1 | [Level 0]: 1 |
node48_num:0
node256_num:3923 | [Level 1]: 16 | | [Level 2]: 3907 |
lookup,1000000,38.300823
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
64.72, 138.07, 2.81, 0.00, 0.00, 0.00, 0.00, 26.10, 10000000, 2.13, 1.00, 2.48
$ ./ART 80000 2 z 0
Number of huge pages: 1
Number of nodes: 39217
node4_num:19964| [2]: 19649 | | [3]: 315 |
node16_num:3302| [2]: 3302 |
node48_num:0
node256_num:129 | [0]: 1 | | [1]: 128 |
lookup,80000,21.581766
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
114.68, 147.66, 3.45, 0.00, 0.00, 0.00, 0.93, 46.32, 10000000, 1.29, 1.00, 2.48
lookup,80000,50.506096
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
49.08, 129.95, 1.57, 0.00, 0.00, 0.00, 0.18, 19.79, 10000000, 2.65, 1.00, 2.48
lookup,1000000,40.450459
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
61.29, 133.07, 2.82, 0.00, 0.00, 0.00, 0.00, 24.72, 10000000, 2.17, 1.00, 2.48
lookup,80000,22.441458
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
110.37, 142.68, 3.45, 0.00, 0.00, 0.00, 0.91, 44.55, 10000000, 1.29, 1.00, 2.48
A new algorithm is hard to come up with, because `N` can be arbitrary and is not bound to any power of 2. So I decided to form some layouts for the two cases above and check & benchmark them.
I think the Preorder Layout or the Level Order Layout is equivalent to the vEB Layout in this case. (Again, this depends on how the height of subtrees is defined when the height is not a power of 2.)
lookup,80000,50.474491
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
49.10, 129.95, 1.56, 0.00, 0.00, 0.00, 0.18, 19.80, 10000000, 2.65, 1.00, 2.48
lookup,80000,50.334205
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
49.21, 129.94, 1.56, 0.00, 0.00, 0.00, 0.19, 19.86, 10000000, 2.64, 1.00, 2.48
ART with 1M dense keys has only depth 3, but spans 4~6 pages. So I tried 1M dense keys with these two layouts.
lookup,1000000,40.393894
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
61.37, 133.06, 2.81, 0.00, 0.00, 0.00, 0.00, 24.75, 10000000, 2.17, 1.00, 2.48
lookup,1000000,40.499750
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
61.20, 133.06, 2.80, 0.00, 0.00, 0.00, 0.00, 24.68, 10000000, 2.17, 1.00, 2.48
lookup,80000,22.311670
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
110.97, 142.75, 3.49, 0.00, 0.00, 0.00, 0.92, 44.81, 10000000, 1.29, 1.00, 2.48
lookup,80000,22.411528
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
110.49, 142.72, 3.41, 0.00, 0.00, 0.00, 0.91, 44.61, 10000000, 1.29, 1.00, 2.48
I try to draw some conclusions from the simple experiments so far with 1 page (the dense case has up to 6 pages). I believe these experiments can serve as constructive guidelines; cases with 1000 pages are not much different.
In the setting of dense keys, almost all nodes of the ART are Node256, which has a size of 2064 bytes.
For such a huge node, I think cache-line alignment is not necessary [TODO]. For the dense workload, the layout is not helpful:
(The change is in the `lookup` function, but this change to the `lookup` function is not layout-relevant.) I think this conclusion generalizes to dense cases with even larger `N`, because the number of Node256 nodes increases.
Recall that the sparse workload generates ARTs with a complex node distribution.
Number of huge pages: 1
node4_num:5 | [Level 0]: 1 | | [Level 1]: 4 |
node16_num:0
node48_num:16| [Level 2]: 16 |
node256_num:400 | [Level 3]: 400 |
lookup,100000,23.532398
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
105.19, 172.08, 1.98, 0.00, 0.00, 0.00, 1.81, 42.49, 10000000, 1.64, 1.00, 2.48
lookup,100000,23.728901
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
104.33, 172.14, 1.89, 0.00, 0.00, 0.00, 1.79, 42.14, 10000000, 1.65, 1.00, 2.48
lookup,100000,23.693701
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
104.49, 172.12, 1.90, 0.00, 0.00, 0.00, 1.80, 42.20, 10000000, 1.65, 1.00, 2.48
lookup,100000,23.498673
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
105.35, 172.14, 1.92, 0.00, 0.00, 0.00, 1.80, 42.55, 10000000, 1.63, 1.00, 2.48
node4_num:1 | [Level 0]: 1 |
node16_num:2| [Level 1]: 2 |
node48_num:25| [Level 2]: 25 |
node256_num:1000 | [Level 3]: 1000 |
lookup,200000,31.612871
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
78.32, 163.85, 2.35, 0.00, 0.00, 0.00, 0.41, 31.63, 10000000, 2.09, 1.00, 2.48
lookup,200000,32.941871
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
75.20, 163.85, 2.30, 0.00, 0.00, 0.00, 0.39, 30.35, 10000000, 2.18, 1.00, 2.48
lookup,200000,32.711700
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
75.71, 163.85, 2.30, 0.00, 0.00, 0.00, 0.41, 30.56, 10000000, 2.16, 1.00, 2.48
lookup,200000,32.572210
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
76.02, 163.85, 2.30, 0.00, 0.00, 0.00, 0.42, 30.69, 10000000, 2.16, 1.00, 2.48
Node256 has a size of 2064 B, which spans 33 cache lines (2064 = 32 × 64 + 16, so it is not an exact multiple of the 64 B cache line). Given that size, I think it is not necessary to place each Node256 at a 64 B-aligned address.
Cited from the report in the repo:
- veb: For very large $n$, I expected this to be the second fastest, for all the same reasons as the btree, except that the constant $1/4$ is wrong because veb subtrees are not aligned with cache line boundaries (they have size $2^h-1$) and they're not always close to the right size. For example, a subtree of size 31 gets split into a top tree of size 7 and bottom subtrees of size 7, missing the magical 16 by a wide margin.
COADS: Cache-Oblivious Algorithms and Data Structures http://erikdemaine.org/papers/BRICS2002/paper.pdf
Sorted Layout:
Eytzinger Layout: No. I just see no heap indexing such as `2*i+1` and `2*i+2` in ART. (BTree with Internal Eytzinger Layout)
B Tree Layout:
Inorder Traversal & Preorder Traversal & Postorder Traversal & Level Order Traversal: Not cache friendly.
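For contrast, here is a sketch of the `2*i+1` / `2*i+2` heap indexing that the Eytzinger layout is built on and that has no analogue in ART's variable-fanout nodes. The function (a hypothetical helper, not from the repo) permutes a sorted array into Eytzinger order via an in-order walk of the implicit heap:

```cpp
#include <cstddef>
#include <vector>

// Eytzinger (BFS/heap) layout over a sorted array: node i has children
// at indices 2*i+1 and 2*i+2 -- exactly the indexing scheme absent in ART.
std::vector<int> eytzinger(const std::vector<int>& sorted) {
    std::vector<int> out(sorted.size());
    std::size_t pos = 0;
    // An in-order walk of the implicit heap assigns sorted values in order.
    auto fill = [&](auto&& self, std::size_t i) -> void {
        if (i >= out.size()) return;
        self(self, 2 * i + 1);      // left child
        out[i] = sorted[pos++];
        self(self, 2 * i + 2);      // right child
    };
    fill(fill, 0);
    return out;
}
```

For example, the sorted array {1,2,3,4,5,6,7} becomes {4,2,6,1,3,5,7}: the median at index 0, its two "children" next, and so on level by level.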
The vEB layout was proposed by Prokop [Section 10.2]: http://supertech.csail.mit.edu/papers/Prokop99.pdf
Memory Layouts for Binary Search: https://cglab.ca/~morin/misc/arraylayout/, https://cglab.ca/~morin/misc/arraylayout-v2/, https://news.ycombinator.com/item?id=10410676, https://github.com/patmorin/arraylayout, https://github.com/patmorin/arraylayout/blob/master/src/arraylayouts.ipynb
CACHE-OBLIVIOUS B-TREES: http://erikdemaine.org/papers/CacheObliviousBTrees_SICOMP/paper.pdf, http://erikdemaine.org/papers/CacheObliviousBTrees_SICOMP/, http://erikdemaine.org/papers/FOCS2000b/
Cache Oblivious Search Trees via Binary Trees of Small Height: https://cs.au.dk/~gerth/papers/soda02.pdf
Cache-Oblivious Algorithms[No VEB]: https://cs.uwaterloo.ca/~imunro/cs840/Notes16/frigo.pdf
https://jiahai-feng.github.io/posts/cache-oblivious-algorithms/
https://en.algorithmica.org/hpc/external-memory/oblivious/
Beyond binary search: parallel in-place construction of implicit search tree layouts: http://www2.hawaii.edu/~berneyk/publications/thesis.pdf, http://www2.hawaii.edu/~berneyk/publications/18-ipdps.pdf, https://github.com/algoparc/Tree-Layouts
Preserving order in a forest in less than logarithmic time: https://ir.cwi.nl/pub/6890
Design and implementation of an efficient priority queue: https://www.academia.edu/18316334/Design_and_implementation_of_an_efficient_priority_queue https://www.semanticscholar.org/paper/Design-and-implementation-of-an-efficient-priority-Boas-Kaas/c1a0525addef6262ecfb11339432c9d3405a8ed7
https://www.quora.com/How-is-a-van-Emde-Boas-tree-used-in-practice
https://11011110.github.io/blog/2010/01/14/van-emde-boas.html
https://news.ycombinator.com/item?id=9775920
The vEB Layout code strongly relies on a binary tree and computes recursively: https://github.com/patmorin/arraylayout/blob/master/src/veb_array.h
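A sketch of that recursion for a complete binary tree stored with BFS indices, using one common height-splitting convention (top tree of ⌊h/2⌋ levels, bottom subtrees of the remaining levels); note the whole construction assumes binary fanout, which is exactly what ART lacks:

```cpp
#include <cstddef>
#include <vector>

// Emit the van Emde Boas order of a complete binary subtree rooted at
// BFS index `root` with `h` levels (h >= 1). Children of BFS node i are
// 2*i+1 and 2*i+2. Convention here: top tree gets floor(h/2) levels.
void veb_order(std::size_t root, unsigned h, std::vector<std::size_t>& out) {
    if (h == 1) { out.push_back(root); return; }
    unsigned top = h / 2;          // levels in the top tree
    unsigned bottom = h - top;     // levels in each bottom subtree
    veb_order(root, top, out);     // lay out the top tree first...
    // ...then each bottom subtree, contiguously. The descendants of `root`
    // at relative depth `top` occupy BFS indices
    // [(root+1)*2^top - 1, (root+2)*2^top - 1).
    std::size_t first = ((root + 1) << top) - 1;
    std::size_t last  = ((root + 2) << top) - 1;
    for (std::size_t r = first; r < last; ++r)
        veb_order(r, bottom, out);
}
```

For h = 4 (15 nodes) this produces the classic order 0,1,2 | 3,7,8 | 4,9,10 | 5,11,12 | 6,13,14: a top triple followed by four contiguous bottom triples.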
However, ART on Huge Pages is different. For a `uint64_t` key:
Completeness of ART
ART is hard to make complete. The ART itself is not comparison-based, and an empty byte slot remains if the corresponding element is not inserted into the index.
The vEB layout and other comparison-based trees can easily be complete if `N` matches; any missing number does not hurt the completeness of the index.
Height of ART
The max height for a `uint64_t` key is 8. There is no direct dependency between depth and `N`. The height (and complexity) of radix trees depends on the length of the keys but in general not on the number of elements in the tree. Two additional techniques, path compression and lazy expansion, allow ART to efficiently index long keys by collapsing nodes and thereby decreasing the tree height. Figure 17 [ART] shows that the height of ART is less than 10 in TPC-C. Figures 2 and 3 [ART] are also height-relevant. [START] reduces ART's overall height to improve performance.
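The height bound of 8 comes from slicing a `uint64_t` key into one byte per tree level; a minimal sketch (the helper name is hypothetical):

```cpp
#include <array>
#include <cstdint>

// Split a 64-bit key into the 8 radix digits (one byte per tree level,
// most-significant byte first) that bound ART's height at 8.
std::array<uint8_t, 8> key_bytes(uint64_t key) {
    std::array<uint8_t, 8> b{};
    for (int level = 0; level < 8; ++level)
        b[level] = static_cast<uint8_t>(key >> (8 * (7 - level)));
    return b;
}
```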
Examples
Dense
1M Dense Keys
10M Dense Keys
100M Dense Keys
Sparse
1M Sparse Keys
10M Sparse Keys
100M Sparse Keys
Empirical Knowledge
Dense
Almost all nodes are Node256. The depth of the tree is less than 4, since 4 byte-levels already cover 2^(4*8) = 2^32 = 4294967296 ≈ 4294M keys.
Sparse
Levels 0 and 1 consist entirely of Node256; the bottom level has almost no Node256. The depth of the tree is more than 4.