JigaoLuo opened this issue 2 years ago
Cited from [COADS]
In fact, the order of the recursive subtrees is not important; what is important is that each recursive subtree is laid out in a single segment of memory, and that these segments are stored together without gaps.
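The property in this quote can be sketched with a toy (hypothetical) tree type: each recursive subtree is written into one contiguous segment, and sibling segments follow each other without gaps. This is only an illustration of the layout rule, not the repo's actual allocator.

```cpp
#include <vector>

// Hypothetical minimal tree node, just for illustrating the layout rule.
struct Node {
    int key;
    std::vector<Node*> children;
};

// Write the subtree rooted at `n` into `out` so that every recursive
// subtree occupies one contiguous segment of `out`, and the segments of
// sibling subtrees are stored back to back without gaps.
void layout_subtree(const Node* n, std::vector<Node>& out) {
    out.push_back(*n);                  // a segment starts with its subtree's root
    for (const Node* c : n->children)   // each child's segment follows directly
        layout_subtree(c, out);
}
```

Note that the recursion order over children (and thus the order of the segments) is irrelevant for this property, exactly as the quote says.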
The reason I picked 0.08M (80K) keys is that the ART in this setting fits into one 2 MiB huge page.
$ ./ART 80000 0 z 0
Number of huge pages: 1
Number of nodes: 316
node4_num:1| [0]: 1 |
node16_num:0
node48_num:0
node256_num:315 | [1]: 2 | | [2]: 313 |
lookup,80000,48.623467
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
50.97, 134.96, 1.57, 0.00, 0.00, 0.00, 0.19, 20.56, 10000000, 2.65, 1.00, 2.48
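A back-of-the-envelope check confirms this fits: assuming the node sizes from the ART paper (Node256 = 16 B header + 256 child pointers) and pessimistically counting all 316 nodes above as Node256, the tree is well under 2 MiB.

```cpp
#include <cstddef>

// Node256 size per the ART paper: 16 B header + 256 * 8 B child pointers.
constexpr std::size_t kNode256Bytes  = 16 + 256 * 8;       // 2064 B
constexpr std::size_t kHugePageBytes = 2 * 1024 * 1024;    // 2 MiB

// Upper bound: treat all 316 nodes of the 80K-key tree as Node256.
constexpr std::size_t kTreeBytes = 316 * kNode256Bytes;    // 652224 B

static_assert(kTreeBytes < kHugePageBytes,
              "80K-key dense ART fits in one 2 MiB huge page");
```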
$ ./ART 1000000 0 z 0
Number of huge pages: 6
Number of nodes: 3924
node4_num:0
node16_num:1 | [Level 0]: 1 |
node48_num:0
node256_num:3923 | [Level 1]: 16 | | [Level 2]: 3907 |
lookup,1000000,38.300823
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
64.72, 138.07, 2.81, 0.00, 0.00, 0.00, 0.00, 26.10, 10000000, 2.13, 1.00, 2.48
$ ./ART 80000 2 z 0
Number of huge pages: 1
Number of nodes: 39217
node4_num:19964| [2]: 19649 | | [3]: 315 |
node16_num:3302| [2]: 3302 |
node48_num:0
node256_num:129 | [0]: 1 | | [1]: 128 |
lookup,80000,21.581766
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
114.68, 147.66, 3.45, 0.00, 0.00, 0.00, 0.93, 46.32, 10000000, 1.29, 1.00, 2.48
lookup,80000,50.506096
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
49.08, 129.95, 1.57, 0.00, 0.00, 0.00, 0.18, 19.79, 10000000, 2.65, 1.00, 2.48
lookup,1000000,40.450459
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
61.29, 133.07, 2.82, 0.00, 0.00, 0.00, 0.00, 24.72, 10000000, 2.17, 1.00, 2.48
lookup,80000,22.441458
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
110.37, 142.68, 3.45, 0.00, 0.00, 0.00, 0.91, 44.55, 10000000, 1.29, 1.00, 2.48
A new algorithm is hard to come up with, because `N` can be arbitrary and is not bound to any power of 2. So I decided to form some layouts for the two cases above and check & benchmark them.
I think the Preorder Layout or the Level Order Layout is equivalent to the vEB Layout in this case. (Again, this depends on how the height of subtrees is defined when the height is not a power of 2.)
lookup,80000,50.474491
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
49.10, 129.95, 1.56, 0.00, 0.00, 0.00, 0.18, 19.80, 10000000, 2.65, 1.00, 2.48
lookup,80000,50.334205
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
49.21, 129.94, 1.56, 0.00, 0.00, 0.00, 0.19, 19.86, 10000000, 2.64, 1.00, 2.48
ART with 1M dense keys has only depth 3, but spans 4~6 pages. So I tried 1M dense keys with these two layouts.
lookup,1000000,40.393894
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
61.37, 133.06, 2.81, 0.00, 0.00, 0.00, 0.00, 24.75, 10000000, 2.17, 1.00, 2.48
lookup,1000000,40.499750
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
61.20, 133.06, 2.80, 0.00, 0.00, 0.00, 0.00, 24.68, 10000000, 2.17, 1.00, 2.48
lookup,80000,22.311670
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
110.97, 142.75, 3.49, 0.00, 0.00, 0.00, 0.92, 44.81, 10000000, 1.29, 1.00, 2.48
lookup,80000,22.411528
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
110.49, 142.72, 3.41, 0.00, 0.00, 0.00, 0.91, 44.61, 10000000, 1.29, 1.00, 2.48
I try to draw some conclusions from the simple experiments so far with 1 page (the dense case has up to 6 pages). I believe these experiments can serve as constructive guidelines; cases with 1000 pages are not much different.
In the setting of dense keys, almost all nodes of the ART are Node256, which has a size of 2064 bytes.
For such a huge node, I think cache-line alignment is not necessary [TODO]. For the dense workload, the layout is not helpful:
(The change is in the `lookup` function, but this change to the `lookup` function is not layout-relevant.) I think this conclusion generalizes to dense cases with even larger `N`, because the number of Node256 nodes increases.
Recall that the sparse workload generates ARTs with a complex node distribution.
Number of huge pages: 1
node4_num:5 | [Level 0]: 1 | | [Level 1]: 4 |
node16_num:0
node48_num:16| [Level 2]: 16 |
node256_num:400 | [Level 3]: 400 |
lookup,100000,23.532398
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
105.19, 172.08, 1.98, 0.00, 0.00, 0.00, 1.81, 42.49, 10000000, 1.64, 1.00, 2.48
lookup,100000,23.728901
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
104.33, 172.14, 1.89, 0.00, 0.00, 0.00, 1.79, 42.14, 10000000, 1.65, 1.00, 2.48
lookup,100000,23.693701
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
104.49, 172.12, 1.90, 0.00, 0.00, 0.00, 1.80, 42.20, 10000000, 1.65, 1.00, 2.48
lookup,100000,23.498673
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
105.35, 172.14, 1.92, 0.00, 0.00, 0.00, 1.80, 42.55, 10000000, 1.63, 1.00, 2.48
node4_num:1 | [Level 0]: 1 |
node16_num:2| [Level 1]: 2 |
node48_num:25| [Level 2]: 25 |
node256_num:1000 | [Level 3]: 1000 |
lookup,200000,31.612871
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
78.32, 163.85, 2.35, 0.00, 0.00, 0.00, 0.41, 31.63, 10000000, 2.09, 1.00, 2.48
lookup,200000,32.941871
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
75.20, 163.85, 2.30, 0.00, 0.00, 0.00, 0.39, 30.35, 10000000, 2.18, 1.00, 2.48
lookup,200000,32.711700
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
75.71, 163.85, 2.30, 0.00, 0.00, 0.00, 0.41, 30.56, 10000000, 2.16, 1.00, 2.48
lookup,200000,32.572210
cycles, instructions, L1-misses, LLC-misses, dTLB-load-misses, dTLB-store-misses, branch-misses, task-clock, scale, IPC, CPUs, GHz
76.02, 163.85, 2.30, 0.00, 0.00, 0.00, 0.42, 30.69, 10000000, 2.16, 1.00, 2.48
Node256 has a size of 2064 B, which spans 33 cache lines (2064 = 32 × 64 + 16, so it is not an exact multiple of the 64 B cache line). Given that size, I think it is not necessary to place each Node256 at a 64 B-aligned address.
Cited from the report in the repo:
- veb: For very large $n$, I expected this to be the second fastest, for all the same reasons as the btree, except that the constant $1/4$ is wrong because veb subtrees are not aligned with cache line boundaries (they have size $2^h-1$) and they're not always close to the right size. For example, a subtree of size 31 gets split into a top tree of size 7 and bottom subtrees of size 7, missing the magical 16 by a wide margin.
COADS: Cache-Oblivious Algorithms and Data Structures http://erikdemaine.org/papers/BRICS2002/paper.pdf
Sorted Layout:
Eytzinger Layout: No. I just see no heap indexing such as `2*i+1` and `2*i+2` in ART. (BTree with Internal Eytzinger Layout)
B Tree Layout:
Inorder Traversal & Preorder Traversal & Postorder Traversal & Level Order Traversal: Not cache friendly.
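For contrast, here is a sketch of the `2*i+1` / `2*i+2` heap indexing that the Eytzinger layout is built on and that has no analogue in ART's variable-fanout nodes. The function (a hypothetical helper, not from the repo) permutes a sorted array into Eytzinger order via an in-order walk of the implicit heap:

```cpp
#include <cstddef>
#include <vector>

// Eytzinger (BFS/heap) layout over a sorted array: node i has children
// at indices 2*i+1 and 2*i+2 -- exactly the indexing scheme absent in ART.
std::vector<int> eytzinger(const std::vector<int>& sorted) {
    std::vector<int> out(sorted.size());
    std::size_t pos = 0;
    // An in-order walk of the implicit heap assigns sorted values in order.
    auto fill = [&](auto&& self, std::size_t i) -> void {
        if (i >= out.size()) return;
        self(self, 2 * i + 1);      // left child
        out[i] = sorted[pos++];
        self(self, 2 * i + 2);      // right child
    };
    fill(fill, 0);
    return out;
}
```

For example, the sorted array {1,2,3,4,5,6,7} becomes {4,2,6,1,3,5,7}: the median at index 0, its two "children" next, and so on level by level.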
The vEB layout was proposed by Prokop [Section 10.2]: http://supertech.csail.mit.edu/papers/Prokop99.pdf
Memory Layouts for Binary Search: https://cglab.ca/~morin/misc/arraylayout/, https://cglab.ca/~morin/misc/arraylayout-v2/, https://news.ycombinator.com/item?id=10410676, https://github.com/patmorin/arraylayout, https://github.com/patmorin/arraylayout/blob/master/src/arraylayouts.ipynb
CACHE-OBLIVIOUS B-TREES: http://erikdemaine.org/papers/CacheObliviousBTrees_SICOMP/paper.pdf, http://erikdemaine.org/papers/CacheObliviousBTrees_SICOMP/, http://erikdemaine.org/papers/FOCS2000b/
Cache Oblivious Search Trees via Binary Trees of Small Height: https://cs.au.dk/~gerth/papers/soda02.pdf
Cache-Oblivious Algorithms[No VEB]: https://cs.uwaterloo.ca/~imunro/cs840/Notes16/frigo.pdf
https://jiahai-feng.github.io/posts/cache-oblivious-algorithms/
https://en.algorithmica.org/hpc/external-memory/oblivious/
Beyond binary search: parallel in-place construction of implicit search tree layouts: http://www2.hawaii.edu/~berneyk/publications/thesis.pdf, http://www2.hawaii.edu/~berneyk/publications/18-ipdps.pdf, https://github.com/algoparc/Tree-Layouts
Preserving order in a forest in less than logarithmic time: https://ir.cwi.nl/pub/6890
Design and implementation of an efficient priority queue: https://www.academia.edu/18316334/Design_and_implementation_of_an_efficient_priority_queue https://www.semanticscholar.org/paper/Design-and-implementation-of-an-efficient-priority-Boas-Kaas/c1a0525addef6262ecfb11339432c9d3405a8ed7
https://www.quora.com/How-is-a-van-Emde-Boas-tree-used-in-practice
https://11011110.github.io/blog/2010/01/14/van-emde-boas.html
https://news.ycombinator.com/item?id=9775920
The vEB Layout code strongly relies on a binary tree and computes recursively: https://github.com/patmorin/arraylayout/blob/master/src/veb_array.h
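A sketch of that recursion for a complete binary tree stored with BFS indices, using one common height-splitting convention (top tree of ⌊h/2⌋ levels, bottom subtrees of the remaining levels); note the whole construction assumes binary fanout, which is exactly what ART lacks:

```cpp
#include <cstddef>
#include <vector>

// Emit the van Emde Boas order of a complete binary subtree rooted at
// BFS index `root` with `h` levels (h >= 1). Children of BFS node i are
// 2*i+1 and 2*i+2. Convention here: top tree gets floor(h/2) levels.
void veb_order(std::size_t root, unsigned h, std::vector<std::size_t>& out) {
    if (h == 1) { out.push_back(root); return; }
    unsigned top = h / 2;          // levels in the top tree
    unsigned bottom = h - top;     // levels in each bottom subtree
    veb_order(root, top, out);     // lay out the top tree first...
    // ...then each bottom subtree, contiguously. The descendants of `root`
    // at relative depth `top` occupy BFS indices
    // [(root+1)*2^top - 1, (root+2)*2^top - 1).
    std::size_t first = ((root + 1) << top) - 1;
    std::size_t last  = ((root + 2) << top) - 1;
    for (std::size_t r = first; r < last; ++r)
        veb_order(r, bottom, out);
}
```

For h = 4 (15 nodes) this produces the classic order 0,1,2 | 3,7,8 | 4,9,10 | 5,11,12 | 6,13,14: a top triple followed by four contiguous bottom triples.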
However, ART on Huge Pages is different. For a `uint64_t` key:
Completeness of ART
ART is hard to make complete. The ART itself is not comparison-based, and an empty byte slot remains if the corresponding element is not inserted into the index.
The vEB layout and other comparison-based trees can easily be complete if `N` matches; any missing number does not hurt the completeness of the index.
Height of ART
The max height for a `uint64_t` key is 8. There is no direct dependency between depth and `N`. The height (and complexity) of radix trees depends on the length of the keys but in general not on the number of elements in the tree. Two additional techniques, path compression and lazy expansion, allow ART to efficiently index long keys by collapsing nodes and thereby decreasing the tree height. Figure 17 [ART] shows that the height of ART is less than 10 in TPC-C. Figures 2 and 3 [ART] are also height-relevant. [START] reduces ART's overall height to improve performance.
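The height bound of 8 comes from slicing a `uint64_t` key into one byte per tree level; a minimal sketch (the helper name is hypothetical):

```cpp
#include <array>
#include <cstdint>

// Split a 64-bit key into the 8 radix digits (one byte per tree level,
// most-significant byte first) that bound ART's height at 8.
std::array<uint8_t, 8> key_bytes(uint64_t key) {
    std::array<uint8_t, 8> b{};
    for (int level = 0; level < 8; ++level)
        b[level] = static_cast<uint8_t>(key >> (8 * (7 - level)));
    return b;
}
```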
Examples
Dense
1M Dense Keys
10M Dense Keys
100M Dense Keys
Sparse
1M Sparse Keys
10M Sparse Keys
100M Sparse Keys
Empirical Knowledge
Dense
Almost all nodes are Node256. The depth of the tree is less than 4, since 4 byte-levels already cover 2^(4*8) = 2^32 = 4294967296 ≈ 4294M keys.
Sparse
Levels 0 and 1 consist entirely of Node256; the bottom level has almost no Node256. The depth of the tree is more than 4.