econia-labs / econia

Hyper-parallelized on-chain order book for the Aptos blockchain
https://econia.dev
Other
134 stars 46 forks source link

Consider B+ map architecture #62

Open alnoki opened 1 year ago

alnoki commented 1 year ago

AVL queue background

During the time of implementation of Econia's AVL queue, Aptos' gas schedule was subject to regular changes and documentation was practically nonexistent. At the time, provided guidance indicated that gas optimality rested upon minimizing the number of bytes in a per-item operation, such that AVL queue data structures are bitpacked for a minimimized byte footprint.

However, the gas schedule has converged on a byte-wise cost schema that only applies after a critical number of bytes have been operated on, in effect rendering the first 1k bytes "free". This development essentially renders all AVL queue bitpacking operations unnecessary, to the extent that AVL queue gas computation costs are higher than required by the finalized schedule.

Moreover, with the AVL queue offering only 5-bit height tracking and 14-bit access keys, AVL queue height is limited to 18, with a maximum node count of 16383. One implication of this schema is an eviction paradigm whereby the tail of the AVL queue is evicted whenever no more nodes can be inserted.

From an order book persepctive, ideally this paradigm would minimize denial-of-service (DoS) attacks that attempt to increase gas costs by inserting excessive bogus orders: only 18 borrows max are required to search in a full tree. However, this paradigm enables a form of adversarial eviction whereby an attacker can clear out the order book by placing excessive transactions.

M-ary trees

Motivation

With the first 1k bytes of a global storage operation effectively free, and with computational costs dwarfed by storage costs, a proposed remedy is to replace the AVL queue with an m-ary tree-based structure, having at most $m$ children per node. Notably, an AVL tree is an m-ary tree with $m=2$, but in the proposed implementation, $m$ would be set to include as much data as possible in each node without exceeding 1k bytes.

Compared with a binary search tree with $n$ nodes having height on the order of $\log_2 n$, an m-ary tree has height on the order of $\log_m n$. Here, for example, with $m=5$ (up to 5 children per node) and 16383 nodes (the maximum number contained in an AVL queue), height is on the order of $\log_5 16383$ for an m-ary tree versus $\log_2 16383$ for an AVL queue:

>>> from math import log
>>> n = 2 ** 14 - 1
>>> round(log(n, 2), 2) # Binary
14.0
>>> round(log(n, 5), 2) # M-ary, m = 5
6.03

Notably, for up to $m=10$ children per node, height is on the order of only 6 for as many as 1 million orders:

>>> round(log(1_000_000, 10), 2) # M-ary, m = 10
6.0

Hence an m-ary tree essentially invalidates the need to implement eviction logic, as a malicious actor would have to place upwards of millions of orders to render order book operations prohibitively costly.

B+ trees

In particular, a B+ tree is proposed as the core m-ary tree data structure since it only contains data at leaf nodes, rendering inner nodes as compact as possible. Moreover, leaf node data in a B+ tree is sorted in a linked list, providing for much simpler indexing. With a B+ tree for both bids and asks, it may be appropriate to implement leaf nodes as a doubly linked list, to enable traversal in both ascending and descending order.

Additionally, several design methodologies are considered:

  1. Flat/partitioned: Here each order is assigned an order ID with a price and a sequence number, (bitwise complement of sequence number if a bid), similar to Econia versions 1 through 3. Each order ID is a key, and each corresponding econia::market::Order is a value. Leaf nodes contain a vector of structs each having a key and the table ID of a corresponding order. Orders are partitioned into separate table entries, and can be accessed in $O(1)$ lookup time via a "value access key" (value table ID). Indexing requires traversing the linked list and then borrowing the corresponding order in the separate data structure.
  2. Flat/compact: Similar to the flat/partitioned schema, but each leaf node contains keys and values in a vector. Here, a dual m- and l- schema will be required to specify the number of data items in a leaf. Simplifies indexing within the linked list of leaf nodes. Eliminates $O(1)$ lookup via an "access key", since the location of an Order can shift in memory due to node splitting.
  3. Nested: Institutes a B+ tree for price levels, and within each price level, another B+ tree for orders. Insertions will always push back onto an internal vector, reducing vector re-ordering operations on insertion. Potentially complicates indexing across all price levels. Allows for a potential cumulative size tracker to be implemented at each price level node, enabling easier price level indexing/slippage simulation (note that this would require updating the counter each time order size is modified).

Implementation considerations

Vector operations

Regardless of the approach, vectors will be required for storing multiple keys and "child pointers" (table indices) for each node. For example, consider an inner node storing $[1, 3, 7, 12, 15, 18]$ in a standard vector: Here, inserting $19$ requires a simple (and inexpensive) push back operation, but inserting $2$ requires an $O(n)$ operation, either by re-orering essentially the entire vector after a push back via swaps,

$$[1, 3, 7, 12, 15, 18, 2]$$

$$[1, 2, 7, 12, 15, 18, 3]$$

$$[1, 2, 3, 12, 15, 18, 7]$$

$$[1, 2, 3, 7, 15, 18, 12]$$

$$[1, 2, 3, 7, 12, 18, 15]$$

$$[1, 2, 3, 7, 12, 15, 18]$$

or by using a tmp variable approach:

$$[1, 3, 7, 12, 15, 18], tmp = 2$$

$$[1, 2, 7, 12, 15, 18], tmp = 3$$

$$[1, 2, 3, 12, 15, 18], tmp = 7$$

$$[1, 2, 3, 7, 15, 18], tmp = 12$$

$$[1, 2, 3, 7, 12, 15], tmp = 15$$

$$[1, 2, 3, 7, 12, 15, 18]$$

One solution is to use a singly linked list, but this approach eliminates the kind of binary lookup made possible during search operations.

Notably, with market fills operating on the head or tail of the leaf node doubly linked list, one potential approach is to use a static vector for inner nodes, and a doubly linked list of data items within each leaf node: in a flat/compact approach, for example, leaf node lookup makes use of a binary search within inner node vectors, and linear search at leaf nodes. Here, popping the head of the entire data structure does not require updating every element within the corresponding leaf node's internal doubly linked list.

Notably, actual gas operations may render a static vector the least expensive, due to average number of expected operations, previous/next pointer borrows and upates, etc.

In a static vector implementation, insertion operations that require breaking up a tree could implement the sorted insert during the split operation, to reduce vector lookup.

Operations may be based on Aptos' SimpleMap implementation.

Iteration

Should internal node indices be required for iteration, it is proposed that an Iterator or Index struct be used rather than passing table entry IDs, which essentially function as memory pointers.

Nested list style

In a nested approach, where price levels are generally static over time and order sequence numbers are constantly churning over, it may be practial to use a doubly linked list inside nodes for the sequence number dimension, and a static vector inside nodes for the price number dimension. With operations inside each price level typically either popping from the front or appending to the tail of the resultant price-time priority queue, operations tend toward $O(1)$ within the sequence number dimension, and $O(\log n)$ within the price level dimension (due to binary search within a price level node).

Sample implementation

See below for an analysis of a flat/partitioned sample implementation.

Structs

/// Node IDs unset at bit 63, value IDs set at bit 63.
struct BplusTree<Value> has store {
    /// Tree order: maximum number of children per node.
    order: u64
    /// Node ID of root node. `NIL` if empty.
    root: u64,
    /// Map from node ID to `Node`.
    nodes: TableWithLength<u64, Node>,
    /// Map from value ID to `Value`.
    values: TableWithLength<u64, WrappedValue<Value>,
    /// Minimum key in tree.
    min_key: u128,
    /// Maximum key in in tree.
    max_key: u128,
    /// Minimum key node ID: if included, must be updated when
    /// minimum key's node changes.
    min_key_node_id: u64,
    /// Maximum key node ID: if included, must be updated when
    /// maximum key's node changes.
    max_key_node_id: u64,
    /// Value ID for minimum key.
    min_key_value_id: u64,
    /// Value ID for maximum key.
    max_key_value_id: u64
    /// Node ID at top of inactive stack.
    node_stack_top: u64
    /// Value ID at top of inactive stack.
    value_stack_top: u64
}

struct WrappedValue<Value> has store {
    /// `NIL` if active. Else value ID of next inactive value in
    /// stack.
    next: u64,
    /// Optional value. None if inactive.
    value: Option<Value>,
    /// Insertion key, if active.
    key: u128
}

struct Node has store {
    /// `ROOT`, `INNER, `LEAF`, or `INACTIVE`.
    type: u8
    /// Node ID of parent, if any.
    parent: u64,
    /// If `LEAF`, node ID of next node in linked list, if any. If
    /// `INACTIVE`, node ID of next inactive node in stack, if any.
    next: u64
    /// Same as `next`, but for previous node in linked list.
    previous: u64
    /// Empty if inactive.
    items: vector<Item>
}

struct Item has store {
    /// Ignored if 0th item in vector for `ROOT` or `INNER` node.
    key: u128,
    /// If `LEAF`, value ID for given key. If `ROOT` or `INNER`,
    /// child node ID for given key's immediate right child
    /// (leftmost child when 0th item in vector).
    id: u64
}

Height

>>> f = int((8 + 64 * 3 + 8) / 8)
>>> f
26
>>> i = int((128 + 64) / 8)
>>> i
24

$$f + im = s$$ $$im = s - f$$ $$m = \lfloor (s - f) / i \rfloor$$

>>> s = 1024
>>> m = floor((s - f) / i)
>>> m
41
>>> for m in range(4, m + 1, 4):
...     height = log(10_000, m / 2)
...     print(f'm: {m}, height: {height}')
...
b: 4, height: 13.28771237954945
b: 8, height: 6.643856189774725
b: 12, height: 5.140388835753876
b: 16, height: 4.429237459849817
b: 20, height: 4.0
b: 24, height: 3.7065136321165078
b: 28, height: 3.490011478196624
b: 32, height: 3.3219280948873626
b: 36, height: 3.186559080787649
b: 40, height: 3.074487147360964
>>> for m in range(4, m + 1, 4):
...     height = log(10_000_000, m / 2)
...     print(f'm: {m}, height: {height}')
...
m: 4, height: 23.25349666421154
m: 8, height: 11.62674833210577
m: 12, height: 8.995680462569283
m: 16, height: 7.75116555473718
m: 20, height: 7.0
m: 24, height: 6.486398856203888
m: 28, height: 6.107520086844092
m: 32, height: 5.813374166052885
m: 36, height: 5.5764783913783855
m: 40, height: 5.380352507881686

Gas

// vec
[vec_len_base: InternalGas, "vec_len.base", 220 * MUL],
[vec_imm_borrow_base: InternalGas, "vec_imm_borrow.base", 330 * MUL],
[vec_mut_borrow_base: InternalGas, "vec_mut_borrow.base", 330 * MUL],
[vec_push_back_base: InternalGas, "vec_push_back.base", 380 * MUL],
[vec_pop_back_base: InternalGas, "vec_pop_back.base", 260 * MUL],
[vec_swap_base: InternalGas, "vec_swap.base", 300 * MUL],
[vec_pack_base: InternalGas, "vec_pack.base", 600 * MUL],
[
    vec_pack_per_elem: InternalGasPerArg,
    "vec_pack.per_elem",
    40 * MUL
],
[vec_unpack_base: InternalGas, "vec_unpack.base", 500 * MUL],
[
    vec_unpack_per_expected_elem: InternalGasPerArg,
    "vec_unpack.per_expected_elem",
    40 * MUL
],

Storage gas costs are defined thusly:

let item_config = UsageGasConfig {
    target_usage: 2 * k * m, // 2 billion
    read_curve: base_8192_exponential_curve(300 * k, 300 * k * 100),
    create_curve: base_8192_exponential_curve(5 * m, 5 * m * 100),
    write_curve: base_8192_exponential_curve(300 * k, 300 * k * 100),
};
crate::natives::define_gas_parameters_for_natives!(GasParameters, "table", [
    // Note(Gas): These are legacy parameters for loading from storage so they do not
    //            need to be multiplied.
    [.common.load_base, "common.load.base", 8000],
    [.common.load_per_byte, "common.load.per_byte", 1000],
    [.common.load_failure, "common.load.failure", 0],

    [.new_table_handle.base, "new_table_handle.base", 1000 * MUL],

    [.add_box.base, "add_box.base", 1200 * MUL],
    [.add_box.per_byte_serialized, "add_box.per_byte_serialized", 10 * MUL],

    [.borrow_box.base, "borrow_box.base", 1200 * MUL],
    [.borrow_box.per_byte_serialized, "borrow_box.per_byte_serialized", 10 * MUL],

    [.contains_box.base, "contains_box.base", 1200 * MUL],
    [.contains_box.per_byte_serialized, "contains_box.per_byte_serialized", 10 * MUL],

    [.remove_box.base, "remove_box.base", 1200 * MUL],
    [.remove_box.per_byte_serialized, "remove_box.per_byte_serialized", 10 * MUL],

    [.destroy_empty_box.base, "destroy_empty_box.base", 1200 * MUL],

    [.drop_unchecked_box.base, "drop_unchecked_box.base", 100 * MUL],
]);
>>> for m in range(4, m + 1, 4):
...     height = log(1000, m / 2)
...     print(f'm: {m}, height: {height}')
... 
m: 4, height: 9.965784284662087
m: 8, height: 4.9828921423310435
m: 12, height: 3.855291626815406
m: 16, height: 3.3219280948873626
m: 20, height: 2.9999999999999996
m: 24, height: 2.7798852240873804
m: 28, height: 2.617508608647468
m: 32, height: 2.4914460711655217
m: 36, height: 2.3899193105907366
m: 40, height: 2.3058653605207224

Conclusions

@Arinerron @chad-mcdonald @chen-robert @fcremo

alnoki commented 1 year ago

Flat/compact implementation

Structs


/// A storage-efficient sorted map based on a B+ tree.
///
/// Inner Node IDs unset at bit 63, leaf node IDs set at bit 63.
struct BPlusMap<V> has store {
    /// Maximum number of children, $m$, to an inner node.
    inner_node_order: u8,
    /// Maximum number of key-value pairs, $l$, in a leaf node.
    leaf_node_order: u8,
    /// `NIL` if empty, otherwise node ID of root node, which may be
    /// an inner node or a leaf node.
    root: u64,
    /// Minimum key in map, ignored if empty.
    min_key: u128,
    /// Maximum key in map, ignored if empty.
    max_key: u128,
    /// Minimum key's leaf node ID, `NIL` if empty. The head of the
    /// doubly linked list of leaf nodes.
    head_node_id: u64,
    /// Maximum key's leaf node ID, `NIL` if empty. The tail of the
    /// doubly linked list of leaf nodes.
    tail_node_id: u64,
    /// Map from inner node ID to inner node.
    inner_nodes: TableWithLength<u64, InnerNode>,
    /// Map from leaf node ID to leaf node.
    leaf_nodes: TableWithLength<u64, LeafNode<V>,
    /// Inner node ID at top of inactive stack. `NIL` if no inactive
    /// nodes.
    inner_stack_top: u64,
    /// Leaf node ID at top of inactive stack. `NIL` if no inactive
    /// nodes.
    leaf_stack_top: u64
}

/// Inactive if no children, else active. Inactive inner nodes are
/// stored in a stack.
struct InnerNode has store {
    /// | Node status | Meaning if `NIL`   | Meaning if not `NIL` |
    /// |-------------|--------------------|----------------------|
    /// | Active      | Node is root       | Parent inner node ID |
    /// | Inactive    | Is bottom of stack | Next node in stack   |
    parent: u64,
    /// Inactive if empty, else an active node.
    children: vector<Child>
}

/// Indicates a child to an inner node, either another inner node or
/// a leaf node.
struct Child has store {
    /// The minimum key allowed in the subtree with the given node
    /// ID. Ignored if the leftmost child of an inner node.
    min_key: u128,
    /// Node ID of corresponding child.
    node_id: u64
}

/// Inactive if no pairs, else active. Active leaf nodes are stored
/// in a sorted doubly linked list (DLL), and inactive leaf nodes
/// are stored in a stack.
struct LeafNode<V> has store {
    /// | Node status | Meaning if `NIL` | Meaning if not `NIL` |
    /// |-------------|------------------|----------------------|
    /// | Active      | Node is root     | Parent inner node ID |
    /// | Inactive    | Ignored          | Ignored              |
    parent: u64,
    /// | Node status | Meaning if `NIL` | Meaning if not `NIL` |
    /// |-------------|------------------|----------------------|
    /// | Active      | Is DLL list head | Previous DLL node    |
    /// | Inactive    | Ignored          | Ignored              |
    previous: u64,
    /// | Node status | Meaning if `NIL`   | Meaning if not `NIL` |
    /// |-------------|--------------------|----------------------|
    /// | Active      | Is DLL list tail   | Next DLL node        |
    /// | Inactive    | Is bottom of stack | Next node in stack   |
    next: u64,
    /// Key-value pairs sorted by key.
    pairs: vector<Pair<V>>
}

/// A key-value pair.
struct Pair<V> has store {
    /// Inserted key.
    key: u128,
    /// Inserted and optionally mutated value.
    value: V
}

Implementation details

Unlike Aptos' SimpleMap, which relies on the computationally-intensive aptos_framework::comparator module, the above implementation relies only on the primitive integer comparators <, >, <=, >=, !=, ==, etc. This approach minimizes external function calls, but disallows for generality of the key type: ideally the data structure would support any kind of integer key, but there is presently no Move support for a generic integer type. Hence the most general u128 is chosen, with the expectation of upcasting for u8 or u64 keys.

It is proposed that public APIs mimic those of Rust's BTreeMap.

Ideally simple helper functions of the form is_leaf_node_id() and to_leaf_node_id() would support function inlining to reduce function call gas costs. If inlining is not supported, however, functions can still be written as if they were to be inlined by the compiler, then manually inlined later.

Ideally the Aptos Framework would provide getters for the number of free bytes per global storage operation. Here, public functions for the BPlusMap could accept an immutable reference to a V type, and calculate the corresponding optimal $l$ value accordingly. As for $m$, this may hinge on gas optimizations, but could potentially be derived from the free bytes amount if testing reveals that the gas cost bottleneck is how compact inner nodes are.

See Beingessner 2022 for more optimality discussions and a comparisons with Rust's BTreeMap.

Sibling pointers may additionally be helpful for inner nodes, due to sibling adoption operations.

alnoki commented 1 year ago

More implementation resources

Note that Beingessner 2022 describes a vector of edges as well as a vector of keys, a stack for tacking vector indices for each level into the tree (e.g. going back up a level to update a child pointer), and tracks the height of the tree as well as the length (number of records).

alnoki commented 1 year ago

List pointers in a level

It may be useful to include previous/next pointers in both inner and leaf nodes, for tracking the previous/next node in the doubly linked list at a given level. This could make for less costly iteration if spillover/adoption algorithms are implemented per a B*+ tree.

alnoki commented 1 year ago

Iterator approach

It is proposed that the structure issue an Iterator struct containing a leaf node index and a vector index within, to ease the iteration process by reducing search/borrow costs. Here, the Iterator shall contain fields that essentially function as pointers into the B+ map memory. As such, it is suggested that the B+ map structure include an iterator_issued: bool flag that prohibits mutations/further iterator issuance when marked true. Per this approach, there shall be no mismatch between publicly exposed indices and the intended internal targets, since operations shall be read-only when an iterator is issued:


module econia::b_plus_map {

    struct BPlusMap has store {
        iterator_issued: bool,
        // More fields...
    }

    struct Iterator {
        leaf_node_index: u64,
        vector_index: u64
    }

}

For simplicity it was considered that iteration should rely on the borrow checker to prevent mutations, but at present it is unclear if this is possible: an initial hypothetical approach involved having iterator be a field in the B+ map struct, and having a borrow_iterator() require and return a mutable reference to the B+ map, as well as to an iterator. Here, the intent was that a read-only operation would not violate the borrow checker, but a mutation operation would. Notably, however, this prohibits legitimate iterated mutations, and may prohibit read-only traversals.


module econia::b_plus_map {

    struct BPlusMap has store {
        iterator: Option<Iterator>,
        // More fields...
    }

    struct Iterator {
        //
        leaf_node_index: u64,
        vector_index: u64
    }

    fun example_calls(
        map_ref_mut: &mut BPlusMap
    ) {
        // Borrow iterator at head.
        let iterator_mut: &mut Iterator = get_iterator_at_head(map_ref_mut);
        // Borrow value of next key (may violate borrow checker).
        let next_value_ref = borrow_next(map_ref_mut, iterator_mut);
        // Get value, then iterate to next key (definitely violates
        // borrow checker).
        let value = pop_and_iterate_to_next(map_ref_mut, iterator_mut);
    }

}

Alternatively, it may be useful to still store the iterator in the BPlusMap and not return a mutable reference to it, but simply provide APIs like iterate_from_head(), iterate_from_tail(), stop_iteration(), etc, that update the state of the Iterator. Then when it is some, callers can iterate_next().

Cursor approach

Indeed this approach may be more appropriately considered a "cursor", with mutation operations flushing the cursor.


module econia::b_plus_map {

    struct BPlusMap has store {
        cursor: Option<Cursor>,
        // More fields...
    }

    struct Cursor {
        key: u128,
        leaf_node_index: u64,
        vector_index: u64
    }

}

Here, a search operation for a key updates the cursor, so that it can be borrowed against without multiple table lookups: if an operation needs to check before inserting, it can do one search, then lookup straight from the cursor in $O(1)$ time.

Operations that only alter values from a key-value pair do not need to flush the cursor, so some mutation operations may be exempt from flushing logic. Additionally, to save on function calls like option::is_some(), the node index 0 could be reserved for NIL, indicating no cursor when leaf_node_index is NIL:


module econia::b_plus_map {

    struct BPlusMap has store {
        cursor_key: u128,
        /// If `NIL` cursor is empty.
        cursor_leaf_node_index: u64,
        cursor_vector_index: u64
        // More fields...
    }

}

u256 support

Should Aptos support u256 natives, it may be most appropriate to support u256 keys as the most general case. This may lead to being able to store fewer keys in an inner node, but if vector operation costs are sufficiently high then optimality conditions for inner node order $m$ may be unaltered.

alnoki commented 1 year ago

Additional functionality considerations

Cursor structuring

For clarity, it may be useful to use the first structure above, with an option-packed Cursor inside the BPlusMap struct. Alternatively, a Cursor not packed in an option may reduce additional calls to the option module:

module econia::b_plus_map {

    struct BPlusMap has store {
        cursor: Cursor,
        // More fields...
    }

    struct Cursor {
        key: u128,
        leaf_node_index: u64,
        vector_index: u64
    }

}

Here, helper functions of the form is_cursor_active() could simply inspect the leaf node index to see if it is non-null. Additionally, the cursor could serve as a kind of cache, whereby operations first check the cursor to see if it is pointing at a relevant record to look up. Here, additional fields could be used to say the min/max permissible key inside of a leaf node, the closest match, etc:

module econia::b_plus_map {

    struct Cursor {
        /// Either found key or closest key from a search.
        key: u128
        vector_index: u64,
        /// `FOUND` or `CLOSEST`.
        key_type: bool,
        /// `NIL` if cursor inactive.
        leaf_node_index: u64,
        min_key_in_node: u128,
        max_key_in_node: u128,
        /// Minimum permissible key in node.
        node_key_floor: u128,
        /// Maximum permissible key in node.
        node_key_ceiling: u128,
    }

}

Note that spillover and rebalancing operations may make floor/ceiling fields burdensome to keep current.

Inlining

Helper functions like is_cursor_active() could potentially be inlined per aptos-core #4647, but only if they are inside of the given module: option calls reference an external module and may thus present a challenge for inlining.

Storage optimality

An official number of free bytes, along with more storage details, are expected to be revealed in the coming months per aptos-core #4625.