essential-contributions / essential-server

Centralized implementation of the Essential declarative protocol
Apache License 2.0

Make use of key ranges #112

Open freesig opened 3 months ago

freesig commented 3 months ago

In both SQLite and BTreeMap a range of keys with their values can be retrieved from storage. What we want is to fill in the keys that aren't in storage but are within the range with empty vecs. This would be a lot faster than incrementing the key and then calling query once per key.

There is a question around key order. When incrementing a key you get the next key of the same length. However, when keys are stored in sorted order, a longer key can fall within the range. For example:

[0, 0] < [0, 1]
[0, 0] < [0, 0, 1] < [0, 1]

For a sparse range this makes sense, but if we try to make it dense (return empty vecs for missing keys) then the number of missing keys is effectively infinite. So it makes sense to only use ranges over keys with the same dimensions. However, this is not how they are stored. We could possibly filter out keys of different lengths in our SQL query or BTreeMap iterator.
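A minimal sketch of that over a BTreeMap, assuming keys and values are Vec<i64> and the range is addressed as a start key plus a count (names and signatures here are illustrative, not the actual storage API):

use std::collections::BTreeMap;

type Key = Vec<i64>;
type Value = Vec<i64>;

/// Values for `count` contiguous keys starting at `start`, walking the
/// sorted range once and skipping stored keys of a different length.
/// Keys missing from storage yield an empty vec.
fn key_range_fixed(store: &BTreeMap<Key, Value>, start: Key, count: usize) -> Vec<Value> {
    let len = start.len();
    let mut stored = store
        .range(start.clone()..)
        .filter(|(k, _)| k.len() == len)
        .peekable();
    let mut out = Vec::with_capacity(count);
    let mut key = start;
    for _ in 0..count {
        match stored.peek() {
            // The next stored key is exactly the one we expect: take its value.
            Some((k, _)) if **k == key => {
                let (_, v) = stored.next().unwrap();
                out.push(v.clone());
            }
            // Nothing stored under this key: fill in an empty vec.
            _ => out.push(Vec::new()),
        }
        key = next_key(key);
    }
    out
}

/// Successor of `key` without changing its length (overflow just carries).
fn next_key(mut key: Key) -> Key {
    for w in key.iter_mut().rev() {
        if let Some(n) = w.checked_add(1) {
            *w = n;
            return key;
        }
        *w = i64::MIN; // carry into the next word
    }
    key
}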

freesig commented 3 months ago

@mitchmindtree Curious what your thoughts on this are?

mitchmindtree commented 3 months ago

Yeah, I think your proposal of ensuring returned key ranges have the same size makes a lot of sense. We could document this behaviour in the Pint and VM's state read docs so that contract devs are aware and can plan accordingly when they design their state layout in the case that they require it to be range-accessible.

I agree it makes sense to have a dedicated SQL query that queries a range of keys, but ensures the lengths match.

Key Encoding

Seeing as our keys are currently base64 encoded I'm not sure we could apply this check directly on the key itself in SQL without decoding it first (I'm not sure if SQL can do this?). If not, we could consider using hex encoding for the keys to ensure the length of the encoded keys correlates directly to their byte length.

SQL implementation?

I'm less familiar with exactly how we'd implement this in SQL, though if it's anything like BTreeMap I think a filter is reasonable? As long as we document that inserting keys of a different length within the lexicographical ordering of a key range you want to query will increase the cost of those queries, then it can be up to the user to design their state layout in a way that's cost-effective to query.

I'm not sure exactly how we would keep track of this cost though. E.g. if a user stores [0] and [1] and some massive number of keys in between (e.g. [0, 0, 0, 1], [0, 0, 0, 2], etc), then when they query the range [0]..=[1], only two values are yielded, but SQL is doing a lot more work. Maybe we'd need a way to count the total number of keys visited (including those skipped by filtering) and multiply that by some base cost to determine the actual StateRead::KeyRange cost, based on measurements with different key counts, or something along those lines?
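One rough way to measure that with rusqlite, assuming a state table with an encoded key column (the table and column names are made up for illustration):

use rusqlite::{params, Connection, Result};

/// Count every row the range scan has to visit, including rows whose key
/// length doesn't match and which the query ultimately filters out. The
/// `visited` count could feed a per-row base cost for the key range op.
fn range_scan_cost(conn: &Connection, start: &str, end: &str, key_len: usize) -> Result<(i64, i64)> {
    // Total rows touched by the lexicographic range scan.
    let visited: i64 = conn.query_row(
        "SELECT COUNT(*) FROM state WHERE key >= ?1 AND key <= ?2",
        params![start, end],
        |row| row.get(0),
    )?;
    // Rows actually yielded once the fixed-length filter is applied.
    let yielded: i64 = conn.query_row(
        "SELECT COUNT(*) FROM state WHERE key >= ?1 AND key <= ?2 AND LENGTH(key) = ?3",
        params![start, end, key_len as i64],
        |row| row.get(0),
    )?;
    Ok((visited, yielded))
}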

mitchmindtree commented 3 months ago

The more I think about my last comment, the less I'm sure. It would certainly be different behaviour to querying a btreemap range as they don't inherently filter. Maybe we also need a KeyRangeSparse op that also yields the key alongside each value? For these sparse, variable key length ranges? 🤔🤔🤔

freesig commented 3 months ago

SQL can compare strings or blobs lexicographically but I'm not sure if base64 encoding keeps this ordering or not. We could test it though.
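A quick way to test that, assuming the base64 (0.21+) and hex crates are available (just a sanity check, not part of the server):

use base64::{engine::general_purpose::STANDARD, Engine as _};

fn main() {
    // [0, 0] < [0, 0, 1] < [0, 1] as raw bytes; check whether the encoded
    // strings still compare in that order.
    let keys: [&[u8]; 3] = [&[0, 0], &[0, 0, 1], &[0, 1]];
    let b64: Vec<String> = keys.iter().map(|k| STANDARD.encode(k)).collect();
    let hexed: Vec<String> = keys.iter().map(|k| hex::encode(k)).collect();
    println!("base64 sorted like the bytes? {}", b64.windows(2).all(|w| w[0] < w[1]));
    println!("hex sorted like the bytes?    {}", hexed.windows(2).all(|w| w[0] < w[1]));
}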

As far as I can tell SQL will behave exactly like btreemap.range(key..(key + 20)).filter(|(k, _)| k.len() == key.len()), but this does have the problem that the iterator or SQL will potentially go through a large range of other keys before the filter or WHERE clause drops them.

freesig commented 3 months ago

If we don't do this and keep the current approach of incrementing keys and then doing a query for each key, we have essentially wasted all the benefits of a BTreeMap and may as well be using a HashMap design.

freesig commented 3 months ago

The other option would be to have KeyRange be ValueRange and return the (key, value) for each key within the range (regardless of key length). This would mean that missing keys don't return empty values; they simply don't appear. It also means the keys that had values need to be put into the StateSlots so you know which keys you got a hit on.
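A sketch of that variant over a BTreeMap (same illustrative Vec<i64> keys and values as above):

use std::collections::BTreeMap;

/// Every stored (key, value) pair whose key falls within `start..=end`,
/// whatever its length; keys with no stored value simply don't appear.
fn value_range(
    store: &BTreeMap<Vec<i64>, Vec<i64>>,
    start: Vec<i64>,
    end: Vec<i64>,
) -> Vec<(Vec<i64>, Vec<i64>)> {
    store
        .range(start..=end)
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect()
}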

freesig commented 3 months ago

In a real application things will probably have a layout like:

Vector

storage {
  a: int[],
  b: b256[],
}

Would be something like: A keys: [0, 0], [0, 1], [0, 2]. B keys: [1, 0], [1, 1].

Map

storage {
  a: string => int,
  b: b256 => int,
}

Would be something like: A keys: [0, -4, 5, 1], [0, 22, 88, 66, 4], [0, 9, -42]. B keys: [1, -3, 22, 9, 7], [1, 0, 0, 0, 1].

The only time the keys are different lengths is in the string-to-int map, where keys can be variable-length strings.

freesig commented 3 months ago

The question is what does it mean to do something like:

let vals: int[] = storage::a["abc..d"];

This seems more like you want to get the values that exist between abc and d, so it should probably be:

 let vals: {string, int}[] = storage::a::iter("abc..d");

Whereas

let key: b256;
let vals: int[10] = storage::b[key..(key + 10)];

This probably wants the values in the range, and empty values where a key is missing. Maybe it has to be:

let key: b256;
let vals: {exists: bool, value: int}[10] = storage::b[key..(key + 10)];

mitchmindtree commented 3 months ago

Contiguous and Sparse Storage

Yeah I think there are two distinct use-cases that you're hinting at with the Vector and Map comparison:

  1. Contiguous element collections with fixed-length keys (e.g. Vecs).
  2. Sparse element collections with variable length keys (e.g. sets or maps).

I wonder if we want two ops to efficiently support both use cases?

StateRead::KeyRangeFixed

I.e. the existing op we have today:

StateRead::KeyRangeSparse


Technically, KeyRangeSparse could probably solve all cases that KeyRangeFixed can (with some manual filtering of keys in ASM...), but having a dedicated KeyRangeFixed op where we use SQL to automatically filter out all non-matching key lengths would probably be vastly more efficient for contiguous collections with fixed-length keys.
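A sketch of what the two ops could yield, written as signatures only (these mirror the proposal above; they are not an existing API):

type Key = Vec<i64>;
type Value = Vec<i64>;

/// Illustrative shapes for the two proposed reads.
trait StateReadOps {
    /// KeyRangeFixed: values for `count` contiguous keys with the same
    /// length as `start`; a key missing from storage yields an empty vec.
    fn key_range_fixed(&self, start: &Key, count: usize) -> Vec<Value>;

    /// KeyRangeSparse: every stored (key, value) pair whose key falls in
    /// `start..=end`, whatever its length; missing keys yield nothing.
    fn key_range_sparse(&self, start: &Key, end: &Key) -> Vec<(Key, Value)>;
}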

freesig commented 3 months ago

Yeh this makes sense except:

"Values are returned to the stack along with their slot indices."

Values aren't returned to the stack, they are returned to the state slots.

"Keys are returned to the stack alongside their values."

This is where it gets tricky. We would need to modify state slots to be:

enum StateSlot {
  Kv(Key, Value),
  Value(Value),
}

Which isn't good for memory usage because we pay for the larger variant either way.
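For example (assuming Key and Value are both Vec aliases), the enum is as large as its bigger Kv variant regardless of which variant a slot actually holds:

type Key = Vec<i64>;
type Value = Vec<i64>;

enum StateSlot {
    Kv(Key, Value),
    Value(Value),
}

fn main() {
    // Every slot pays for the larger Kv(Key, Value) layout, roughly two
    // Vec headers (pointer, length, capacity) instead of one.
    println!("StateSlot: {} bytes", std::mem::size_of::<StateSlot>());
    println!("Value only: {} bytes", std::mem::size_of::<Value>());
}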

freesig commented 3 months ago

Another option might be to put the key and value into a single state slot like:

[key_len, key_0, ...key_N, val_0, ...val_n]
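A sketch of that packing, assuming slots are flat vectors of Word = i64 (the helper names are made up):

type Word = i64;

/// Pack a key and value into one flat slot: [key_len, key..., value...].
fn pack_slot(key: &[Word], value: &[Word]) -> Vec<Word> {
    let mut slot = Vec::with_capacity(1 + key.len() + value.len());
    slot.push(key.len() as Word);
    slot.extend_from_slice(key);
    slot.extend_from_slice(value);
    slot
}

/// Split a packed slot back into (key, value), if the length prefix is sane.
fn unpack_slot(slot: &[Word]) -> Option<(&[Word], &[Word])> {
    let (len, rest) = slot.split_first()?;
    let key_len = usize::try_from(*len).ok()?;
    if key_len > rest.len() {
        return None;
    }
    Some(rest.split_at(key_len))
}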