
Some notes and questions on Prolly Trees #3878

Closed · mikeal closed 3 years ago

mikeal commented 3 years ago

Heya, great work!

We had been looking for a deterministic hash linked b-tree for some time. Since we hadn’t found your work until recently, we did a bunch of our own research and arrived at an almost identical solution. As such, there are a few notes I should probably share with you.

The first, and probably the most important, is that using a rolling hash as the primary way of finding chunk boundaries is less than ideal and introduces unnecessary inefficiencies. We also started with a rolling hash; the original thought process for the trees was to treat it “like a chunking problem” and therefore apply a Rabin chunker to it. But we quickly realized that there was a better method.

If you simply take the entire entry (key and value) and hash it, you get a deterministic, securely randomized binary. You don’t need to “roll” anything: you get a fully deterministic identifier you can convert to an integer that is probabilistically distributed across the entire binary address space. Unlike a rolling hash, it won’t be affected by changes to entries near it. The math for causing a split is easy, and probably identical to what you are already doing: you just divide MAX_INTEGER by the target size of the chunks.
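
Here’s a rough sketch of that split test in Go (SHA-256 and the 32-bit prefix are arbitrary choices for illustration, not what either implementation actually uses):

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math"
)

// isBoundary hashes the whole entry (key + value) and treats the first
// 4 bytes of the digest as a uniformly distributed uint32. A split happens
// when that value falls below MaxUint32 / targetEntriesPerChunk, so on
// average one entry in targetEntriesPerChunk closes a chunk. The decision
// depends only on this entry, never on its neighbors.
func isBoundary(key, value []byte, targetEntriesPerChunk uint32) bool {
	h := sha256.New()
	h.Write(key)
	h.Write(value)
	digest := h.Sum(nil)
	v := binary.BigEndian.Uint32(digest[:4])
	return v < math.MaxUint32/targetEntriesPerChunk
}

func main() {
	fmt.Println(isBoundary([]byte("k1"), []byte("v1"), 64))
}
```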

The more consistent the chunk boundaries, the fewer chunk merges you’ll encounter when mutating the tree, so a fully deterministic hash function is preferable.

Now, we’re working on some use cases that have very different trust relationships than you do, so we’re actually worried about a potential attack on the tree that would aggressively blow up the block size. We’re looking to solve that with a secondary rolling hash with a linearly closing window that is only applied to the back of a chunk after it reaches a certain size. But you probably don’t need to worry about that, and a rolling hash alone won’t even solve it for you: an attacker can just as easily craft entries it knows won’t cause a split in a rolling hash as it can with an atomic hash.

The next thing I should share is something I realized while implementing the trees. Mutations occur more often in the top of the tree, with an increasing probability from the leaves to the root where it reaches a 100% likelihood of mutation on any given change.

Since you have to mutate trees from the leaves on up, you can actually adjust the branching factor as you work your way up the tree while building it. You don’t have to use the same branching factor for leaves and branches: you can make the leaves rather large and then shrink the branching as you move up the tree until you hit a cutoff point. This lets you reduce the size of the branches towards the top of the tree, which brings down the size of the blocks you orphan as you mutate the tree.
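
As a hedged sketch of what I mean (the numbers and the halving schedule are made up for illustration, not what any implementation uses): make the chunk-size target a function of the tree level and feed it into the same split threshold as above.

```go
package main

import "fmt"

// targetForLevel is a hypothetical schedule: leaves (level 0) aim for a
// large branching factor, and each level above shrinks the target until it
// hits a floor, so blocks near the root stay small and orphan less data
// when they are rewritten.
func targetForLevel(level int) uint32 {
	const leafTarget = 256 // entries per leaf chunk
	const floor = 16       // minimum entries per interior chunk
	t := leafTarget >> uint(2*level) // quarter the target at each level up
	if t < floor {
		return floor
	}
	return uint32(t)
}

func main() {
	for level := 0; level < 5; level++ {
		fmt.Printf("level %d: target %d entries/chunk\n", level, targetForLevel(level))
	}
}
```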

Finally, I’ve actually implemented a SQL database with about 40% of the total SQL grammar. Reading your docs I was left with a few unanswered questions:

Why do you require a primary key? You can implement a table without primary keys as a sparse array, which can be done easily with the ordered map you already have by just using integer keys. I bring this up because there are some amazing optimizations you can do when appending to a traditional table. If you have a large batch of inserts you can actually partition the batch and create trees concurrently, then stitch them together with a modest amount of computation, but this use case is somewhat off limits when you only support tables with a primary key.
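
The rough shape of that parallel build looks something like this (buildSubtree is a stand-in so the example runs; the final stitching of boundary chunks between adjacent subtrees is elided):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

type entry struct{ key, value []byte }

// buildSubtree stands in for real tree construction: here it just hashes
// its partition. A real implementation would chunk the sorted entries and
// return the subtree's root.
func buildSubtree(part []entry) [32]byte {
	h := sha256.New()
	for _, e := range part {
		h.Write(e.key)
		h.Write(e.value)
	}
	var root [32]byte
	copy(root[:], h.Sum(nil))
	return root
}

func main() {
	// A large, already-ordered batch of inserts is cut into contiguous
	// partitions that are built concurrently.
	batch := make([]entry, 1000)
	for i := range batch {
		batch[i] = entry{[]byte(fmt.Sprintf("k%04d", i)), []byte("v")}
	}

	const parts = 4
	roots := make([][32]byte, parts)
	size := (len(batch) + parts - 1) / parts
	var wg sync.WaitGroup
	for p := 0; p < parts; p++ {
		lo, hi := p*size, (p+1)*size
		if hi > len(batch) {
			hi = len(batch)
		}
		wg.Add(1)
		go func(p, lo, hi int) {
			defer wg.Done()
			roots[p] = buildSubtree(batch[lo:hi])
		}(p, lo, hi)
	}
	wg.Wait()
	// Stitching: the boundary chunks of adjacent subtrees get re-split and
	// the partial roots merged into one tree; that step is elided here.
	fmt.Println(roots)
}
```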

What do your transaction data structures look like? I couldn’t find any docs on this particular point. I ask because I recently re-designed how these work in IPSQL (here’s a very new and incomplete doc on how they work). We’ve run into a lot of problems in other projects as we layer replication operations over large graphs. Someone has to do the traversal work, and even if you have a graph you’re comparing against in order to produce a delta, the work required to generate a delta can be quite large. This is especially a problem in databases because every mutation is going to hit the leaf of nearly every tree, which means generating a delta is always going to be a full depth traversal (with the required round trips), so even if you do it concurrently you’re going to have to eat the latency. I started to think it would be nice if the transactions could actually represent the traversal work so that anyone who had a transaction wouldn’t need to do it again.

My solution (SQL proofs) makes every transaction (read AND write) generate a data structure that describes the blocks that were required for the read and write. Re-using the same trees you’re already quite familiar with, we can create a Set() of the block addresses. These Sets can be compared against each other much, much faster than a traversal over the database graph, and you can easily commute them together or generate new ones to describe cache states. This provides much faster deltas between states, with limited traversal and potentially zero IO, because you can serialize a set and send it between nodes. You can also put the deltas themselves in a set and serialize them around; in fact, the writes in every transaction already are the representation of a delta.
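
In rough Go terms (the types here are just illustration, not IPSQL’s actual structures), the idea is to record the addresses a transaction touched and then diff sets instead of walking the graph:

```go
package main

import "fmt"

// Addr stands in for a block address (a CID or truncated hash).
type Addr string

// Proof records which blocks a transaction had to read and which it wrote.
type Proof struct {
	Reads  map[Addr]struct{}
	Writes map[Addr]struct{}
}

// diff returns the addresses present in b but not in a (e.g. the blocks a
// peer at state a would need to fetch to reach state b) without traversing
// the database graph itself.
func diff(a, b map[Addr]struct{}) []Addr {
	var out []Addr
	for addr := range b {
		if _, ok := a[addr]; !ok {
			out = append(out, addr)
		}
	}
	return out
}

func main() {
	cacheState := map[Addr]struct{}{"blockA": {}, "blockB": {}}
	tx := Proof{
		Reads:  map[Addr]struct{}{"blockA": {}, "blockC": {}},
		Writes: map[Addr]struct{}{"blockD": {}},
	}
	fmt.Println("missing reads:", diff(cacheState, tx.Reads))
	fmt.Println("new writes:  ", diff(cacheState, tx.Writes))
}
```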

mikeal commented 3 years ago

Also, I forgot to ask, why SHA2-512 instead of SHA2-256 if you’re truncating the hashes?

aboodman commented 3 years ago

I'll answer your questions but... dude I pitched prolly trees incessantly the entire time I was at PL with you.

aboodman commented 3 years ago

And certainly many many other people at protocol labs were very familiar with prolly trees.

mikeal commented 3 years ago

@aboodman dude, how did I never hear about this!!! I’ve been asking for a hash linked b-tree for 2 years; I even searched through academic papers. I even got questions from people in PL over this time like “wish i had a b-tree”, and for the last 4 months I’ve been showing slides to people about how these work, and nobody mentioned prolly trees.

I wouldn’t even know about them if @ribasushi hadn’t randomly found a link to doltdb in a hacker news thread.

aboodman commented 3 years ago

Dunno :). Ask any of {Juan, why, stebalien, david the researcher guy (sorry can't remember last name)...} why they didn't tell you.

Or consult this presentation https://docs.google.com/presentation/d/18zRxxI7plB0mJkhLPfo8KJ_f-tyIIPj9L4dZroyqEF0/edit?usp=sharing that I recorded and distributed while I was there.

phritz commented 3 years ago

I mean, it's not a mystery why people at PL don't remember.

aboodman commented 3 years ago

If you simply take the entire entry (key and value) and hash it, you get a deterministic, securely randomized binary.

This assumes good distribution of input data. For a pathological case consider a sequence of zeroes - this would likely never produce a chunk boundary if you simply take the cryptographic hash of the last item. The idea of the rolling hash is to decrease the chance of pathological cases (though they are still possible). We weren't concerned with malicious input, just expected input that got unlucky.

Now, we’re working on some use cases that have very different trust relationships than you do, so we’re actually worried about a potential attack on the tree that would aggressively blow up the block size. We’re looking to solve that with a secondary rolling hash with a linearly closing window that is only applied to the back of a chunk after it reaches a certain size. But you probably don’t need to worry about that, and a rolling hash alone won’t even solve it for you: an attacker can just as easily craft entries it knows won’t cause a split in a rolling hash as it can with an atomic hash.

We looked into this but it got super complex to reason about whether the algorithm was deterministic. It would be interesting to think through the threat model for something like a blockchain, I wish I'd had the chance to do that.

The next thing I should share is something I realized while implementing the trees. Mutations occur more often in the top of the tree, with an increasing probability from the leaves to the root where it reaches a 100% likelihood of mutation on any given change.

This is a good point. I guess the purpose would be to reduce write amplification? Typically in database systems there is a balancing act of reducing both iops and write amplification. In Noms, we picked a chunk size that was close to the block size of common hardware we wanted to run on, so reducing chunk size at upper levels of the tree would not have an effect on real write amplification, but would increase iops, so I'm not sure it's a win. But interesting idea. (Also, in practice most implementations keep the top two levels of the tree in memory.)

Why do you require a primary key?

I don't understand this question. Noms only exposes data structures like maps, sets and lists. It doesn't have a notion of a table. Maybe you are thinking of Dolt?

If you have a large batch of inserts you can actually partition the batch and create trees concurrently, then stitch them together with a modest amount of computation, but this use case is somewhat off limits when you only support tables with a primary key.

Noms includes some of these optimizations, and Dolt has some other ones. I agree there are so many fun opportunities. Check out NewStreamingMap, GraphBuilder, etc. in this repo. Dolt also recently did a blog post about optimizations they have made to unordered insertions. There are GitHub issues in this repo with early experiments we did to parallelize tree construction and splice subtrees together.

What do your transaction data structures look like?

Yeah, never documented that, sorry. Pretty much like git:

https://github.com/attic-labs/noms/blob/master/go/datas/commit.go#L31
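
Roughly, the shape is (field names here are illustrative, not the actual types in that file):

```go
package main

// Hash is a 20-byte truncated SHA2-512 digest, which is what Noms uses
// for chunk addresses.
type Hash [20]byte

// Commit is a git-like commit record: the root of the committed value,
// refs to zero or more parent commits (more than one for a merge), and
// optional metadata. See go/datas/commit.go for the real encoding.
type Commit struct {
	Value   Hash
	Parents []Hash
	Meta    map[string]string
}

func main() {}
```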

generating a delta is always going to be a full depth traversal

Prolly trees with chunk size 4k and depth 4 can store ~280TB of data. The first two levels are in memory, so you're talking about 2 iops to reach any chunk. Additionally, prolly trees are ordered specifically to make computing diffs more tractable.

In our applications, the data was already local by the time this work was happening, so it's just two reads to local disk. Perhaps your applications involve network round trips?

aboodman commented 3 years ago

You may also be interested in nbs: https://github.com/attic-labs/noms/tree/master/go/nbs

It's a blockstore specifically optimized for the needs of Noms. One of the tricks it plays is that it tends to colocate chunks into bigger "blocks" based on write time. So chunks which were written together will tend to be in the same block, reducing iops.

mikeal commented 3 years ago

This assumes good distribution of input data. For a pathological case consider a sequence of zeroes - this would likely never produce a chunk boundary if you simply take the cryptographic hash of the last item. The idea of the rolling hash is to decrease the chance of pathological cases (though they are still possible). We weren't concerned with malicious input, just expected input that got unlucky.

This isn’t an issue as long as you’re enforcing uniqueness. A lot of b-tree implementations enforce uniqueness anyway, so this isn’t a huge departure. Once the entries are guaranteed to be unique you won’t see sequential identifiers, and as long as you have a good hash function the hashes will be effectively random with respect to each other. If you need a data structure that requires duplicate keys, like an index, you just need to find something unique to the entry that you can use in a compound key. For instance, the column indexes in IPSQL are [ key, rowid ] rather than just key.
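
For example, a compound index key can be built like this (the byte layout is just for illustration, not IPSQL’s exact encoding; a real layout would length-prefix or escape the column value to preserve ordering):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// indexKey builds a compound key for a secondary index: the indexed column
// value followed by the row id. Two rows with the same column value still
// produce distinct entries, so every entry hash stays unique.
func indexKey(columnValue []byte, rowID uint64) []byte {
	key := make([]byte, 0, len(columnValue)+8)
	key = append(key, columnValue...)
	var id [8]byte
	binary.BigEndian.PutUint64(id[:], rowID)
	return append(key, id[:]...)
}

func main() {
	fmt.Printf("%x\n", indexKey([]byte("alice"), 42))
	fmt.Printf("%x\n", indexKey([]byte("alice"), 43)) // same column value, distinct key
}
```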

If you need truly duplicate keys in the tree then a rolling hash is definitely better.

I don't understand this question. Noms only exposes data structures like maps, sets and lists. It doesn't have a notion of a table. Maybe you are thinking of Dolt?

Ya, sorry, I was bouncing back and forth between these docs and doltdb so the question was actually aimed at doltdb.

Prolly trees with chunk size 4k and depth 4 can store ~280TB of data. The first two levels are in memory, so you're talking about 2 iops to reach any chunk. Additionally, prolly trees are ordered specifically to make computing diffs more tractable. In our applications, the data was already local by the time this work was happening, so it's just two reads to local disk. Perhaps your applications involve network round trips?

Good point, if you can guarantee the top of the tree is in memory this is just way less of a problem. I’m thinking of a lot of use cases where you can’t make this guarantee, or even hope for it.

I also just can’t fit as many entries in a block as you can: CIDs are rather large, and I have to be very mindful of the max block size, so an index for VARCHAR(1024) ends up having a lot of small blocks because the target rate is set very conservatively.

mikeal commented 3 years ago

You may also be interested in nbs: https://github.com/attic-labs/noms/tree/master/go/nbs

Nice! I’ll definitely check this out. One of the first things I did with these trees was use them for an append-only single file database for block storage, https://github.com/mikeal/CADB , so there are probably a lot of similarities.

aboodman commented 3 years ago

This isn’t an issue as long as you’re enforcing uniqueness.

It was important to us that different nodes would always come up with same hash for the same logical changes. So if you have a set of data, and two different nodes both insert the same value to the set, then both should come up with the same hash.

If this is not a design goal, then I agree ensuring uniqueness solves it, but in that case, why do you need this complex machinery:

we’re looking to solve that with a secondary rolling hash with a linearly closing window that is only applied to the back of a chunk after it reaches a certain size.

If you have a way to ensure uniqueness of entries, then just take the cryptographic hash of each entry and be done with it?

aboodman commented 3 years ago

I'm probably missing tons of context and I don't have time to go find it now, so don't feel the need to explain this to me :). Glad to see more people investigating these tools.

mikeal commented 3 years ago

It was important to us that different nodes would always come up with same hash for the same logical changes. So if you have a set of data, and two different nodes both insert the same value to the set, then both should come up with the same hash.

I should have been clearer: the “key” needs to be unique if it’s a map, and since the “entry” (key and value) is what gets hashed, the “entry” needs to be unique if there is no key (a Set). The same inserts in two nodes will produce the same hash, and you already have uniqueness for Maps, Sets, and Sparse Arrays. You can’t do true lists and queues very easily this way though.

mikeal commented 3 years ago

If you have a way to ensure uniqueness of entries, then just take the cryptographic hash of each entry and be done with it?

An attacker can often estimate the placement of an entry and its hash, and most entries you produce won’t cause a split, so generating and testing them is easy. The randomness of the hash doesn’t really help you because an attacker can generate data and calculate hashes easily. If you’re limiting access to the tree and you can apply a private salt to the hash, that solves it; but if you’re trying to use this in something like a public blockchain and you’re taking changes from users that produce predictable entries in the tree, then you’re open to attack.
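
For illustration, the salted split test could look like this (HMAC as the keyed hash; this is just a sketch, not what IPSQL does, and every replica needs the same salt or they will build different trees):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math"
)

// isBoundary makes the split decision with a keyed hash. Without the
// private salt an attacker cannot predict which entries will (or will not)
// close a chunk, so they cannot cheaply craft entries that all pile into
// one oversized block.
func isBoundary(salt, key, value []byte, targetEntriesPerChunk uint32) bool {
	mac := hmac.New(sha256.New, salt)
	mac.Write(key)
	mac.Write(value)
	v := binary.BigEndian.Uint32(mac.Sum(nil)[:4])
	return v < math.MaxUint32/targetEntriesPerChunk
}

func main() {
	salt := []byte("private-per-deployment-secret")
	fmt.Println(isBoundary(salt, []byte("k1"), []byte("v1"), 64))
}
```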

zachmu commented 3 years ago

Also Dolt no longer requires a primary key for tables :D

mikeal commented 3 years ago

@zachmu answers that question :) It may still say this in the docs somewhere, because I read it there within the last few weeks :)

aboodman commented 3 years ago

You can’t do true lists and queues very easily this way though.

Or Blobs. All of which we wanted to be able to model. We did consider the cryptographic hash of every item though, and it works fine, as you say, if the items are unique.