Discussion: Batched Anchor Data Structure

oed commented 3 years ago

Discussion for CIP-69 See PR for latest spec: #74

stbrody commented 3 years ago

@oed, can you elaborate on what exactly your question is about building the merkle tree when there are different numbers of leaf nodes?

Beyond that, I'd like to see a bit more about what exactly the entries in the bloom filter look like/how they are built. Your code example actually has that part commented out and unspecified. Are you imagining that the tags, schema, controllers, and docId all get encoded into one entry somehow? Or does one ceramic document create like 4-10 different entries into the bloom filter? If the latter, do you plan to add anything into how these entries are encoded to allow distinguishing between entries from the different input types (distinguishing an entry that is a CID based on whether it was a main docID, a schema reference docId, or a tag, for example).

Lastly, I'm a little confused as to how one would actually take advantage of the sortedness of the merkle tree, given that when you're at an intermediate node in the tree, you don't actually have any information encoded in that node about what portion of the sorted keyspace it represents, so you'd have no way to actually decide which branch of the tree to follow to get closer to the set of documents you are looking for, unless I'm missing something?

oed commented 3 years ago

can you elaborate on what exactly your question is about building the merkle tree when there are different numbers of leaf nodes?

Basically if the number of nodes is not a power of 2 then the tree will not be symmetrical. We should however always form the tree in a deterministic way, so which algorithm should we use for that? cc @simonovic86

Beyond that, I'd like to see a bit more about what exactly the entries in the bloom filter look like/how they are built.

It's an array of strings, updating the first post. Each document will create multiple entries in the filter.

If the latter, do you plan to add anything into how these entries are encoded to allow distinguishing between entries from the different input types

Good point! We need a way to distinguish between DocIDs and Schema-DocIDs at the very least!

Lastly, I'm a little confused as to how one would actually take advantage of the sortedness of the merkle tree,

Yeah since there are more than one tag this is not optimal I think. Basically if you know that a tag is the first tag of a document then you can do a binary search to effectively find the update. However, if you only know some other info you can't really take much advantage of this. Might be better options here so open to suggestions!

simonovic86 commented 3 years ago

@oed @stbrody we already cover the case of odd number of leaves. This is the standard way of constructing merkle trees if I'm not mistaken. Here's the test case: link.

stbrody commented 3 years ago

Yeah, the algorithm for building a balanced binary search tree is pretty standard and straightforward (I used to ask it as an interview question :P), I believe the only part of it that could be different implementation by implementation is when you have an even number of nodes whether you take the rightmost node in the left half as the root, or the leftmost node of the right half. Example code here. As long as we're consistent though it should be fine either way I'd think.

@oed I see you updated the description above to include prepending "tag-" and "schema-" to those bloom filter records - I'd suggest we do the same for the other entries too ("controller-did-" and "DocId-") to future-proof ourselves in case we want to add additional DIDs or DocIds into the bloom filter in the future. Adding characters to the size of the entries in the bloom filter is free, since the actual entry strings never make it into the filter, just the hashes of them do.

My concern about doing binary search on the tree is about more than having access to the right information of the document you're looking up. Even if you have all the fields of the document you are looking for, I'm still not sure how you could actually use that to search in the tree. The problem is that at the non-leaf nodes of the tree, you only have the CIDs of your left and right sub-tree nodes, the tree itself has no information about the contents of the documents covered by each subtree. To use the sortedness property of the tree, the non-leaf nodes would need to also contain information about which range of the overall sorted keyspace they cover

oed commented 3 years ago

Created a PR for this CIP, updated the spec and made this into the discussion thread. See #74 Note that this introduces the topic property in the metadata of documents, which is currently not implemented anywhere. Some work remains to add this in various places.

ceramicnetwork / CIPs

Discussion: Batched Anchor Data Structure #69