Implement removal of orphan nodes

roysc commented 3 years ago

Refactors the storage of values into a separate KV store, indexed by hash(value)+key. As noted in #14 loses the benefit of deduplication, while enabling pruning of records.

Completes work from https://github.com/celestiaorg/smt/pull/24. Revises work from https://github.com/celestiaorg/smt/pull/36.

musalbas commented 3 years ago

In light of the above discussion, I'm favouring using the original PR https://github.com/celestiaorg/smt/pull/36 and changing the specification for the following reasons:

Using a path -> value store for faster gets, should actually be the responsibility of the caller, not the SMT, which should only be responsible for the tree itself. Hence, the idea of separating the state commitment library from the storage library. Adding this KV store, would cause them to be coupled, and increase complexity.
Accepting the above, adding a new path -> value store just to implement pruning is wasteful, when we can just make a breaking change to the spec.
(See discussion above.) Without introducing a breaking change or a lot of extra complexity, it doesn't seem possible to allow people to optionally disable pruning, without forcing them to use BadgerDB (the only well-known key-value-version store), or implement their own versioning, which also requires a key-value store implementation that supports efficient range queries. This is not desirable.
Structuring the tree so that values don't have ambiguity to which leaf they belong to for implementation purposes, seems like something that should be in the spec anyway.

adlerjohn commented 3 years ago

Adding a summary of some nuance discussed on Discord.

More important than breaking the specs (which is unimportant in the grand scheme of things) the real issue is that the current hash(value) -> value map, along with the key || hash(value) -> value map that replaces it in this PR (and the equivalent in #36), are redundant to the key -> value map that will be used in ADR-40 to allow for constant-time leaf reads. Redundant = space overhead that must be paid by all users of the library, not just users that want historical queries.

In other words, regardless of pruning orphaned nodes, this abstraction must be reworked from the current assumption that a single KV exists for both internal nodes and leaves to the assumption that two KVs exist, one for internal nodes and one for leaves. This second abstract coincidentally aligns exactly with the requirements for constant-time reads, and thus does not require any redundancy.

A suggested alternative would be to store key -> hash(value), then hash(value) -> value, adding a level of indirection. But this would then require reads to do two sequential disk accesses instead of one to get a leaf in constant time, which would be an enormous overhead.

Unfortunately, removing redundancy/this mandatory overhead means it becomes impossible to support historical queries, and pruning orphaned nodes must be enabled at all times.

roysc commented 3 years ago

This may be a silly question, but what about just removing Get value access entirely? The tree could just store its internal structure and value hashes, assuming only the responsibility of proof generation. That would completely avoid redundancy and leave value storage and versioning up to the caller.

tzdybal commented 3 years ago

This may be a silly question, but what about just removing Get value access entirely? The tree could just store its internal structure and value hashes, assuming only the responsibility of proof generation.

This will completely change how this library works. Right now it's a self-contained, general purpose Merkle Tree. Removing Get is a fundamental change, meaning that library can't be used "alone", without external store for values.

That would completely avoid redundancy and leave value storage and versioning up to the caller.

This will only push the problem up to the caller ;)

I agree with @adlerjohn that two KV stores should be used in library (which of course can be implemented as just prefix-buckets in single database).

roysc commented 3 years ago

This will only push the problem up to the caller ;)

Well, that's the idea. The library user can optimize for whatever they need, and the library doesn't have to try to solve for all cases. You could still provide a self-contained tree map that includes value access via one of the methods discussed. Anyway, just a thought.

musalbas commented 3 years ago

I agree that given a trade-off between duplicating data storage, remove the Get functionality, and removing the ability to access old state roots, the latter is the least undesirable.

If there's no other options, I'm in favour of having two KV stores, where one of the KV stores is a path -> value store, and dropping the functionality of accessing old state roots.

tac0turtle commented 3 years ago

What is the timeline for the proposed changes. I would like to coordinate an audit of the library soonish.

roysc commented 3 years ago

@marbar3778 changes are pushed :) I will tag the owners for re-review.

musalbas commented 3 years ago

The core PR LGMT, however there is some other things that should be do as a result of this PR:

Update the example in the readme.
Remove the API to get old roots.
Update Get so that it reads from the path->value store directly, instead of traversing the tree.

adlerjohn commented 3 years ago

I'll handle these in follow-up PRs so as to unblock @roysc. Good work on this PR!

celestiaorg / smt

Implement removal of orphan nodes #37