Closed roysc closed 3 years ago
In light of the above discussion, I'm favouring using the original PR https://github.com/celestiaorg/smt/pull/36 and changing the specification for the following reasons:
Adding a summary of some nuance discussed on Discord.
More important than breaking the specs (which is unimportant in the grand scheme of things) the real issue is that the current hash(value) -> value
map, along with the key || hash(value) -> value
map that replaces it in this PR (and the equivalent in #36), are redundant to the key -> value
map that will be used in ADR-40 to allow for constant-time leaf reads. Redundant = space overhead that must be paid by all users of the library, not just users that want historical queries.
In other words, regardless of pruning orphaned nodes, this abstraction must be reworked from the current assumption that a single KV exists for both internal nodes and leaves to the assumption that two KVs exist, one for internal nodes and one for leaves. This second abstract coincidentally aligns exactly with the requirements for constant-time reads, and thus does not require any redundancy.
A suggested alternative would be to store key -> hash(value)
, then hash(value) -> value
, adding a level of indirection. But this would then require reads to do two sequential disk accesses instead of one to get a leaf in constant time, which would be an enormous overhead.
Unfortunately, removing redundancy/this mandatory overhead means it becomes impossible to support historical queries, and pruning orphaned nodes must be enabled at all times.
This may be a silly question, but what about just removing Get
value access entirely? The tree could just store its internal structure and value hashes, assuming only the responsibility of proof generation. That would completely avoid redundancy and leave value storage and versioning up to the caller.
This may be a silly question, but what about just removing
Get
value access entirely? The tree could just store its internal structure and value hashes, assuming only the responsibility of proof generation.
This will completely change how this library works. Right now it's a self-contained, general purpose Merkle Tree. Removing Get
is a fundamental change, meaning that library can't be used "alone", without external store for values.
That would completely avoid redundancy and leave value storage and versioning up to the caller.
This will only push the problem up to the caller ;)
I agree with @adlerjohn that two KV stores should be used in library (which of course can be implemented as just prefix-buckets in single database).
This will only push the problem up to the caller ;)
Well, that's the idea. The library user can optimize for whatever they need, and the library doesn't have to try to solve for all cases. You could still provide a self-contained tree map that includes value access via one of the methods discussed. Anyway, just a thought.
I agree that given a trade-off between duplicating data storage, remove the Get functionality, and removing the ability to access old state roots, the latter is the least undesirable.
If there's no other options, I'm in favour of having two KV stores, where one of the KV stores is a path -> value store, and dropping the functionality of accessing old state roots.
What is the timeline for the proposed changes. I would like to coordinate an audit of the library soonish.
@marbar3778 changes are pushed :) I will tag the owners for re-review.
The core PR LGMT, however there is some other things that should be do as a result of this PR:
I'll handle these in follow-up PRs so as to unblock @roysc. Good work on this PR!
Refactors the storage of values into a separate KV store, indexed by
hash(value)+key
. As noted in #14 loses the benefit of deduplication, while enabling pruning of records.Completes work from https://github.com/celestiaorg/smt/pull/24. Revises work from https://github.com/celestiaorg/smt/pull/36.