[CLOSED] Implement prototype of streaming delta hash

aionbot commented 5 years ago

Issue created by jeff-aion (on Monday Oct 22, 2018 at 19:27 GMT)

Our KeyValueObjectGraph currently implements a binary Merkle tree of the individual objects stored in the graph, but does so as a piece of data "on top" of the data store, since there is no actual relationships upon which one could build the Merkle tree, lower in the stack.

This current approach is incredibly heavy-weight and inefficient and, while a better implementation is possible (the existing implementation is just a demonstration so it has some rather naive edges), it really isn't a good fit for our storage model.

As a counter-point to this, and a way of building something concrete as a talking-point, we should also implement the "streaming delta hash" described here: https://aionnetwork.atlassian.net/wiki/spaces/AVM/pages/114852163/Object+Graph+Hash

Both of these implementations should be able to coexist, for the time being, gated by a static flag constant. This will allow a concrete example of the dilemma and also makes comparisons testable in a very concrete way.

aionbot commented 5 years ago

Comment by jeff-aion (on Monday Oct 22, 2018 at 19:55 GMT)

Since this is based on needing to know the hash of an object before and after a change, it makes sense to do this after #268 since that item requires managing before-and-after state.

aionbot commented 5 years ago

Comment by jeff-aion (on Friday Oct 26, 2018 at 16:06 GMT)

We want to create this hash on a logical level (so the data store implementation doesn't matter) but this requires answering 2 questions:

How do we pack primitives into a hashable piece of data?
How do we encode object references into a hashable piece of data?

Here is a proposal we can use as a talking-point: An object instance will be serialized into a byte[], for the purposes of logical hashing (meaning we need to do this for the before and the after) using the current abstract IObjectSerializer to produce a byte-packed representation of primitives (big-endian encoded multi-byte values) and references (encoded as an int - 0 for null and identity hash for all other instances) in the order they are declared (also means that super-class fields are encoded before sub-class fields). Until we determine a real hash algorithm to use, we will return the i32 value from Arrays.hashCode(byte[]).

In terms of implementation, we still want to bill for loading the instance when the user attempts to read this primitive hash but we want our own code to be able to see it (when serializing a reference) for free. This is really the core complexity of the implementation.

Beyond that, the only difficulty should be in collecting the data according to that packed structure (since most of our representations actually split primitives and references - we might want to change the proposal to describe that segregated structure although it seems like a special case).

I will start prototyping to get a sense of whether or not a different encoding might make more sense.

aionbot commented 5 years ago

Comment by jeff-aion (on Friday Oct 26, 2018 at 21:32 GMT)

As it stands, we currently use a segregated encoding through basically the entire system, and this is actually done for a reason which implies any implementation would probably want to do this: references have structural value for things like GC operations or introspection of the data.

For this reason, I will proceed with this prototype as described above, except that all references will be serialized before any of the primitives, in the resulting byte[] used to compute the hash.

This is unrelated to the serialized form a given data store might want to use, internally, but should be easy to maintain for these purposes.

aionbot commented 5 years ago

Comment by jeff-aion (on Tuesday Oct 30, 2018 at 19:42 GMT)

Interestingly, this is incurring much more formalization through the stack, which is a good thing. While using a 4-byte big-endian identity hash for normal instances would work, constants and classes require further consideration.

Constants currently use a negative long instanceId, which is technically a bleeding through of the implementation details of our initial data store implementation (but should now be isolated within KeyValueObjectGraph). Instead, we should convert those into having a 1-indexed identity hash. This can work both as the logical reference value but the KeyValueObjectGraph can also use this as a way to derive the negative long it currently expects.

Classes are a different story, since they are special in that we intern them per-DApp but they are still normal instances within that namespace (whereas constants are allocated outside of any DApp). This means that we will handle them as normal instances.

A question we should keep in mind is whether or not to this logical representation should reflect the billable size since we probably want to move those into a shape where they match (it would be one less special case in the specification).

aionbot commented 5 years ago

Comment by jeff-aion (on Tuesday Oct 30, 2018 at 20:04 GMT)

The billable size mentioned in the previous comment isn't a concern, as shown in 96ddb04.

aionbot commented 5 years ago

Comment by jeff-aion (on Tuesday Oct 30, 2018 at 21:23 GMT)

Note that Class objects do not increment the nextHashCode of the Helper and the identity hash we use for these instances is actually the hash of the class name, so that will also be a well-defined general case for this hash.

Added a comment to the document about this item (https://aionnetwork.atlassian.net/wiki/spaces/AVM/pages/114852163/Object+Graph+Hash) to reflect the idea we had that we might be able to increase the security of the delta hash approach by using a tuple of hashes, instead of just one. May be worth adding if this concern persists and we believe that approach will produce a real increase in security.

aionbot commented 5 years ago

Comment by jeff-aion (on Thursday Nov 01, 2018 at 17:16 GMT)

Something to keep in mind with this is that the storage hash can only be observed once the entire transaction has completed. This actually means that there are 2 different ways of handling the re-entrant case:

Handle it along-side where we handle the billing decision (if we detect that it changed, and want to bill for a write, also update the hash). This does mean that the parent call will need to delta against what the child call "wrote", although that information should probably already be available for making the top-level write decision.
Handle it only when we write-back at the top-level. This would be faster but is a special-case. Note that we would still need to capture the hash of writes of new objects from child calls which don't survive parent calls (although that is required, in general).

aionbot commented 5 years ago

Comment by jeff-aion (on Friday Nov 02, 2018 at 14:07 GMT)

At least until #296 is implemented, I will proceed with using suggestion (1). To this end, I will wrap the existing IStorageFeeProcessor in one which also computes and updates the delta hash. While this might seem like an odd fit, the write* aspects of this interface need to be touched in the same set of places where the delta hash would be computed, so this allows us to pass just one object through, which can handle both cases, without needing duplication at every callsite.

In the future, if we can transition to suggestion (2), this will probably be untangled to compute the delta hash at the top-level in a more out-of-line way.

aionbot commented 5 years ago

Comment by jeff-aion (on Friday Nov 02, 2018 at 14:18 GMT)

On second thought, the IStorageFeeProcessor interface is the right idea but the wrapping interface needs a little more information (writes need before-and-after data, not just the total size).

I will create an interface (or class) which more explicitly wraps these cases together.

aionbot commented 5 years ago

Comment by jeff-aion (on Friday Nov 02, 2018 at 14:48 GMT)

I keep going back-and-forth on this one but the hand-off point for the delta hash, during reentrant calls, is actually pretty complicated and, I suspect, will be easy to get wrong or will be brittle in the face of any future changes (since it requires making strong assumptions around when the hash is updated by various invokes in the stack).

In light of this, it actually does make more sense to go with option (2) and implement this within the IObjectGraphStore. It also means that certain read caching decisions can be made more effectively (avoiding going back to the underlying DB to diff a representation to check for changes) and avoids needing to build a richer interface around the GC operation (since it can fully calculate the hash, internally).

I had originally thought that this would complicate some cases of how we capture the hash of objects which callees leave reachable but callers orphan, prior to write-back, but realized that this handling is required for that, anyway: we don't just need to capture the hash, but we actually do need to write this to the DB so that the hash consistently represents the actual state of the storage (even though the object, itself, is written once we already know it is unreachable). Note that this problem will be generally solved under #296, as I currently suspect it is incorrectly discarded.

aionnetwork / AVM

[CLOSED] Implement prototype of streaming delta hash #291