agentm / project-m36

Project: M36 Relational Algebra Engine
The Unlicense
899 stars 48 forks source link

Implement "lazy commits" #56

Open agentm opened 7 years ago

agentm commented 7 years ago

Introduction

An commit to the database graph is lazy if, instead of committing the result of an update to disk, we commit the command ("the diff") to the disk. This would allow essentially any update to be committed immediately without delay as long as the potential results would not violate database constraints.

This is possible in Project:M36 because all updates are pure/idempotent and therefore can be applied later to create the exact state which would have otherwise been created. The direct analogy to Haskell would be laziness with some kind of checkpoints (such as STM).

Instantaneous (lazy) transaction commits have the potential to offer Project:M36 a strong competitive advantage.

Constraint Violations

The potential constraint violation is key here. If we can prove statically that no constraints can be violated, than we do not need to execute the update.

For example, if an update references a relvar "X" and no constraints refer to relvar "X", then we tautologically know that the update cannot violate the constraints and can be lazily evaluated later (if necessary).

In the worst case, Project:M36 can evaluate the update expression, discard the results (or save them for later), and commit the diff if committing the diff would be faster than writing the whole results. In many cases where changes exceed the size of the update expression, the size difference would be large and the choice obvious.

Deferred Execution

Lazy commits need not only be in place on the head of a branch, but can also be stacked, as long as each layer of the stack would not violate constraints. This is correct because multiple commits would be effectively equivalent to the update expressions concatenated in one commit (squashed commits) because they are all idempotent.

Eventually, or when necessary or convenient (such as during lower usage times), lazy commits can be replaced with their evaluated equivalents. Still, the lazily-committed transaction is completely crash-safe specifically due to the idempotence.

The research needed to complete this implementation surrounds the question of when an update violates a constraint. For example, if a constraint specifies that all integer values "a" in relvar "X" must be less than "5" and the update adds "1" to all values "a" in relvar "X", then an obvious optimization is to look for the highest value in the values "a" and add "1". If that value is greater than "5", then we can reject the update right away. Note that this is a general constraint-check violation which can be used independently of lazy commits, but the value of lazy commits increases if we don't actually manually need to execute the inclusion dependencies' relational expressions. In this case, the optimization works because the data types has a clear ordering, but other data types may not.

More research is also needed to accommodate user-defined and algebraic data types.

It would be tremendously valuable to find an existing mathematics describing these sorts of optimizations instead of attempting to tediously implement a guessed subset of possible optimizations.

To recap, in the worst (default) case, we can execute the update in a database context, validate the constraints, and, if the constraints are not violated, commit the diff instead of the actual changed relvars and their tuplesets. However, that same result could be used shortly thereafter to replace the lazy commit with the actual results. Still, this worst case approach would allow for a commit time which is merely the time to execute the update, validating the constraints, and commit the diff rather than execute the update, validating the constraints, and committing the new relvars. With proper static optimizations in place for constraint validation, the commit time could even be O(1).

Conclusion

A relational algebra engine can make many more interesting optimizations if the transactions can be purely evaluated. One of these optimizations are lazy commits wherein a transaction can be written to disk with potentially zero evaluation resulting in an instantaneous commit. The cost of the transaction can be deferred and users then need not actually wait on the results of an expression to be evaluated in order to commit to disk.

agentm commented 7 years ago

Another rather obvious optimization would be to detect if an inclusion dependency operates independently on tuples (no aggregate functions or cross-relvar references) and, if so, run the constraint checker against a relation built from the updated tuples only instead of all the relvar's tuples.

This optimization could likely be easily extended to cover the exceptional cases mentioned above. For example, if updates were issued across multiple relvars, it would be natural to create a sub-schema containing only the relevant relvars (including those mentioned in the inclusion dependencies) and then run the constraint checker against that sub-schema.