Open raykyri opened 7 months ago
alright, I think I finally understand this, so let me summarize and tell me if it's right so far
context/motivation: `db.get` has to be deterministic, which means it must only use the effects of the current message's transitive ancestors, and nothing else.

proposed alternative: `db.get(model, key)` identifies the set of "concurrent" values for the key, i.e. the values from all the actions in the set of transitive ancestors that wrote (set or delete) to that key and were not subsequently overwritten by a more recent ancestor. Those values are reduced with the `$merge` function, and the result is returned from `db.get`. If the set is empty, `db.get` returns `null`.

and then the branch indexing thing is a way to implement `getConcurrentValues(messageId, model, key)`
yes that's exactly it
it's potentially expensive in a world with many concurrent merge branches. it may be the case that we want to "settle" the results of the merge function back to the log by making the equivalent of a git merge commit.
ok this makes sense to me
EDIT: disregard this entire comment, this is not sound unfortunately :(
Bringing the design here a little closer to code: we want to implement an internal utility method `getConcurrentValues`, which will allow us to do the "lazy merging" described here.
```ts
declare function getConcurrentValues(
  parents: string[],
  model: string,
  key: string,
): Iterable<{ messageId: string; value: ModelValue }>
```
When an action handler calls `db.get` for a model with a merge function, we iterate over the result of `getConcurrentValues(parents, model, key)`, reducing the values using the user-provided `merge`. This means `AbstractRuntime.getModelValue` will get factored into two "cases": the current one, which implements last-write-wins, and a new one for merge reduction.
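A rough sketch of what that factoring could look like (the types and the `lastWriteWins` callback here are placeholders of mine, not the actual runtime signatures):

```ts
type ModelValue = Record<string, unknown> | null
type Merge = (a: ModelValue, b: ModelValue) => ModelValue

// placeholder for the entries yielded by getConcurrentValues
type ConcurrentValue = { messageId: string; value: ModelValue }

function getModelValue(
  concurrentValues: Iterable<ConcurrentValue>,
  merge: Merge | undefined,
  lastWriteWins: () => ModelValue,
): ModelValue {
  // case 1: no merge function declared, keep the existing
  // last-write-wins resolution
  if (merge === undefined) return lastWriteWins()

  // case 2: merge reduction over the set of concurrent values
  let result: ModelValue = null
  let first = true
  for (const { value } of concurrentValues) {
    result = first ? value : merge(result, value)
    first = false
  }

  // null if the set of concurrent values was empty
  return result
}
```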
The main problem is how to implement `AbstractRuntime.getConcurrentValues`. One way to do this without adding additional indices is to use a `${model}/${keyHash}` key in the `$effects` table as a mutable cache of concurrent message IDs (`string[]`).
Whenever a record is set or deleted, we look up its current concurrent effect IDs:
```ts
const effectIDs: string[] = db.get("$effects", `${model}/${keyHash}`)
```
Then we filter out any that are ancestors of the current message, and add the current message ID.
```ts
// context: ExecutionContext
const { parents } = context.message

// keep only the effect IDs that are *not* ancestors of the current
// message, i.e. not a parent or an ancestor of one of its parents
const filteredEffectIDs = effectIDs.filter(
  (id) => !parents.some(
    (parent) => parent === id || context.txn.isAncestor(parent, id)
  )
)

db.set("$effects", {
  key: `${model}/${keyHash}`,
  value: [...filteredEffectIDs, context.id],
})
```
Then, to implement `AbstractRuntime.getConcurrentValues(parents, model, key)`, we can do:
```ts
function getConcurrentValues(
  context: ExecutionContext,
  parents: string[],
  model: string,
  key: string,
) {
  const keyHash = AbstractRuntime.getKeyHash(key)
  const concurrentEffectIDs: string[] = db.get("$effects", `${model}/${keyHash}`)

  // keep only the effect IDs visible from the current message,
  // i.e. those that are a parent or an ancestor of a parent
  return concurrentEffectIDs.filter((id) =>
    parents.some((parent) => parent === id || context.txn.isAncestor(parent, id))
  )
}
```
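As a sanity check, here is a toy in-memory version of the cache update on a small diamond-shaped graph (the graph, `isAncestor`, and `applyWrite` below are my own stand-ins for the runtime and the `$effects` table, not real APIs):

```ts
// toy causal graph, mapping each message to its parents:
//   a ── b ─┐
//    └─ c ──┴─ d
const graphParents: Record<string, string[]> = {
  a: [],         // a is the root
  b: ["a"],      // b and c are concurrent branches off a
  c: ["a"],
  d: ["b", "c"], // d merges b and c
}

function isAncestor(descendant: string, ancestor: string): boolean {
  // walk the parent links; fine for a toy graph of this size
  return graphParents[descendant].some(
    (p) => p === ancestor || isAncestor(p, ancestor)
  )
}

// the $effects cache entry for one `${model}/${keyHash}` key
let effectIDs: string[] = []

function applyWrite(id: string) {
  const parents = graphParents[id]
  // drop effect IDs that are ancestors of the new message...
  const filtered = effectIDs.filter(
    (e) => !parents.some((p) => p === e || isAncestor(p, e))
  )
  // ...and record the new write
  effectIDs = [...filtered, id]
}

applyWrite("a") // effectIDs: ["a"]
applyWrite("b") // "a" is an ancestor of "b", so: ["b"]
applyWrite("c") // "c" is concurrent with "b":  ["b", "c"]
applyWrite("d") // "d" overwrites both:         ["d"]
```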
We want to allow contract writers to use CRDTs when writing applications, so that they can implement collaboratively mutable data structures like:
To support CRDTs, we could implement them individually in the model framework, but a more general solution would be to add a `$merge` function that is called whenever a conflicting set of writes is seen in the database table.

The merge function takes two (and perhaps more) arguments, each of which represents the row that was edited concurrently along different causal paths. It merges those rows together and returns the merged row, which gets written to the database.
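For instance, a contract author might resolve concurrent edits by keeping the larger counter value (the row shape here is a hypothetical example of mine; only the `$merge` name comes from this proposal). Since there may be more than two unmerged branches and the function gets applied recursively, it should be associative and commutative so the result is independent of merge order:

```ts
type CounterRow = { id: string; count: number }

// merges two rows that were edited concurrently along different
// causal paths; max is associative and commutative, so applying it
// pairwise in any order yields the same merged row
const $merge = (a: CounterRow, b: CounterRow): CounterRow => ({
  id: a.id,
  count: Math.max(a.count, b.count),
})
```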
To figure out the arguments to `$merge`, we need to detect whether there are conflicting/unmerged branches in the causal log (relative to the record that is returned when we call `db.get()`) and fetch the most recent write to that database row. Note that there may be zero, one, or many unmerged branches, but we can just recursively call the merge function to support multiple unmerged branches.

### Branch Indexing
First, we keep track of the number of parallel branches of history at any given time. We do this by keeping a `branch` counter, which starts at 0 and is incremented every time an action is added and any of its parents already have children. We annotate each message in the GossipLog store with the index of the branch that it's on.

Whenever the causal log branches, the `branch` counter is incremented by 1, and nodes on the newly forked branch are assigned the new largest value of the `branch` id, which looks like this:

Alternatively, if an action `A_x` at time `x` doesn't create a new branch, then it just inherits the min of the indexes of its parents.

The branch counter is only tracked locally -- different nodes may have different branch indexes for the same message, depending on the order in which they receive operations. (Branch indexes are never reused, so as the log fills, the branch index will grow larger.)
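A minimal sketch of this assignment rule (my own toy code, not GossipLog internals):

```ts
let branchCounter = 0

const branchIndex: Record<string, number> = {}
const childCount: Record<string, number> = {}

function assignBranch(id: string, parents: string[]): number {
  // does any parent already have a child? then this action forks
  const forks = parents.some((p) => (childCount[p] ?? 0) > 0)

  for (const p of parents) childCount[p] = (childCount[p] ?? 0) + 1

  if (parents.length === 0) {
    branchIndex[id] = branchCounter // the root starts on branch 0
  } else if (forks) {
    branchIndex[id] = ++branchCounter // new branch: next largest index
  } else {
    // no fork: inherit the min of the parents' indexes
    branchIndex[id] = Math.min(...parents.map((p) => branchIndex[p]))
  }
  return branchIndex[id]
}

// a ── b ─┐
//  └─ c ──┴─ d
assignBranch("a", [])         // branch 0 (root)
assignBranch("b", ["a"])      // inherits branch 0
assignBranch("c", ["a"])      // "a" already has a child: new branch 1
assignBranch("d", ["b", "c"]) // merge point: inherits min(0, 1) = 0
```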
Second, we keep track of the maximum branch index at each clock value. This can be done when each message is added to the log: we just take the `max` of the newly added message's branch index and the current maximum branch index.

Third, we also need to keep track of where branches were merged into another branch, since it's possible that if we're querying for an action `A_t` at time `t` on branch `B'`, the most recent write to the model actually happened on another branch `B` that merged into branch `B'`. So, whenever a branch is merged (i.e. an action is created with parents with multiple branch indexes), we keep a record of the merge. Later, recursively traversing this graph will allow us to get the value of the model at clock time `T`.
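These two pieces of bookkeeping might look roughly like this (the field and function names are mine, chosen for illustration):

```ts
// second: the maximum branch index seen at each clock value
const maxBranchAtClock: Record<number, number> = {}

function recordMessage(clock: number, branch: number) {
  maxBranchAtClock[clock] = Math.max(maxBranchAtClock[clock] ?? 0, branch)
}

// third: a record of each merge point, i.e. each action whose
// parents span multiple branch indexes
type MergeRecord = { id: string; clock: number; source: number; target: number }
const merges: MergeRecord[] = []

function recordMerge(id: string, clock: number, parentBranches: number[]) {
  // the merging action lands on the min of its parents' branches
  const target = Math.min(...parentBranches)
  for (const source of parentBranches) {
    if (source !== target) merges.push({ id, clock, source, target })
  }
}
```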
### Querying
Lazy merging: We call the merge function lazily - otherwise, we would have to scan each model store in its entirety for unmerged pairs of actions every time a message is received with multiple parents. (A different approach to merge functions might implement eager merging, where we efficiently diff the model database when an action with multiple parents is published, but this requires implementing a Merkle index or Prolly tree at the model table level, which is out of scope for this exercise.)
To decide which model values are provided as arguments to the merge function, we have to compute what the model value `M_x` was for each parent `A_x` of the message.

We can do this by looping over each parent:

- Find the most recent write to `M_x` on the branch of the parent. This is now the lower bound `b` for further queries: we are only looking for other branches that could possibly have written to `M_x` more recently than `b`. (Note that `b` might be undefined, if no write to `M_x` occurred on the branch.)
- Find the branches that merged into the branch of `A_x` after `b`; we call this the parent set.
- Recursively search the branches in the parent set for writes to `M_x`. For any values of `M_x` we find on these parallel branches, add them to the list of arguments to the merge function. (We can stop recursing once a write to `M_x` was found on the branch that merged into the parent.)

This allows us to collect all the values of `M_x` from branching topologies that look like this:

The query algorithm should produce a set of model values `M_x1, M_x2, ...`. Deduplicate these values (by message ID) and pass them to the merge function.
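That final dedupe-and-reduce step could be sketched as follows (the `Merge` type stands in for the user-provided `$merge`):

```ts
type Row = Record<string, unknown>
type Merge = (a: Row, b: Row) => Row

function mergeResults(
  results: { messageId: string; value: Row }[],
  merge: Merge,
): Row | null {
  // deduplicate by message ID: the same write can be reached
  // along multiple causal paths through the branch graph
  const seen = new Set<string>()
  const unique = results.filter(({ messageId }) => {
    if (seen.has(messageId)) return false
    seen.add(messageId)
    return true
  })

  if (unique.length === 0) return null

  // "recursively call the merge function" as a pairwise reduce
  return unique.map(({ value }) => value).reduce((a, b) => merge(a, b))
}
```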