Comprehensive tracing framework

moon-chilled commented 8 months ago

customer: @mariari performer: @moon-chilled deadline: null estimated: null started: null actual: null completed: null confirmed: null dependencies:

[ ]

We should have a framework that allows for logging and viewing all messages received by an engine; also viewing all the intermediate states (either by replaying the messages, explicitly snapshotting, or a combination—periodic snapshots with replay to fill in the gaps).

Probably easiest is to just make it an exs file, though if they start to get really huge we can think about other representations.

moon-chilled commented 8 months ago

Only for engines that don't do any direct side effects (no mnesia, no network), which should be almost all of them. Ideally there would be a way to check this mechanically, but it's probably not that big of a deal.

mariari commented 8 months ago

Agreed this would be a good principled engine to have

moon-chilled commented 8 months ago

Can also store the last k states/messages even in prod to help debugging without being too intensive.

moon-chilled commented 4 months ago

More on this:

conceptual framework https://gu.outerproduct.net/debug.html; basically we want to make a query engine over a kind of ported multigraph. we have a few kinds of objects (nodes):

engine instance (includes a unique label as well as a type of engine; maybe external id and other stuff)
message (sent between engines; has contents, source/destination engine instance, unique id)
events (happen to/on an engine and represent the sending or the receiving of a message)

as well as various relations between these. it could be nice to integrate a proper graph db (datalog w/e) at some point, but elixir should be fine for the time being

moon-chilled commented 4 months ago

Also having a model of an execution means we can define (gasp) a formal concurrency model. This is a very simple axiomatic model but I was bored on the bus and sketched something. Would like to hear from @cwgoes @isheff @heindel @AHartNtkn: would this be useful for specs? If so, are these the properties we want, and are they expressed in a way which is useful for formalising and proving things in specs? In particular the intra-node ordering constraints; I am extremely hesitant both to relax and to strengthen these for the implementation. I know specs had been wanting to work with looser intra-node constraints, but I don't know if that's useful and it might be useful to instead assume some convenient properties.

Notation and conventions: $\xrightarrow{a}$ is a relation (list of pairs). $x \xrightarrow{a} y$ is a predicate equivalent to $(x,y) \in \xrightarrow{a}$. $\xrightarrow{a}^*$ is transitive closure $r$ or $r_i$ denotes only a receive event; $s$ or $s_i$ denotes only a send event. $xyz$ can be any event.

Definitions:

A relation is unique iff, for any two objects related by it, they are uniquely so related:

$$ \begin{align} unique(\xrightarrow{a}) \triangleq\ & (x \xrightarrow{a} y \land x \xrightarrow{a} z \iff y=z)\ \land \ & (x \xrightarrow{a} z \land y \xrightarrow{a} z \iff x=y) \end{align} $$

A relation is acyclic iff its transitive closure never relates an object to itself.

$$ acyclic(\xrightarrow{a}) \triangleq \nexists y, y \xrightarrow{a}^* y $$

Axioms:

Global happens before (ghb) relates all events on a global timeline. It is the union of program order, causality, and intra-node ordering (to be defined) and has no cycles. We generally only care about its transitive closure.

$$ \begin{align} & \xrightarrow{ghb} = \xrightarrow{po} \cup \xrightarrow{c} \cup \xrightarrow{ino} \ & acyclic(\xrightarrow{ghb}) \end{align} $$

Causality (c) relates the sending of a message to the receiving of that message:

$$ msg(s) = msg(r) \iff s \xrightarrow{c} r $$

A message is only sent once:

$$ msg(s_1) = msg(s_2) \iff s_1 = s_2 $$

A message must have been sent to be received:

$$ \forall r \exists s, s \xrightarrow{c} r $$

An intra-node message is received exactly once; it is neither duplicated nor dropped:

$$ \begin{align} & msg(r_1) = msg(r_2) = m \land node(dst(m)) = node(src(m)) \implies r_1 = r_2 \ & \forall s, m = msg(s) \land node(dst(m)) = node(src(m)) \implies \exists r, s \xrightarrow{c} r \end{align} $$

Sent-before (sb) relates two receive events whose corresponding send events happened in order:

$$ s_1 \xrightarrow{ghb}^* s_2 \land s_1 \xrightarrow{c} r_1 \land s_2 \xrightarrow{c} r_2 \iff r_1 \xrightarrow{sb} r_2 $$

Two receives are intra-node ordered iff they are sent-before ordered, and it's not the case that the first one was a cast and the second was a call. (Put another way: within a node, calls can be reordered before casts, but no other reorderings are allowed.) (Note: because $\xrightarrow{sb}$ was specified in terms of $\xrightarrow{ghb}^*$, it's not necessary to use $\xrightarrow{sb}^*$.)

$$ \begin{align} & r_1 \xrightarrow{sb} r_2\ \land \ & node(src(msg(r_1))) = node(dst(msg(r_1))) = node(src(msg(r_2))) = node(dst(msg(r_2)))\ \land \ & \lnot (type(r_1) = cast \land type(r_2) = call) \ & \iff r_1 \xrightarrow{ino} r_2 \end{align} $$

Program order (po) totally orders events affecting a particular engine, and does not relate events affecting different engines:

$$ \begin{align} & engine(x) = engine(y) \iff x \xrightarrow{po}^ y \lor y \xrightarrow{po}^ x \ & unique(\xrightarrow{po}) \end{align} $$

There needs to be some extra stuff here specifying 1) exactly how calls work 2) engine instances have states and engines have transition functions, and effect of receiving a message is consistent with the transition function (or whatever the latest formalism is).

Note: an actual recording framework capturing an execution trace will not have an explicit ghb or sb, which is problematic because ghb and sb are 'defined' in terms of each other. They can be straightforwardly derived with a least-fixed-point algorithm though. I think the lfp will be a valid execution witness iff one exists for the relations we do explicitly capture, but would like confirmation from somebody who knows more math than I do. (This is probably a really simple theorem I just don't know. I guess some kind of monotonicity condition on the consistency predicate is sufficient, though I'm not sure if it's necessary and I'm not sure if we have it? Acyclicity of ghb is monotonic, at least. Edit: oh, yeah, duh, obviously, literally nothing depends on sb except ghb. ghb is monotonic in sb, and the only other axiom about ghb is acyclic(ghb), which is monotonic in ghb. so indeed the desired property for lfp holds.)

AHartNtkn commented 4 months ago

Yes, this is useful for formalization. What you're describing is an Abstract Reduction System; a common topic in term rewriting (See chapter 2 of "Term Rewriting and All That", for example). It's not the typical style we've been using. Typically we've been doing things more logically (through temporal logic) rather than algebraically. What we've been interested in so far for formalization is safety and liveness proofs (see here). One can view such statements as theorems about the (infinite) history of the system as it unfolds from some set of initial states. So we have a specification like

Spec == Init /\ [][Next]_vars

in TLA+ which asserts that the Init predicate always holds of the initial state and the Next relation holds for all subsequent states. We can formulate this as a ARS by viewing Init as an identity relation relating valid initial states to themselves. The spec for possible future states would then be something like

Spec = Init ; (id ∪ Next)*

Spec, as this relation, will relate valid initial states to any state reachable by Next. And safety properties would then be theorems which hold for any y such that (∃x. x Spec y) => P y. If we view a safety predicate as an identity relation, P, relating all safe states to themselves, then, the assertion that P holds would be equivalent to the assertion that Spec ; P = Spec. Contrast this with temporal logic where safety statements are properties that hold so long as they hold for all prefixes of any (infinite) history satisfying the spec.

Liveness properties are less obvious. The temporal logic formulation states that liveness properties are those which are always reachable starting from a finite history. We might reformulate this as

δ (ρ Spec ; (id ∪ Next)* ; P) = ρ Spec

(not completely sure), where P is the identity relation relating states satisfying our liveness property to themselves, ρ is the identity relation corresponding to the range, and δ is the identity relation corresponding to the domain (See 5.219 and 5.220 in "Program Design by Calculation"). This essentially states that the reachable states which can eventually reach a state satisfying P are exactly the reachable states.

"Are they expressed in a way which is useful for formalising and proving things in specs?"

I don't believe this formation has a substantial effect on how easy proving things would be. The proofs would largely proceed in the same way. I think this is more of a stylistic change over using temporal logic. It may be useful as it's more algebraic, able to incorporate properties of relation algebra/allegories. I'm not going to say it's the way to go, but it may be a useful approach. Coincidentally, I was thinking about something similar very recently here, though completely outside the context of debugging.

The main limitation, as I see it, is that the notion of nondeterminism is perhaps narrower than we would like. If the transition is to be viewed as probabalistic, for instance, we need a different formalism. There is some work adapting relational algebra to this setting (e.g. this), but it's not very well developed.

mariari commented 1 month ago

This isn't actionable, so this should be on the research forums

cc @moon-chilled and @AHartNtkn to post it.

anoma / green

Comprehensive tracing framework #199