Faster processing of `Data` objects

effectfully commented 1 week ago

Preface

This issue is a continuation of #5777 (I'm going to call it "the previous issue") and offers a different perspective. Reading the previous issue is a prerequisite for reading this one.

The previous issue explained how it'd be beneficial to have a way of matching on Data objects by converting them to SOPs first: this would make processing of Data much more efficient. The issue previous also explaines how there doesn't appear to exist any safe and efficient way to perform the conversion. Hence here we're going to investigate alternatives.

The goal

We want to be able to express the following (pseudocode):

inspectConstr (Constr 2 [d0, d1, d2])
    [ \[v0, v1]         -> body0
    , \[v0, v1, v2, v3] -> body1
    , \[v0, v1, v2]     -> body2
    , \[]               -> body3
    ]

where inspectConstr is some function that takes a Data object and a bunch of branches (in the form of a list or an array or something), picks the right branch depending on the index of the constructor (i.e. the first argument of Constr) and applies the multi-argument function of that branch to the arguments of the constructor (i.e. what's stored in the second argument of Constr). So the example above evaluates to

(\[v0, v1, v2] -> body2) [d0, d1, d2]

where \[v0, v1, v2] is a multi-lambda binding three variables and [d0, d1, d2] is some kind of a spine (a list or an array or etc, can be the same thing as what we store the branches in, can be different). That expression is equivalent to

(\v0 v1 v2 -> body2) d0 d1 d2

and evaluates the same way.

Note that regardless of how we're going to represent this expression in the AST we want to only allow exactly-saturated applications for multi-lambdas, i.e. neither undersaturation (a multi-lambda binds more variables than the number of arguments it's applied to) nor oversaturation (a multi-lambda binds fewer variables than the number of arguments it's applied to) are allowed. The reason for this restriction is precisely what allows us to avoid all the problems described in the "What exactly breaks if we add the unsafe constrTermFromConstrData?" section of the previous issue.

Is there a straightforward way to make `inspectConstr` work?

The answer is "yes", I believe. The following pseudocode definition would cut it:

inspectConstr : forall r. array (array data -> r) -> data -> r
inspectConstr branches (Constr i args) = apply (branches ! i) (toArray args)

To make it work we need to

(1) add array to the language and not as a built-in type, because values of built-in types can only be constants but here we need to have an array of functions (2) teach the builtins machinery about array (3) add multi-lambdas and multi-applications to the language (or something along these lines) (4) allow for returning applications from within the builtins machinery

(2) and (4) are simple. (4) has a prototype implementation here and (2) is very similar to teaching builtins about SOPs, which has a prototype implementation here

(1) and 3 are hairy but doable.

The real question however is whether we can avoid doing a part of this work by piggy-backing on some existing machinery like SOPs.

Can we use SOPs to avoid extending the language as much?

Instead of having multi-lambdas and multi-applications we could use SOPs to bind a bunch of variables at once, but the problem with that is that it doesn't give us exact saturation, hence SOPs don't seem to be of any help here. In retrospect, we should've probably made SOPs exactly saturated and it's probably too late to do it now. But maybe we could make multi-lambdas play nicely with SOPs and recover exact saturation for SOPs this way, but that requires adding multi-lambdas to the language in the first place.

But let's assume we somehow have exact saturation for SOPs and the following is expressible in the AST (note that it's Term.Constr where it was Data.Constr before):

case (Term.Constr 2 [d0, d1, d2])
    [ \[v0, v1]         -> body0
    , \[v0, v1, v2, v3] -> body1
    , \[v0, v1, v2]     -> body2
    , \[]               -> body3
    ]

we still don't know how to convert a Data.Constr to a Term.Constr as per the "Typing issue" section of the previous issue.

Moreover, even if wanted to only use SOPs to pick the right branch, without applying the function in that branch, binding multiple variables etc -- just to pick the branch, we don't know how to give a type even to that built-in function. In a dependently typed system it would actually be easy:

constrTermFromConstrArgsData : (n : Nat) -> data -> SOP (replicate n [data]))

for example

constrTermFromConstrArgsData 0 : data -> SOP [[]]
constrTermFromConstrArgsData 1 : data -> SOP [[data]]
constrTermFromConstrArgsData 2 : data -> SOP [[data], [data]]
constrTermFromConstrArgsData 3 : data -> SOP [[data], [data], [data]]

At runtime this builtin would check that the given n matches the number of arguments of the given Data.Constr and either fail if it doesn't or return the SOP of the appopriate shape. This would be an entirely safe builtin to have, it would throw immediately on incorrect input and return a well-typed thing for correct input. The problem of course is that we don't have dependent types in Plutus.

We could emulate them using singletons as per the "Singletons" section of the previous issue like this:

constrTermFromConstrArgsData : forall sop. SopOfListData sop -> data -> sop

This would require us to extend the language with whatever SopOfListData is (a constructor of Type? A built-in type?), as well as teach the builtins machinery about it, at which point it's probably just easier to add arrays to the language and be done with it.

I therefore cannot think of any way to leverage SOPs for faster processing of Data objects. And it makes sense: SOPs are richly typed in that they reflect the structure of their values at the type level, but this is precisely what we don't want when we're dealing with the "untyped" Data objects that have unknown structure. Reflecting that unknown structure at the type level just to give up on it immediately after in a case expression is a wasted effort, matching directly without jumping through typing hoops is more natural and has to be more efficient -- and efficiency is what motivates this whole issue.

Summarizing, (2), (3) and (4) all seem to be necessary. (1) may or may not be replaceable with the singletons approach to constrTermFromConstrArgsData, maybe it's worth investigating that but if we're going to add spines (lists/arrays/whatever) for (3) anyway, then we can reuse them for (1) without bothering with SOPs at all.

MicroProofs commented 1 week ago

So switching gears to multi-apply and multi-lambda and multi-apply and also adding arrays as a plc term right? Just to summarize this PR.

effectfully commented 1 week ago

So switching gears to multi-apply and multi-lambda and multi-apply and also adding arrays as a plc term right? Just to summarize this PR.

Yes, so far this seems like the most plausible solution to me (just to clarify, it's an issue, not a PR, I haven't written any code for that quite yet), but others may have better ideas and I'm not entirely confident the idea is gonna work out and be worth it.

effectfully commented 1 day ago

@IntersectMBO/plutus-core please review this issue carefully and check whether there are any flaws in my reasoning or whether there's something missing. If I don't get any negative feedback, I'm going to proceed with shipping pattern matching builtins instead of the SOPs approach (see this issue for details on what the two approaches are).

IntersectMBO / plutus