IntersectMBO / plutus

The Plutus language implementation and tools
Apache License 2.0
1.57k stars 479 forks source link

Allow for retrieving the index of a SOP constructor #6653

Open effectfully opened 2 weeks ago

effectfully commented 2 weeks ago

Helios folks propose to introduce a builtin allowing one to get i from Term.Constr i args.

Their reasoning is that sometimes one only needs to check whether the constructor is a specific one, instead of providing a branch for every single constructor. E.g. if you have a big sum type like

data D = C1 | C2 | C3 | C4 | <...> | Cn

then you can define equality checking for such a type in linear amount of matches in, say, Plinth:

C1 == C1 = True
C2 == C2 = True
<...>
Cn == Cn = True
_  == _  = False

but on the UPLC level this will give you a quadratic number of branches, because for each Ci to the left of == you'll need to explicitly handle all of C1 ... Cn, because a case has to explicitly handle all possible constructors and there's can't be a catch-all clause.

There's a number of issues with the proposal, though. Firstly, what if those C* constructors carry some arguments? How will we extract them out of there if we only know the index of the constructor?

Secondly, such a builtin would break parametricity, it's the same issues as we relentlessly discussed in the equalsCanonical PR and in #5776 + #6225. I.e. we've already given up on important things because of the parametricity issue, so it's unlikely we're going to revert on that just for the purpose of having a builtin returning the index of a SOP.

Now given that SOPs are basically on par with builtins and we seem to have a plan on making Data as fast as SOPs, do we even need to worry about SOPs at this point? Maybe it'll turn out that just having fast Data is enough? Sure you can't put functions into Data and such, but does anybody actually care about that? Anyway, I digress.

What we could have is case with a default clause. That'd solve the equality issue described above. Helios folks claim it's not enough in the general case though:

The default case would solve the issue for some common patterns, but having access to the index is still needed for the most extreme cases

so naturally I wonder what those extreme cases are.

But let's say that instead of introdcing a basic default clause, we introduce a fancy default clause giving you access to the index of the constructor. It's gonna be a bit weird, because the AST doesn't know anything about integers, but let's say we'll pull it off somehow. Would it be good or weird? It's pretty much reflection at this point and UPLC wasn't designed with any reflection capabilities in mind, so maybe that'll screw the metatheory up somehow?

Anyway, I'll mark this as low-priority, because speeding up Data is far more important, but now that Helios folks told us about this problem I do find the lack of default clauses unfortunate, so maybe we should do something about it.

christianschmitz commented 2 weeks ago

@effectfully Thanks for taking the time to write this up. We are very appreciative of the time you have personally spent to help bring more utility to SOPs.

It's fair that you mark this issue as low priority, making Data faster will indeed provide much more utility.

I wouldn't completely write off SOPs yet though, as they do have their use-case: efficient multiple branches.

On X I mentioned two issues that could be solved with a builtin that can retrieve the case index:

  1. comparing the index
  2. falling back to nested ifThenElse calls when combining multiple SOPs

1. comparing the index

Without such a builtin function, it will be easier to keep using Data.

equalsInteger(fstPair(unConstrData(a)), fstPair(unConstrData(b))), which is constant time, thus much better than even the linear time Plinth example you gave above.

2. combing multiple SOPs

This is easier to discuss with an example, typically seen in real-life Helios code.

Assume we the following enum:

enum State {
  A {<fieldsA>}
  B {<fieldsB>}
  C {<fieldsC>}
}

We want to compare the state of a given UTxO being spent with the state of the return UTxO: so state0 vs state1. These states will be stored in the Datums, so are initially in Data format (so State::A will be encoded as ConstrData(0, <fieldsA>), State::B as ConstrData(1, <fieldsB>), etc.)

Let's assume we have the following valid state evolution:

The equivalent in code:

(state0, state1).switch{
  (A, A) => {<branch1>}
  (A, C) => {<branch2>}
  (B, B) => {<branch3>}
  (B, C) => {<branch4>}
  (C, C) => error("illegal state change")
  (C, _) => {<branch5>}
  (_, _) => error("illegal state change")
}

If the State is kept as Data such a branching expression is easy to generate using multiple nested ifThenElse calls that compare the ConstrData tags. This approach is not very efficient, requiring many UPLC terms.

Without default case (current situation)

If the State is first converted to SOP, using constr(fstPair(unConstrData(State)), sndPair(unConstrData(State))), we would end up with the following UPLC AST:

case(
  state0AsSOP, 
  (fieldsA0) -> {
    case(
      state1AsSOP,
      (fieldsA1) -> {
        <branch1>
      },
      (fieldsB1) -> {
        error("illegal state change")
      },
      (fieldsC1) -> {
        <branch2>
      }
    )
  },
  (fieldsB0) -> {
     case(
      state1AsSOP,
      (fieldsA1) -> {
        error("illegal state change")
      },
      (fieldsB1) -> {
        <branch3>
      },
      (fieldsC1) -> {
        <branch4>
      }
    )
  },
  (fieldsC0) -> {
    case(
      state1AsSOP,
      (fieldsA1) -> {
        <branch5>
      },
      (fieldsB1) -> {
        <branch5>
      },
      (fieldsC1) -> {
        error("illegal state change")
      }
    )
  }
)

Although this particular example doesn't seem so extreme, you should imagine many more variants, which makes the generated code blow up with many duplicate branches.

In summary: we currently consider SOPs only viable for branching over a single ConstrData value, which has been converted using constr immediately before the branching expression.

With simple default case

I'm assuming the last case would always be used as the default. This wouldn't improve anything for this example because there is no situation where 2 or mores final branches are the same.

So for this example implementing a default case isn't worth the effort.

With fancy default case

This proposal doesn't seem very composable.

Wouldn't it then make more sense to implement a variable-args version of ifThenElse instead?

The arguments of such a term could also be implicitly delayed:

varArgsIfThenElseTerm(<cond0>, <case0>, <cond1>, <case1>, <cond2>, <case2>, <default>)

I also think this would be greatly beneficial to languages with advanced pattern matching syntax like Aiken.

To be honest I think this is too high-level for UPLC, though there seems to be some support for it: #6578

With a builtin that can retrieve the case index

We would be able to keep the State in the example above as SOP throughout the code-base, and simply fallback to nested ifThenElse calls for multi-valued branching situations.

A function for retrieving the constr args would also be needed though.

Are SOPs worth it?

SOPs can currently provide benefit in only a few situations. Are there other situations that I hadn't thought of?

Sorry for being frank, but maybe SOPs aren't worth the complexity they introduce to the CEK machine, and should be removed for UPLC version 4.

The branching functionality of SOPs could simply be replaced by allowing builtin lists to contain lambdas, and allowing a way to define such a list literally (eg. mkCons(<lambda0>, mkCons(<lambda1>, mkLambdaList(())))

I also agree with @nau that we should use either only Data or only SOPs. As SOPs were never able to provide a complete alternative to Data we are now in this awkward situation where we mix both for only minor benefit.

effectfully commented 2 weeks ago

I wouldn't completely write off SOPs yet though, as they do have their use-case: efficient multiple branches.

Yes, but this is exactly what I want to bring to Data: #6602

This is easier to discuss with an example, typically seen in real-life Helios code.

Thanks a lot for providing the example, I agree with your assessment that my idea as written of using the default case is basically useless. I mixed up pattern matching in normal languages and UPLC and suggested something stupid because of that.

With a builtin that can retrieve the case index

A function for retrieving the constr args would also be needed though.

Well, as I'm arguing in the issue description, such builtins have never made it past peer review. Given the relative unimportance of SOPs, I doubt this time would be any different.

Sorry for being frank

Frankness is always appreciated, we're very much interested in users providing honest feedback as that allows us to improve the product.

maybe SOPs aren't worth the complexity they introduce to the CEK machine

I've been gravitating towards the same conclusion. If we can make Data as efficient as SOPs and point people towards Scott-encoding as a fallback solution when putting functions into data types is necessary, then SOPs are perhaps not worth having. But that would only remove the Constr node from the AST, not the Case one (as the latter is gonna be used for Data and other builtins), so it's not like we'd save much. It may be just easier to keep SOPs in the AST anyway, since maintaining them doesn't take much effort. Dunno.

Anyway, we are very much focused on making Data universally better than SOPs, so I'm not just saying these things to avoid improving SOPs.

The branching functionality of SOPs could simply be replaced by allowing builtin lists to contain lambdas

Probably not gonna happen, because it's hard to fit into the current design of UPLC. And if it did, everybody would start complaining that now you can have a list of Integers and a list of terms wrapping Integers, which are two different AST representations for semantically the same thing. Complain like this:

also agree with @nau that we should use either only Data or only SOPs. As SOPs were never able to provide a complete alternative to Data we are now in this awkward situation where we mix both for only minor benefit.

:stuck_out_tongue:

Thank you for this discussion, it really helped me to start disliking SOPs more :slightly_smiling_face:

christianschmitz commented 2 weeks ago

Making case directly more useful for Data is a good spin (#6602), and then keeping the constr term for backward compatibility isn't too big a price to pay.