Closed effectfully closed 1 month ago
constrTermFromConstrData : all (sop :: *). data -> sop
So apart from breaking the type system, the other problem with this is that there is no cheap way for the user to "make it safe". If you are given a Constr
value of the wrong shape, you can't tell without pointlessly pattern-matching on the resulting SOP term. If you don't do that you just have to pass it around and wait for it to explode when you eventually do pattern-match on it (maybe this is fine?).
This is one reason why it would be helpful to be able to pass the shape in (although indeed that causes a lot of complexity).
Thinking about this, it's also notable that this doesn't let us do a super-fast fromData
. Firstly, it gives us a SOP type which has fields of type Data
, which is already awkward since your original type might not look like that, so you might need another type to handle it. Secondly you need to recursively process the fields. So you might end up with something like this:
data MyFoo = A Integer | B [ByteString]
data MyFoo' = A' Data | B' Data
instance UnsafeFromData MyFoo where
unsafeFromBuiltinData d = case constrTermFromConstrData @MyFoo' d of
A' d' -> A (unsafeFromBuiltinData d')
B' d' -> B (unsafeFromBultinData d')
Which incidentally does check the shape by pattern matching immediately! Is this faster than what we do today? I don't know.
@michaelpj sorry, this wasn't ready for review, I was going to add more sections, which I now did, please check them out.
So apart from breaking the type system, the other problem with this is that there is no cheap way for the user to "make it safe". If you are given a Constr value of the wrong shape, you can't tell without pointlessly pattern-matching on the resulting SOP term.
How do you even know how to pattern match on it to ensure safety? Type info is erased, the builtin doesn't know anything about the expectations at runtime unless you make the type info available at runtime like with the singletons approach sketched in the OP. And that latter part is hella inefficient.
Thinking about this, it's also notable that this doesn't let us do a super-fast
fromData
. Firstly, it gives us a SOP type which has fields of typeData
, which is already awkward since your original type might not look like that, so you might need another type to handle it. Secondly you need to recursively process the fields. So you might end up with something like this:
I thought the idea was to speed up the AsData
approach rather than fromData
? So that instead of the O(n) patterns for each data type with vastly inefficient nonsense like this:
unsafeDataAsConstr -> ((==) 0 -> True, [unsafeFromBuiltinData -> i])
we'd have a single neat match like
case unsafeConstrTermFromConstrData ex of
-- 0th constructor with 1 argument
Constructor0_1 (unsafeFromBuiltinData -> i) -> <...>
-- 1th constructor with 2 arguments
Constructor1_2 (unsafeFromBuiltinData -> a1) (unsafeFromBuiltinData -> a2) -> <...>
where Constructor*_*
s compile to anonymous SOPs.
Performance of general fromData
is surely important, but we'll discuss it separately when I finally create an issue about folding builtins.
Or do you think that only optimizing our AsData
and Aiken's handling of data types doesn't cover the entire scope of this issue?
Which incidentally
does
check the shape by pattern matching immediately! Is this faster than what we do today? I don't know.
Yes, but we can't enforce that it's used only this way. Maybe we could wrap it in a CanBlowUp
built-in type and say that the exposure to unsafety is limited by that guard?
How do you even know how to pattern match on it to ensure safety?
I didn't mean we can do it automatically, I meant that the user can do it manually. I.e. if am writing some code that uses constrTermFromConstrData
and I want to be sure that I'm getting the type that I expect out of it, how can I do that? The only option is to pattern match, which will force the error (but also do extra work).
I thought the idea was to speed up the AsData approach rather than fromData?
Yes, it definitely speeds up AsData
but I was thinking it would help both of them! Since fromData
is pretty much just doing the same thing, but recursively.
we'd have a single neat match like
I think you don't need the anonymous SOPs (although that would work) because we can generate an appropriate type for each case:
data MyFoo = MyFoo Data
data MyFoo' = A' Data | B' Data Data
pattern A i -> MyFoo (unsafeConstrTermFromConstrData @MyFoo' -> A' (unsafeFromBuiltinData -> i))
pattern B i1 i2 -> MyFoo (unsafeConstrTermFromConstrData @MyFoo' -> B' (unsafeFromBuiltinData -> i1) (unsafeFromBuiltinData -> i2)
Adding the unsafe constrTermFromConstrData will make it possible for a well-typed case expression to fail with an error message like "expected n arguments to constructor C but got m"
No, you can get far more errors than that. We can produce any a
, so you can do addInteger (unsafeConstrTermFromConstrData d) 1
or (unsafeConstrTermFromConstrData d) 1
, etc. etc. So you can generate essentially all the machine errors that come from non-type correct programs. Of course, we handle those in the UPLC machine, so it's fine in that sense.
Now of course UPLC doesn't have any types and so such an error is already possible in UPLC anyway, however this isn't the case in TPLC or PIR and by extension in UPLC that we get by erasing TPLC.
I think the issues with breaking the type system lie in breaking our assumptions. For example:
let !lies = unsafeFromConstrData d
let ~dead = let !x = case lies of ... in 1
in ...
---->
let !lies = unsafeFromConstrData d
let !x = case lies of ...
let ~dead = 1
in ...
(this isn't quite right but I'm too lazy to get it just right: basically moving a case expression from dead code to live code)
Is this safe? Today we would say that it is: the case
can't fail, so we can move it around freely. In the presence of unsafeFromConstrData
it is no longer safe, because the case might fail, indeed, literally any expression might fail.
So essentially all of our optimizations would become sound "up-to-use-of-unsafeFromConstrData", which is fine but a bit sad.
I didn't mean we can do it automatically, I meant that the user can do it manually. I.e. if am writing some code that uses
constrTermFromConstrData
and I want to be sure that I'm getting the type that I expect out of it, how can I do that? The only option is to pattern match, which will force the error (but also do extra work).
Ah, right.
I think you don't need the anonymous SOPs (although that would work) because we can generate an appropriate type for each case:
That has multiple calls to unsafeConstrTermFromConstrData
, will GHC pull them out and turn into a single call?
No, you can get far more errors than that. We can produce any a, so you can do
addInteger (unsafeConstrTermFromConstrData d) 1
or(unsafeConstrTermFromConstrData d) 1
, etc. etc.
I don't think this can happen? There's a definition of constrTermFromConstrDataDenotation
in the Implementation details section above:
constrTermFromConstrDataDenotation
:: KnownTypeAst TyName DefaultUni sop => Data -> BuiltinResult (Opaque val sop)
constrTermFromConstrDataDenotation (Constr iInteger ds) = do
iWord64 <- integerToWord64BuiltinResult $ fromValue @val iInteger
pure . toConstr iWord64 $ map fromValue ds
constrTermFromConstrDataDenotation _ = evaluationFailure
We do ensure that the given Data
is a Data.Constr
and toConstr
can only return Term.Constr
/CkValue.Constr
/CekValue.Constr
, so I don't think that
indeed, literally any expression might fail.
is correct, but maybe I'm missing something.
I think our biggest problem isn't failures though. We don't do any checking of whether the constructor has the expected number of arguments, we just feed them into the clause represented by a function. If there are too many arguments, we'll get a failure, but what if there are too few of them? The thing will evaluate and we'll get a function back, no failures, and that sounds really dangerous, because that can actually make the script succeed when it really shouldn't. Yeah, this part does sound like a serious issue with embracing the unsafety of the builtin.
That has multiple calls to unsafeConstrTermFromConstrData, will GHC pull them out and turn into a single call?
Dunno, but also we have our own CSE now.
We do ensure that the given Data is a Data.Constr and toConstr can only return Term.Constr/CkValue.Constr/CekValue.Constr
Right, so it will always produce a CekValue.Constr
or fail, but a CekValue.Constr
is not valid in many places! i.e. if the input Data
is a Constr
(so the builtin succeeds), but we invoke the builtin at (a -> b)
, then we will succeed in producing a CekValue.Constr
and then fail when we try to apply that as a function.
If there are too many arguments, we'll get a failure, but what if there are too few of them?
Aaaaaaah. This is horrible. Yet another problem caused by not having case
include explicit binders, I am increasingly regretting that.
So that means you can't even tell if it worked properly by pattern-matching on it...
Right, so it will always produce a
CekValue.Constr
or fail, but aCekValue.Constr
is not valid in many places!
Right, thanks. We could handle that by hardcoding a typing rule for constrTermFromConstrData
stating that it must be applied to a SOP type -- we have more leeway when it comes to checking types as opposed to connecting types with runtime behavior.
So that means you can't even tell if it worked properly by pattern-matching on it...
Yeah, I've now added a description of this issue to the OP. I'm gonna ask research folks to take a look at the whole thing unless you have a bright idea or something.
We could handle that by hardcoding a typing rule for constrTermFromConstrData stating that it must be applied to a SOP type
This would be way too restrictive: for the entirety of the time we are in a typed version of PLC the SOP types are hidden behind type abstractions, so not being able to use this on type variables etc. would be a big problem.
Somebody from the Aiken team (I forgot who it was, sorry) suggested that we could produce a SOP with each of its constructors containing a list of Data
s, so that we get faster branch selection (which is very significant) but still need to process the list of Data
s manually (or maybe with a different builtin). This sounds like it could be easier to achieve, so we should consider it.
Somebody from the Aiken team (I forgot who it was, sorry) suggested that we could produce a SOP with each of its constructors containing a list of
Data
s, so that we get faster branch selection (which is very significant) but still need to process the list ofData
s manually (or maybe with a different builtin). This sounds like it could be easier to achieve, so we should consider it.
It was me in case you have more questions.
We decided to go with pattern matching builtins, so we are not doing this one. Closing in favor of #6225.
Intro
There exist multiple ways to create data types in Plutus Core:
Data
built-in typeSOPs have by far more efficient pattern matching than
Data
(even with the latter properly sped-up). However at the script interface level we haveData
and hence to get that fast pattern matching on SOPs we have to convertData
to a SOP first and this is such an expensive operation that in a lot of cases it completely defeats the purpose as the cost of conversion may significantly outweigh the benefit of having faster pattern matching.A prime example of
Data
being generally preferable over SOPs isScriptContext
, since values of this type are huge and a script doesn't need to access all of them. This problem is discussed in detail in this CIP, which proposes to makeScriptContext
into a SOP instead of the currentData
. This is however problematic in a number of ways:Data
sufficient, particularly since usingData
is simply faster in many casesWhat if instead of trying to force SOPs upon everybody whereever possible we simply make pattern matching on
Data
-encoded data types more efficient, making it competitive with pattern matching on SOPs?High-level idea
There's one deceptively simple way of making pattern matching on
Data
-encoded data types faster: just convert aData.Constr
node into the respectiveTerm.Constr
node and case on the latter instead using the fastcase
primitive.This is how
Constr
is defined forData
:This is how
Constr
is defined for UPLCTerm
:Any
Data.Constr
is embeddable intoTerm
viaInteger
value intoWord64
when possible (this doesn't work with certain interpretations of theInteger
value such as this one, but we ignore those within the scope of this discussion and assume that eachData
object represents a well-defined SOP term withInteger
value being indices of constructors of the encoded data type)[Data]
value into[Term name uni fun ann]
by wrapping eachData
with theConstant
constructor ofTerm
(plus some additional plumbing that we won't get into)For example
will become
(where
Some
andValueOf
are the additional plumbing that we don't get into, since it's largely irrelevant and will likely change anyway)I.e. the outer
Data.Constr
becomes aTerm.Constr
and the contents of the former get embedded as-is, allowing us tocase
on the new value.We're going to call the built-in function implementing this behavior
constrTermFromConstrData
.Differences with faster pattern matching for built-in types
5711 elaborates on the approaches for speeding up pattern matching for built-in types, which may sound directly related to the matters described in this issue, however the two discussions are entirely tangential. Making pattern matching for built-in types faster impacts not only
Data
, but also lists and pairs and potentially other built-in types (maybe evenBool
). However withoutconstrTermFromConstrData
there's no way to access all the arguments of a constructor at once in a singlecase
expression. I.e. without that builtin we'll have to keep doing what we do now, namely recurse onds
fromConstr i ds
and extract arguments one-by-one using pattern matching on built-in lists. That recursion happens on the Plutus side, which is very inefficient, plus it requires a linear amount of builtin calls (regardless of how pattern matching for built-in types is implemented) while withconstrTermFromConstrData
only one builtin call is required (we'll still need a linear amount of work to feed the arguments of the constructor to the function from the respective branch, but that is supposed to be much more efficient than doing builtin calls).Implementation details
Feel free to skip this section, it's here for completeness.
In practice, we don't embed
Data.Constr
intoTerm.Constr
, because at runtime built-in functions receives values, not terms, so we need to embedData.Constr
into the type of values that the specific evaluator uses. It's the same logic as with constants and we use a very similar machinery.So a more accurate name would be
constrValueFromConstrData
or something, but the users don't really care about these trivialities, so we're going to stick toconstrTermFromConstrData
, since it's a decent enough and more specific name, plus it's easier to explain the high-level details this way andconstrTermFromConstrData
kinda looks better in a program ("value" is way too overloaded).The Haskell denotation of the Plutus builtin looks like this:
which follows the logic outlined above: the
Integer
value becomes aWord64
and theData
arguments get turned into values usingfromValue
(whereValue
means a Haskell value, not what we've been discussing so far, told you it was way too overloaded).A partial implementation can be found in this PR.
Typing issue
The runtime semantics of this builtin are trivial, but what type is that builtin supposed to have? Well, a good start is
which says that the builtin converts
data
values to an arbitrary type that is namedsop
, but can be really anything. But this is an overly permissive type signature: we don't want this builtin to claim that it can produce anything, we only want it to claim to produce SOPs. We could try thinking of somehow insisting thatsop
is indeed a SOP (I've no idea how we could achieve that), but even then, we don't want it to be any SOP, we want it to be some very specific SOP, for example:If our builtin was limited to
[[data, data], [], [data]]
, then we could check at runtime thatds
inConstr 0 ds
has two elementsds
inConstr 1 ds
has zero elementsds
inConstr 2 ds
has one elementand that builtin would be reasonably typed and safe to have.
But how could we possibly generalize this example version of the builtin without the ability to quantify over the shape of SOPs and constrain those shapes to have a runtime representation, so that it's possible to check the
Data.Constr
s coming in against that representation? In Haskell we could write something likebut Plutus isn't Haskell and we can't turn such a Haskell function into a built-in one as we can't quantify over
shape
, let alone constrain it with a type class.Maybe with dependent typing we could somehow connect the given
data
value with its expected shape, but this doesn't sound particularly nice either and we don't have dependent types in the general Plutus any way, let alone in its builtins.Builtin families
It is clear that enumerating all practical (let alone possible) instantiations of
constrTermFromConstrData
as individual builtins isn't gonna work: we're not adding hundreds/thousands of new built-in functions. What we can try however is makingconstrTermFromConstrData
a family of built-in functions rather than a single builtin, so thatconstrTermFromConstrDataExample
is just one member of the family. I.e. we'd have a new constructor ofDefaultFun
that unlike all other constructors takes an argument with the argument being the shape of the final SOP:This may rule out some use cases (perhaps if one decides to compile an extensional type theory with W-types to Plutus or something), but this seems so insignificant that it's probably not even worth any additional discussion. How often does the programmer not know the arity of a constructor or the number of constructors of a data type?
The real problem is that adding builtin families invalidates all of our assumptions about builtins, which are, in the language of Haskell constraints,
Enum
,Bounded
andIx
. If you have an infinite number of builtins, you can no longer put something for each of them in an array. You can't put their type signatures in golden files or print them to the user. And we do these kinds of things all over the place.So while retrofitting builtin families into Plutus seems possible in principle, the costs of such an endeavor would be huge and so would be the risk of screwing something up along the way.
Singletons
Introducing builtin families sorta amounts to allowing for type applications to be unerasable for builtins. Which is hard in our existing system, but there's a common way to emulate unerasable types: singletons. Normally singletons require GADTs (which are hard and unergonomic in Plutus Core (see examples) to the point that we prefer to think of them as entirely unsupported), however in our case we might get away with only using a GADT on the Haskell side.
The idea is that we implement the
constrTermFromConstrData
builtin such that it has the following Plutus type signature:where
sopOfData
is a built-in type that restrictssop
to be a sum-of-products-of-data (such as theSOP [[data, data], [], [data]]
example from the above). Each value of a built-in type comes with a runtime type tag, which in this case we can use to establish the connection between the type information and the runtime content ofsopOfData
thereby essentially propagating type info to the runtime where it can be used to ensure the safety of the conversion fromData
tosop
.Assuming it's going to work out, we'll still run into a problem: doing checks at runtime is expensive. Not only will we need to construct and pass around that
sopOfData
singleton, we'll also need to match on it to ensure that it encodes the exact shape of the givendata
object. All for the sake of making our type system better. Would, say, Aiken users care about those checks slowing their smart contracts down? I doubt it.What exactly breaks if we add the unsafe
constrTermFromConstrData
?Adding the unsafe
constrTermFromConstrData
will make it possible for a well-typedcase
expression to fail with an error message like "expectedn
arguments to constructorC
but gotm
" (I'm shooting for the unreachable ideal error message here). Now of course UPLC doesn't have any types and so such an error is already possible in UPLC anyway, however this isn't the case in TPLC or PIR and by extension in UPLC that we get by erasing TPLC.Is this a big deal? I don't know. What I know is this is kinda a not-too-unsafe form of
unsafeCoerce
. Right now one can implement a rather boringsafeCoerce
function that returns its argument back but with a different type:One might use if
a
atype
synonym tob
for example. If it so happens at runtime thata
isn't definitionally equal tob
, then it'll be an immediate evaluation failure (unless we have some really weirdToData
/FromData
instances, which I hope we don't).Adding the unsafe builtin will allow us to define
that does not immediately fail with an error at runtime in case
a
andb
aren't definitionally equal. Instead, it'll return a term that fails beingcase
d on (if the other parts of thecase
-expression are well-typed themselves).So how bad is this? Quite bad actually! Consider the following UPLC expression (pseudo-syntax):
It converts a
Data
valued
to aTerm
, i.e. aConstr
with a bunch of fields, and then applies the\x y -> y
function to the two fields (recall that in UPLC branches of case expressions are merely functions). But what ifconstrTermFromConstrData d
returns aConstr
with three fields rather than two? Well, then the function gets applied to too many arguments and depending on whaty
is we'll likely get an error sooner or later. Likely, but not necessarily: what ify
is a function? Then it gets fed the extra argument that appears last in the overappliedConstr
and the semantics of the term completely derails from what was originally intended by the programmer, meaning at that point they're at the mercy of undefined behavior.The opposite situation when
Constr
has only one argument is equally bad. In that case the\x y -> y
gets applied to only one argument and the whole expression reduces to\y -> y
and if evaluation stops here, then that's evaluation success! Certainly something that the programmer did not intend to end up with. UPD this doesn't apply to Plutus V3 though, because we're making it necessary for scripts to return()
, everything else is a failure.Note that even if we knew how to give a type to
constrTermFromConstrData
, we'd still have all these issues because the on-chain language is untyped. Even if we can guarantee that our toolchain never produces such mess, we cannot guarantee that other toolchains don't, particularly those that target UPLC directly without using any type system.Conclusions
The unsafe
constrTermFromConstrData
is a horrible footgun that we should not add to the language, I believe. Maybe we can make singletons perform reasonably fast, this should be investigated.