argumentcomputer / yatima

A zero-knowledge Lean4 compiler and kernel
MIT License

[RFC] Generating Lurk proofs of typechecking #236

Closed arthurpaulino closed 1 year ago

arthurpaulino commented 1 year ago

What we've achieved so far

These are all amazing achievements, and now... onto the next level πŸ‘€

Current pain points

Our current method for generating Lurk sources of typechecking relies on hand-writing the TC.Store that feeds the typechecker. Hand-writing the TC.Store instance is not only annoying but also troublesome in some ways (elaboration can get stuck easily). And that's not the only issue: the proof of typechecking becomes store-dependent. That is, we may typecheck the same constant, but if the constant is read from a different store, we end up with a completely different Lurk proof. In order to address both points, we will need to change the approach.

A plan of action

I will try to wrap up everyone's ideas so that we can close the logical gaps in our solution. We've already agreed internally that we will use Lurk's commit/open functionality to replace the TC.Store in the final Lurk code. That is, we want to replace some TC.getConst (hash : F) with a binder (|TC.getConst| (lambda (|hash|) (open |hash|))) in Lurk. This implies two things:

  1. We need to format the Lurk datums exactly as we do in the code generator, so that open x will load the very same data as the resulting representation of inductives (something like (CONS <ctor name> (CONS <ctor idx> ...))); see the sketch right after this list
  2. We need to know their Poseidon hashes so we can use those hashes as the input of getConst (the respective function in Lurk will then consume Poseidon hashes)
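To make the first point concrete, here is a minimal sketch. The Datum type is a toy stand-in for LDON (I'm not pinning down the real constructors here), and the NIL terminator is just an assumption about how the argument list would end; the only point is the nested-cons shape the code generator must reproduce exactly:

-- Hypothetical stand-in for LDON, only to illustrate the shape
inductive Datum
  | sym  : String → Datum
  | num  : Nat    → Datum
  | cons : Datum  → Datum → Datum

-- (CONS <ctor name> (CONS <ctor idx> <encoded args ...>))
def encodeCtor (name : String) (idx : Nat) (args : List Datum) : Datum :=
  Datum.cons (Datum.sym name)
    (Datum.cons (Datum.num idx) (args.foldr Datum.cons (Datum.sym "NIL")))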

This implies that our typechecker has to work on top of something like our content-addressed anonymized data. Let me start with a draft of the types we might want.

structure IR.UnivAnonHash  where data : ByteArray
structure IR.UnivMetaHash  where data : ByteArray
structure IR.ExprAnonHash  where data : ByteArray
structure IR.ExprMetaHash  where data : ByteArray
structure IR.ConstAnonHash where data : ByteArray
structure IR.ConstMetaHash where data : ByteArray

This is similar to what we already have, but I'm removing the Kind tag so we don't have to carry it over and replicate it on every node of every piece of data. The tags make the Lean code shorter, but also more complex. Also, replicating tags everywhere makes hashing (a little bit) slower and increases the complexity of the serialized data.

Note that I'm using ByteArray to represent the hashes. That's because LightData will hash to ByteArray (once https://github.com/yatima-inc/YatimaStdLib.lean/issues/60 is finished). Moving on.

inductive IR.UnivAnon | ...
inductive IR.UnivMeta | ...

inductive IR.ExprAnon
  | const : ConstAnonHash β†’ List UnivAnonHash β†’ ExprAnon
  | lam   : ExprAnonHash β†’ ExprAnonHash β†’ ExprAnon
  | ...

inductive IR.ExprMeta
  | const : ConstMetaHash β†’ List UnivMetaHash β†’ ExprMeta
  | lam   : Name β†’ BinderInfo β†’ ExprMetaHash β†’ ExprMetaHash β†’ ExprMeta
  | ...

inductive IR.ConstAnon | ...
inductive IR.ConstMeta | ...

Still pretty similar to what we already have (except for the removal of Kind tags). Now, the TC stuff would change a lot:

inductive TC.Univ | ...

inductive TC.Expr
  | const : F β†’ List F β†’ Expr
  | lam   : F β†’ F β†’ Expr
  | ...

inductive TC.Const | ...

Those inductives have the same shapes as the IR.<...>Anon types. The main difference is that here we use Poseidon hashes instead of other hashing algorithms. This keeps the complexity of encoding things the way Lurk does restricted to where it's really needed. Keep in mind that encoding data and generating hashes like Lurk does is expensive and consumes a lot of memory, potentially GBs even for small/medium sized code bases. Important point: we don't need to produce this type of data for the "meta" data, only for the "anon" portion of the IR data (we want the Lurk proofs to be agnostic to the names defined in the original Lean code).
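As a rough illustration of how the TC types relate to the IR anon types (a hedged sketch, not the planned implementation), one could hydrate a TC.Expr from an IR.ExprAnon by translating every child hash through lookups from LightData hashes to their Poseidon counterparts; names and binder info never enter the picture, which matches the point that only the anon portion needs Lurk-style hashing:

-- Hypothetical sketch built on the draft types above; the three lookup
-- functions are assumed to come out of the Lurk-encoding pass.
def exprAnonToTC
    (univToF  : IR.UnivAnonHash  → Option F)
    (exprToF  : IR.ExprAnonHash  → Option F)
    (constToF : IR.ConstAnonHash → Option F) :
    IR.ExprAnon → Option TC.Expr
  | .const c us => do pure <| TC.Expr.const (← constToF c) (← us.mapM univToF)
  | .lam ty bod => do pure <| TC.Expr.lam (← exprToF ty) (← exprToF bod)
  | _ => none -- remaining constructors elided, mirroring the `| ...` above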

And this would be all the data that comes out of the content-addressing routine:

structure Yatima.Store where
  irUnivAnon  : RBMap UnivAnonHash  IR.UnivAnon
  irUnivMeta  : RBMap UnivMetaHash  IR.UnivMeta
  irExprAnon  : RBMap ExprAnonHash  IR.ExprAnon
  irExprMeta  : RBMap ExprMetaHash  IR.ExprMeta
  irConstAnon : RBMap ConstAnonHash IR.ConstAnon
  irConstMeta : RBMap ConstMetaHash IR.ConstMeta
  irConsts    : RBMap ConstMetaHash ConstAnonHash

  lurkCache    : RBMap LDON F
  lurkUnivMap  : RBMap UnivAnonHash  (F Γ— LDON)
  lurkExprMap  : RBMap ExprAnonHash  (F Γ— LDON)
  lurkConstMap : RBMap ConstAnonHash (F Γ— LDON)

  tcUniv  : RBMap F TC.Univ
  tcExpr  : RBMap F TC.Expr
  tcConst : RBMap F TC.Const

structure Yatima.Env where
  meta   : -- hold information about content-addressing, like hashing algorithms and versions
  consts : RBMap Name (ConstAnonHash Γ— ConstMetaHash)

Yes, that's a lot of data! But it can be reused across different content-addressing rounds via FS persistence.
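Just to sketch what that persistence could look like (assuming some Store.encode/Store.decode pair on top of LightData that doesn't exist yet; only the IO file functions are real Lean API):

-- Hypothetical sketch of FS persistence; `encode`/`decode` are assumed helpers
-- returning/consuming a ByteArray (decode yielding `Except String Yatima.Store`).
def Yatima.Store.dump (store : Yatima.Store) (path : System.FilePath) : IO Unit :=
  IO.FS.writeBinFile path store.encode

def Yatima.Store.load (path : System.FilePath) : IO Yatima.Store := do
  let bytes ← IO.FS.readBinFile path
  match Yatima.Store.decode bytes with
  | .ok store  => pure store
  | .error err => throw <| IO.userError err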

The CLI API we want looks like this:

yatima ca <Lean 4 source> -s foo.ystore -e myenv.yenv
yatima tc <some Lean 4 declaration> -s foo.ystore -e myenv.yenv

Revamping primitive CID detection

Currently, the detection of primitive operations is done in the content-addresser, which makes the outcome (their indices in the array of constants) dependent on the source being content-addressed. This approach also forces us to know such hashes before we even start content-addressing arbitrary Lean code.

We should leave this identification as a self-contained task in the typechecker. We can keep the match-by-hash strategy with the help of the new yatima pp command.
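A hedged sketch of what that match-by-hash could look like inside the typechecker (the table contents would come from yatima pp; everything below is a placeholder, not real API):

-- Hypothetical sketch: primitive detection as a self-contained typechecker task.
inductive PrimConst
  | nat | natAdd | natMul
  -- | ...

def detectPrim (primTable : List (F × PrimConst)) (hash : F) : Option PrimConst :=
  primTable.lookup hash -- assumes a `BEq F` instance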

Final thoughts

cognivore commented 1 year ago

I understood everything except why the solution actually solves the problem, plus a few other things.

Why is the solution a solution?

I assume it's because the binders side-step the 🍝 situation with Lurk stores (untangling them)?

For someone who hasn't gone into the details of how Lurk works yet and only has a surface understanding, it would be nice to learn more about this.

Why is it expensive to hash?

Keep in mind that encoding data and generating hashes like Lurk does is expensive and consumes a lot of memory, potentially GBs even for small/medium sized code bases

What's the bottleneck? Pardon my ignorance, but I think it has to be clarified for those who don't have a deep understanding.

Are IR.*Anon objects significantly faster to generate? I assume so...

Commits and sizes

You say "it's a lot of data" and mention FS persistence. But does it mean that these commit <LDON node> calls will have to be as big in size as the sum of all the persisted mappings?

arthurpaulino commented 1 year ago

@cognivore I believe the proposed solution is indeed a solution because the typechecker would dereference universes, expressions and constants by looking up a hash that coincides with the hashes that the Lurk evaluator works with. And since we know their respective LDON objects, we can assemble the final Lurk code with them before doing open.
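In code, the off-chain side of that dereference could look like the sketch below (my own hypothetical illustration, assuming the usual RBMap find?); in the emitted Lurk source the same name would instead be bound to (lambda (hash) (open hash)), so both sides resolve constants through the same Poseidon hash:

-- Hypothetical sketch of the off-chain lookup the typechecker would perform.
def TC.getConst (store : Yatima.Store) (hash : F) : Except String TC.Const :=
  match store.tcConst.find? hash with
  | some c => .ok c
  | none   => .error "unknown constant hash"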

Hashing data as Lurk does is expensive because of the encoding:

In contrast, IR data would be serialized with LightData, which is expressive and efficient.

About the commits: the LDON nodes are very cheap. The expensive part is knowing their hashes (the type F in my text).