argumentcomputer / yatima

A zero-knowledge Lean4 compiler and kernel
MIT License

[RFC] Generating Lurk proofs of typechecking #236

Closed arthurpaulino closed 1 year ago

arthurpaulino commented 1 year ago

What we've achieved so far

These are all amazing achievements, and now... onto the next level πŸ‘€

Current pain points

Our current method for generating Lurk sources of typechecking relies on hand-writing the TC.Store that feeds the typechecker. Hand-writing the TC.Store instance is not only annoying but also troublesome in some ways (elaboration can get stuck easily). And that's not the only issue: the proof of typechecking becomes store-dependent. That is, we may typecheck the same constant, but if the constant is read from a different store, we end up with a completely different Lurk proof. In order to address both points, we will need to change the approach.

A plan of action

I will try to wrap up everyone's ideas so that we can close the logical gaps in our solution. We've already agreed internally that we will use Lurk's commit/open functionality to replace the TC.Store in the final Lurk code. That is, we want to replace some TC.getConst (hash : F) with a binder (|TC.getConst| (lambda (|hash|) (open |hash|))) in Lurk. This implies two things:

  1. We need to format the Lurk datums exactly as we do in the code generator, so that open x will load the very same data as the resulting representation of inductives (something like (CONS <ctor name> (CONS <ctor idx> ...))); see the sketch right after this list
  2. We need to know their Poseidon hashes so we can use those hashes as the input of getConst (the respective function in Lurk will then consume Poseidon hashes)
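To make the first point concrete, here is a minimal sketch. The Datum type is a toy stand-in for LDON (I'm not pinning down the real constructors here), and the NIL terminator is just an assumption about how the argument list would end; the only point is the nested-cons shape the code generator must reproduce exactly:

-- Hypothetical stand-in for LDON, only to illustrate the shape
inductive Datum
  | sym  : String → Datum
  | num  : Nat    → Datum
  | cons : Datum  → Datum → Datum

-- (CONS <ctor name> (CONS <ctor idx> <encoded args ...>))
def encodeCtor (name : String) (idx : Nat) (args : List Datum) : Datum :=
  Datum.cons (Datum.sym name)
    (Datum.cons (Datum.num idx) (args.foldr Datum.cons (Datum.sym "NIL")))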

This implies that our typechecker has to work on top of something like our content-addressed anonymized data. Let me start with a draft of the types we might want.

structure IR.UnivAnonHash  where data : ByteArray
structure IR.UnivMetaHash  where data : ByteArray
structure IR.ExprAnonHash  where data : ByteArray
structure IR.ExprMetaHash  where data : ByteArray
structure IR.ConstAnonHash where data : ByteArray
structure IR.ConstMetaHash where data : ByteArray

This is similar to what we already have, but I'm removing the Kind tag so we don't have to carry it over and replicate it on every node of every piece of data. The tags make the Lean code shorter, but also more complex. Also, replicating tags everywhere makes hashing (a little bit) slower and increases the complexity of the serialized data.

Note that I'm using ByteArray to represent the hashes. That's because LightData will hash to ByteArray (once https://github.com/yatima-inc/YatimaStdLib.lean/issues/60 is finished). Moving on.

inductive IR.UnivAnon | ...
inductive IR.UnivMeta | ...

inductive IR.ExprAnon
  | const : ConstAnonHash β†’ List UnivAnonHash β†’ ExprAnon
  | lam   : ExprAnonHash β†’ ExprAnonHash β†’ ExprAnon
  | ...

inductive IR.ExprMeta
  | const : ConstMetaHash β†’ List UnivMetaHash β†’ ExprMeta
  | lam   : Name β†’ BinderInfo β†’ ExprMetaHash β†’ ExprMetaHash β†’ ExprMeta
  | ...

inductive IR.ConstAnon | ...
inductive IR.ConstMeta | ...

Still pretty similar to what we already have (except for the removal of Kind tags). Now, the TC stuff would change a lot:

inductive TC.Univ | ...

inductive TC.Expr
  | const : F β†’ List F β†’ Expr
  | lam   : F β†’ F β†’ Expr
  | ...

inductive TC.Const | ...

Those inductives have the same shapes as the IR.<...>Anon types. The main difference is that here we use Poseidon hashes instead of other hashing algorithms. This keeps the complexity of encoding things the way Lurk does restricted to where it's really needed. Keep in mind that encoding data and generating hashes like Lurk does is expensive and consumes a lot of memory, potentially GBs even for small/medium sized code bases. Important point: we don't need to produce this type of data for the "meta" data, only for the "anon" portion of the IR data (we want the Lurk proofs to be agnostic to the names defined in the original Lean code).
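As a rough illustration of how the TC types relate to the IR anon types (a hedged sketch, not the planned implementation), one could hydrate a TC.Expr from an IR.ExprAnon by translating every child hash through lookups from LightData hashes to their Poseidon counterparts; names and binder info never enter the picture, which matches the point that only the anon portion needs Lurk-style hashing:

-- Hypothetical sketch built on the draft types above; the three lookup
-- functions are assumed to come out of the Lurk-encoding pass.
def exprAnonToTC
    (univToF  : IR.UnivAnonHash  → Option F)
    (exprToF  : IR.ExprAnonHash  → Option F)
    (constToF : IR.ConstAnonHash → Option F) :
    IR.ExprAnon → Option TC.Expr
  | .const c us => do pure <| TC.Expr.const (← constToF c) (← us.mapM univToF)
  | .lam ty bod => do pure <| TC.Expr.lam (← exprToF ty) (← exprToF bod)
  | _ => none -- remaining constructors elided, mirroring the `| ...` above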

And this would be all the data that comes out of the content-addressing routine:

structure Yatima.Store where
  irUnivAnon  : RBMap UnivAnonHash  IR.UnivAnon
  irUnivMeta  : RBMap UnivMetaHash  IR.UnivMeta
  irExprAnon  : RBMap ExprAnonHash  IR.ExprAnon
  irExprMeta  : RBMap ExprMetaHash  IR.ExprMeta
  irConstAnon : RBMap ConstAnonHash IR.ConstAnon
  irConstMeta : RBMap ConstMetaHash IR.ConstMeta
  irConsts    : RBMap ConstMetaHash ConstAnonHash

  lurkCache    : RBMap LDON F
  lurkUnivMap  : RBMap UnivAnonHash  (F Γ— LDON)
  lurkExprMap  : RBMap ExprAnonHash  (F Γ— LDON)
  lurkConstMap : RBMap ConstAnonHash (F Γ— LDON)

  tcUniv  : RBMap F TC.Univ
  tcExpr  : RBMap F TC.Expr
  tcConst : RBMap F TC.Const

structure Yatima.Env where
  meta   : -- hold information about content-addressing, like hashing algorithms and versions
  consts : RBMap Name (ConstAnonHash Γ— ConstMetaHash)

Yes, that's a lot of data! But it can be reused across different content-addressing rounds via FS persistence.
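Just to sketch what that persistence could look like (assuming some Store.encode/Store.decode pair on top of LightData that doesn't exist yet; only the IO file functions are real Lean API):

-- Hypothetical sketch of FS persistence; `encode`/`decode` are assumed helpers
-- returning/consuming a ByteArray (decode yielding `Except String Yatima.Store`).
def Yatima.Store.dump (store : Yatima.Store) (path : System.FilePath) : IO Unit :=
  IO.FS.writeBinFile path store.encode

def Yatima.Store.load (path : System.FilePath) : IO Yatima.Store := do
  let bytes ← IO.FS.readBinFile path
  match Yatima.Store.decode bytes with
  | .ok store  => pure store
  | .error err => throw <| IO.userError err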

The CLI API we want looks like this:

yatima ca <Lean 4 source> -s foo.ystore -e myenv.yenv
yatima tc <some Lean 4 declaration> -s foo.ystore -e myenv.yenv

Revamping primitive CID detection

Currently, the detection of primitive operations is done in the content-addresser, which makes the outcome (their indices in the array of constants) dependent on the source being content-addressed. This approach also forces us to know such hashes before we even start content-addressing arbitrary Lean code.

We should leave this identification as a self-contained task in the typechecker. We can keep the match-by-hash strategy with the help of the new yatima pp command.
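A hedged sketch of what that match-by-hash could look like inside the typechecker (the table contents would come from yatima pp; everything below is a placeholder, not real API):

-- Hypothetical sketch: primitive detection as a self-contained typechecker task.
inductive PrimConst
  | nat | natAdd | natMul
  -- | ...

def detectPrim (primTable : List (F × PrimConst)) (hash : F) : Option PrimConst :=
  primTable.lookup hash -- assumes a `BEq F` instance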

Final thoughts

cognivore commented 1 year ago

I understood everything except why the solution actually solves the problem, plus a few other things.

Why is the solution a solution?

I assume it's because the binders side-step the 🍝 situation with Lurk stores (untangling them)?

For someone who hasn't gone into the details of how Lurk works yet and only has a surface understanding, it would be nice to learn more about this.

Why is it expensive to hash?

Keep in mind that encoding data and generating hashes like Lurk does is expensive and consumes a lot of memory, potentially GBs even for small/medium sized code bases

What's the bottleneck? Pardon my ignorance, but I think it has to be clarified for those who don't have a deep understanding.

Are IR.*Anon objects significantly faster to generate? I assume so...

Commits and sizes

You say "it's a lot of data" and mention FS persistence. But does it mean that these commit <LDON node> calls will have to be as big in size as the sum of all the persisted mappings?

arthurpaulino commented 1 year ago

@cognivore I believe the proposed solution is indeed a solution because the typechecker would dereference universes, expressions and constants by looking up a hash that coincides with the hashes that the Lurk evaluator works with. And since we know their respective LDON objects, we can assemble the final Lurk code with them before doing open.
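In code, the off-chain side of that dereference could look like the sketch below (my own hypothetical illustration, assuming the usual RBMap find?); in the emitted Lurk source the same name would instead be bound to (lambda (hash) (open hash)), so both sides resolve constants through the same Poseidon hash:

-- Hypothetical sketch of the off-chain lookup the typechecker would perform.
def TC.getConst (store : Yatima.Store) (hash : F) : Except String TC.Const :=
  match store.tcConst.find? hash with
  | some c => .ok c
  | none   => .error "unknown constant hash"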

Hashing data as Lurk does is expensive because of the encoding:

In contrast, IR data would be serialized with LightData, which is expressive and efficient.

About the commits: the LDON nodes are very cheap. The expensive part is knowing their hashes (the type F in my text).