arthurpaulino closed this issue 1 year ago
I understood everything except why the solution actually solves the problem, plus a few other things.
I assume it's because the binders side-step the situation with Lurk stores (untangling them)?
For a person who hasn't gone into the details of how Lurk works yet and only has a surface-level understanding, it would be nice to learn more about this.
You wrote: "Keep in mind that encoding data and generating hashes like Lurk does is expensive and consumes a lot of memory, potentially GBs even for small/medium sized code bases."
What's the bottleneck? Pardon my ignorance, but I think it has to be clarified for those who don't have a deep understanding.
Are IR.*Anon objects significantly faster to generate? I assume so...
You say "it's a lot of data" and mention FS persistence. But does it mean that these commit <LDON node>
calls will have to be as big in size as the sum of all the persisted mappings?
@cognivore
I believe the proposed solution is indeed a solution because the typechecker would dereference universes, expressions and constants by looking up a hash that coincides with the hashes that the Lurk evaluator works with. And we know their respective LDON objects, so we can assemble the final Lurk code with them before doing open.
Hashing data as Lurk does is expensive because of the encoding: 32 bytes for a strcons tag, then 64 bytes for the character 'y' (char tag + the character itself), then 32 bytes for another strcons tag, then 64 bytes for the character 'a'... In contrast, IR data would be serialized with LightData, which is expressive and efficient.
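To make the cost concrete, here is a back-of-the-envelope sketch (my own estimate, assuming the first strcons tag also costs 32 bytes and ignoring the string terminator) of how the per-character overhead adds up:

```lean
-- Rough size estimate for Lurk's string encoding as described above:
-- each character costs 32 bytes for a `strcons` tag plus 64 bytes for the
-- character itself (`char` tag + payload). Terminator bytes are ignored.
def lurkStringEncodingBytes (s : String) : Nat :=
  s.length * (32 + 64)

#eval lurkStringEncodingBytes "yatima"  -- 576 bytes just to hash a 6-character name
```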
About the commits, the LDON nodes are very cheap. The expensive part is knowing their hashes (type F in my text).
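A minimal sketch of that distinction, using hypothetical stand-ins for LDON, F and the hashing routine (not the actual definitions): building nodes is plain constructor application, while the Poseidon hash is the costly step and is therefore worth caching.

```lean
-- Hypothetical stand-ins: `F` for the Poseidon field element, `LDON` for Lurk
-- data nodes, and `poseidonHash` for the (expensive) hashing routine.
abbrev F := Nat

inductive LDON
  | num  : Nat → LDON
  | str  : String → LDON
  | cons : LDON → LDON → LDON
  deriving BEq

def poseidonHash (_ : LDON) : F := 0  -- placeholder for the real, costly hash

-- Building an `LDON` node is cheap; knowing its hash is not, so repeated
-- nodes are memoized (cf. the `lurkCache` mentioned further down).
def hashWithCache (cache : List (LDON × F)) (node : LDON) : F × List (LDON × F) :=
  match cache.find? (fun (n, _) => n == node) with
  | some (_, f) => (f, cache)
  | none        => let f := poseidonHash node; (f, (node, f) :: cache)
```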
What we've achieved so far
These are all amazing achievements, and now... onto the next level!
Current pain points
Our current method to generate Lurk sources for typechecking relies on us hand-writing the TC.Store that feeds the typechecker. Hand-writing the TC.Store instance is not only annoying, but also troublesome in some ways (elaboration can get stuck easily). That's not the only issue: the proof of typechecking becomes store-dependent. That is, we may typecheck the same constant, but if the constant is read from different stores then we will have a completely different Lurk proof. In order to address both points, we will need to change the approach.

A plan of action
I will try to wrap up everyone's ideas so that we can close the logical gaps in our solution. We've already agreed internally that we will use Lurk's commit/open functionality to replace the TC.Store in the final Lurk code. That is, we want to overwrite some TC.getConst (hash : F) with a binder (|TC.getConst| (lambda (|hash|) (open |hash|))) in Lurk. This implies two things:
- open x will load the very same data as if it were the resulting representation of inductives (something like (CONS <ctor name> (CONS <ctor idx> ...)))
- getConst (and then the respective function, in Lurk) will consume Poseidon hashes

It implies that our typechecker has to work on top of something like our content-addressed anonymized data. Let me start with a draft of the types we might want.
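Something along these lines, perhaps (a rough illustrative sketch; the names and exact constructors are placeholders, not final definitions): hash references are plain ByteArrays and there is no Kind tag on the wrappers.

```lean
-- Hypothetical hash wrappers: just a `ByteArray`, no `Kind` tag.
structure UnivHash  where hash : ByteArray
structure ExprHash  where hash : ByteArray
structure ConstHash where hash : ByteArray

-- Anonymized universes that reference other universes by hash only
-- (constructor set mirrors Lean's universe levels; names are illustrative).
inductive UnivAnon
  | zero
  | succ : UnivHash → UnivAnon
  | max  : UnivHash → UnivHash → UnivAnon
  | imax : UnivHash → UnivHash → UnivAnon
  | var  : Nat → UnivAnon
```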
This is similar to what we already have, but I'm removing the Kind tag so we don't have to carry it over and replicate it on every node of every piece of data. The tags make the Lean code shorter, but also more complex. Also, replicating tags everywhere makes hashing (a little bit) slower and increases the complexity of the serialized data.

Note that I'm using ByteArray to represent the hashes. That's because LightData will hash to ByteArray (once https://github.com/yatima-inc/YatimaStdLib.lean/issues/60 is finished). Moving on.
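In the same illustrative spirit, anonymized expressions could reference their subterms by hash as well (the constructor set below mirrors Lean's kernel expressions and is only a placeholder, not the actual IR):

```lean
-- Hypothetical anonymized expressions: every subterm, universe or constant is
-- referenced through a `ByteArray` hash, and there is no `Kind` tag.
inductive ExprAnon
  | var   : Nat → ExprAnon
  | sort  : (univ : ByteArray) → ExprAnon
  | const : (const : ByteArray) → (univs : List ByteArray) → ExprAnon
  | app   : (fn arg : ByteArray) → ExprAnon
  | lam   : (type body : ByteArray) → ExprAnon
  | pi    : (type body : ByteArray) → ExprAnon
  | letE  : (type value body : ByteArray) → ExprAnon
  | lit   : Nat → ExprAnon
  | proj  : (idx : Nat) → (struct : ByteArray) → ExprAnon
```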
Still pretty similar to what we already have (except for the removal of Kind tags). Now, the TC stuff would change a lot:
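Roughly like this, perhaps (again an illustrative sketch; F below is a stand-in for the Poseidon field element type and the names are placeholders): same shapes, but every reference is a Poseidon hash.

```lean
-- Stand-in for the Poseidon field element the Lurk evaluator works with.
abbrev F := Nat

-- Hypothetical typechecker-side types: same shapes as the anonymized IR
-- sketches above, but every reference is a Poseidon hash (`F`).
inductive TCUniv
  | zero
  | succ : F → TCUniv
  | max  : F → F → TCUniv
  | imax : F → F → TCUniv
  | var  : Nat → TCUniv

inductive TCExpr
  | var   : Nat → TCExpr
  | sort  : (univ : F) → TCExpr
  | const : (const : F) → (univs : List F) → TCExpr
  | app   : (fn arg : F) → TCExpr
  | lam   : (type body : F) → TCExpr
  | pi    : (type body : F) → TCExpr
  | letE  : (type value body : F) → TCExpr
  | lit   : Nat → TCExpr
  | proj  : (idx : Nat) → (struct : F) → TCExpr
```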
Those inductives have the same shapes as the IR.<...>Anon types. The main difference is that here we use Poseidon hashes instead of other hashing algorithms. This will keep the complexity of encoding things as Lurk does restricted to where it's really needed. Keep in mind that encoding data and generating hashes like Lurk does is expensive and consumes a lot of memory, potentially GBs even for small/medium sized code bases. Important point: we don't need to produce this type of data for "meta" data... only for the "anon" portion of the IR data (we want the Lurk proofs to be agnostic to names defined in the original Lean code).

And this would be all the data that comes out of the content-addressing routine:
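A rough sketch of what that bundle could look like (placeholder names stand in for the elided lurk<...>Map and tc<...> fields, and association lists stand in for real maps); the actual pieces are described right below:

```lean
-- Stand-ins for types used elsewhere in these sketches.
abbrev F := Nat          -- Poseidon field element
abbrev LDON := String    -- Lurk data node (placeholder representation)

-- Hypothetical bundle produced by a content-addressing pass.
structure ContentAddressingOutput where
  lurkCache    : List (LDON × F)    -- memoized Poseidon hashes for repeated LDON nodes
  lurkConstMap : List (F × LDON)    -- placeholder for one of the lurk<...>Map fields (builds the final Lurk code)
  tcConstMap   : List (F × String)  -- placeholder for one of the tc<...> fields (feeds the Lean typechecker)
```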
Yes, that's a lot of data! But it can be reused across different content-addressing rounds via FS persistence.
- Yatima.Store is an ever-growing data structure that can be used to speed up consecutive content-addressing runs
- Yatima.Env refers to a unique content-addressing pass
- lurkCache speeds up the Poseidon hash generation for repeated LDON nodes
- lurk<...>Map is what we would use to build the final Lurk code. While we can't bootstrap stores from the FS with lurk-rs, our Lurk code will look like:
- tc<...> data is used to generate the data that the typechecker consumes (in Lean!)

The CLI API we want looks like this:
Revamping primitive CIDs detection
Currently, the detection of primitive operations is done in the content-addresser, which makes the outcome (their indices in the array of constants) dependent on the source being content-addressed. This approach also forces us to know such hashes before we even start content-addressing arbitrary Lean code.
We should leave this identification as a self-contained task in the typechecker. We can keep the match-by-hash strategy with the help of the new yatima pp command.
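For illustration, a minimal sketch of the match-by-hash idea (the PrimOp names, the table contents and the function are placeholders): a constant is recognized as a primitive when its hash appears in a fixed table, independently of its index in any particular store.

```lean
-- Hypothetical enumeration of primitive operations the typechecker cares about.
inductive PrimOp
  | natAdd | natMul | natBeq
  deriving Repr

-- Look a constant up by its hash in a fixed table of known primitives.
-- The comparison goes through `.data` to avoid relying on a `BEq ByteArray` instance.
def detectPrim (primTable : List (ByteArray × PrimOp)) (hash : ByteArray) :
    Option PrimOp :=
  (primTable.find? fun (h, _) => h.data == hash.data).map Prod.snd
```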
Final thoughts
- the .yatima_store idea to persist the Yatima.Store in the FS (using LightData)
- LDON working fully in lurk-rs and mirrored in Lean 4. So that's a priority!