incremental checking - Githubissues

ivoysey commented 4 years ago

To allow us to use cooltt for developments of any kind of substantial scale, we need to have a support for incremental checking.

Comments and discussion below; rough draft collecting them together to be found in https://github.com/RedPRL/cooltt/blob/inc-checking/doc/ic-design.md.

RobertHarper commented 4 years ago

All fine, but we need more technical meat. My suggestion is to consider the design document for the TILT compilation manager as a general sort of inspiration (that's http://www.cs.cmu.edu/~rwh/papers/smlsc/mlw.pdf). At a high level I would suggest first of all a "declarative" formulation of incremental checking that specifies how components are processed into an internal form that needs to be defined, but must certainly admit a notion of identity. Then this can be augmented to a formulation that makes the cacheing explicit, in such a way that it should be possible to prove that the cacheing mechanism is correct in that it produces the same results as would the simple declarative form, albeit in a way that makes use of the cache (fundamentally, identity) to avoid redoing work that is "already done". A key ingredient would be assessing the validity of a component in terms of the validity of the components on which it depends, at the leaf level being a simple, direct up-to-date check between the source and "compiled" (elaborated, checked) form.

This is all by way of getting a serious formulation started. You guys should please chime in to help formulate what this would look like. Doing this form of specification was very helpful for the TILT implementation quite some time ago.

favonia commented 4 years ago

Yes, it will be great to see more details. We did not carefully formalize the redtt code on paper, but the most relevant parts are:

frontend/GlobalEnv
frontend/ResEnv

in frontend/Contextual. Semantically, these two form a mapping from names to internal forms, the most important part of an elaboration context. The other parts are to remember which files are already in the memory, etc.

The lib/ElabEnv in cooltt plays the role of the elaboration context as frontend/Contextual does in redtt.

ivoysey commented 4 years ago

OK! I will focus on getting something formal on paper in the style of the TILT paper ASAP, then we can iterate on that until we're happy, and then go from that to an implementation.

ivoysey commented 4 years ago

@cangiuli is this a fair summary of your idea we talked about in the winter?

i think that in the language of this paper, in practice, we're really just initially going to have incremental computation not separate compilation. at least today, the language of cooltt doesn't have a notion of explicit interfaces---therefore the interface of every unit must be inferred from its source rather than from an interface.

you and i had talked about restricting access to the internals of another unit (having abstraction) to limit the depth of normalization; that would add interfaces, and therefore at least some form of a module system.

i believe that's something that i should be careful to allow growth for, but not really worry about right now?

RobertHarper commented 4 years ago

The TILT design accommodates both incremental recompilation and separate compilation, and is thus much more than would be required here. The difference lies in exactly whether interfaces are derived or specified. I would agree that what we want immediately is incremental recompilation. I cite the TILT design as a model for how one might want to specify what is going on, and provide a framework for analyzing whether it is doing the right thing or not.

RobertHarper commented 4 years ago

It would be a fun project to think about what separate compilation might look like in this setting, but that's not on the critical path.

BTW, Appel wrote a series of papers called "Smart Compilation", "Smarter Compilation", and, you guessed it, "Smartest Compilation" that are all germane. I recall the term "cutoff compilation" being used around that time too, perhaps by Shao, but I don't remember right now. Obviously these were all in the context of SML compilers, but the general ideas are similar, and the semantic methods seem applicable, if considerably simpler, here.

ivoysey commented 4 years ago

@RobertHarper That sounds like the impression I had, but I'm glad to confirm it. I just don't want to set myself up so that if, in a month or two, we decide that we want to consider SC in this setting (possibly as a means to the end Carlo had in mind), then I have to overhaul something because what I did now wasn't extensible to a medium term goal we knew we had.

I'm not looking to get caught up in figuring out SC right now, but I will try to leave the door open to it, and certainly agree that IC is the first step.

cangiuli commented 4 years ago

@ivoysey In the winter I was talking about possible refinements of what can be hidden at an interface boundary. Part of why we want to look over your design proposal before you start is to make sure it is reasonably compatible with some ideas we might have in the future, including this. I believe you can safely ignore this question for now.

Something that will come up is that cooltt doesn't have an import mechanism (#68) yet. I imagine we will start with a very rudimentary one, as in redtt. Nevertheless, because incrementality comes up at the import boundaries, you may want to make some decisions about how importing should work, and you will have to think about how your mechanism will handle, say, (1) importing multiple files, (2) an imported file itself importing files, (3) the diamond problem. Feel free to include assumptions or requests about the import mechanism in your design as necessary.

RobertHarper commented 4 years ago

Yeah, I intended to mention the diamond problem, when multiple components import the same component. This is pretty much the common case, libraries being what they are.

ivoysey commented 4 years ago

@cangiuli Yeah, I made a reference for myself to the RedTT issue with thoughts from @favonia on what a very powerful import syntax might look like (I think that's https://github.com/RedPRL/redtt/issues/449). I don't really know if that's where we want to go with cooltt but it seems like it ultimately might be.

I think the assumption that I should make for now is basically "there is an import mechanism!". It happens to be the case that this assumption is not satisfied right now, but that's kind of fine. The purpose of such a mechanism from the point of view of designing an IC mechanism is that it somehow specifies which other units are dependencies, but I think the actual syntax of how it works is a little bit out of scope / can be treated as a black box.

In terms of implementation (i.e. #68) I think that to get off the ground it makes sense to start off with something that's very very simple and allows not much more than specific definite references like import Nat for a unit called Nat that contains a bunch of definitions about natural numbers. As long as that's added to the code base in an elegant way, I think it should be easy to extend the language of imports to include more expressive over time as we want to.

cangiuli commented 4 years ago

Yes, but the details of imports can affect your design. For example, what are the names of the imported identifiers? (Are they qualified?) Do we need to keep track of whether an identifier comes from this file or an imported file? Re: the diamond problem, do we need to make sure that all references to an imported identifier are judgmentally equal? What happens if identifiers are shadowed?

Some of the answers may be clear, but there is a lot to consider.

RobertHarper commented 4 years ago

The issue of naming imports can't be completely ignored. The difficulty is how to name the thing you are importing? A very naive approach is to use file names, which is a very bad idea. Another bad idea (used in OCaml against our criticism, perhaps because of it) is to assume that unit names are file names or can be easily mapped back and forth. No. Not even close. Forget about it. What we did was to have a map file that associates abstract names, say Nat, to concrete names, say "lib/nat.cooltt". This is at least not insane. In this setup all matters of hierarchy or versioning would be handled by the map file, and that's not too bad. Maybe later, but not now, some internal form of hierarchy could be considered.

My main point is that all of this requires careful thought. We can and should be simple, but we should not be simplistic, or make sweeping assumptions about "something else handles this problem". No. We should consciously decide on what we are doing and make that explicit.

RobertHarper commented 4 years ago

(Begin diatribe)

Another way to put the matter of the diamond import is that the very definition of what is a diamond makes implicit use of a notion of equality. The diamond diagram doesn't reveal this, because it just points two arrows to one node. But what is meant by "one node"? Well, to be accurate, it's "two nodes that are equal".

This mode of thinking is essential, and is why I absolutely despise handwavy diamond diagrams and suchlike that are so beloved by so many people. They do not meaningfully address the problems; what you need is semantics.

(End of diatribe.)

ivoysey commented 4 years ago

Ideally, I would like to be both divorced from the madness of the file system and also not require a user to recapitulate a bunch of information that is more or less already present.

A mapping file certainly breaks the link to the file system, but it also can create maintenance overhead because in a well-organized development a lot of it isn't going to be very surprising. It's nice to have that overhead localized to the mapping file rather than in each source file when something changes in the layout on disk, but it's a sausage squeezing exercise to some extent: source code lives in files on disk before it's parsed, and we are always going to need to know where to find it eventually.

I'm not exactly sure how to do this, or if both are possible at the same time, but this is my current tentative proposal:

I want to have a new syntactic object of unit paths (name very negotiable; I'm really trying to avoid the word "module") given by p ::= s | s.p. s here is a string that's restricted to some reasonable language -- obviously it can't contain . or probably .., \, and / as well; [A-Za-z0-9] would be safe if also a little boring.

For now, we would default to resolving a unit path in import a.b.c as the file system path a/b/c.cooltt relative to the root of the library. This lets us refer indirectly to just the subset of file system paths that make sense -- don't have bizarre .. cycles in them, etc. -- and also push the details of dealing with delimiters and other OS specific aspects to libraries in the implementation. This is a specific instance where I feel confident about deferring to such libraries rather than rolling our own.

We can do something in the future that's a little bit more sophisticated, e.g. add support for a mapping file that takes precedence over the default resolution as a safety mechanism for edge cases that are outside the common use for some unanticipated reason.

I talked to @cangiuli a little bit about this and as a result of that discussion I believe this is basically what was done successfully in RedTT. @RobertHarper does this strike you as a reasonable balance, or is it too naïve?

RobertHarper commented 4 years ago

Some concerns:

File system naming conventions vary between OS's, between versions, between settings. For example it can be that Nat = NAT = nat in a file system, even if we don't consider them the same. And who knows what happens with "special characters" or Unicode.
It's not clear to me that we want hierarchies within the checker's purview or not. I can see arguments either way. For immediate purposes I would think a "flat" namespace internally is just fine, and a mapping file would manage directory structure and versioning, say.
In some sense what we are doing is implementing variables using assignables, for the sake of efficiency. Internally I might think of Nat as a particular unit with particular components, and it is a nice mathematical "thing". Externally, what is bound to the variable Nat is determined by a mutable reference whose contents can "change", requiring rebinding of variables and corresponding reprocessing. I'm not sure what this all implies for the question, but it's likely important to distinguish labels (component names) from variables (bound to units) from assignables (with mutable contents).

I'm open to discussion.

favonia commented 4 years ago

I think we should focus on the necessary changes to the elaboration context, using the simplest mapping between files and units as a start. The design document is to help organize thoughts, not a reason to delay coding.

cangiuli commented 4 years ago

I think this is a separable part of the design -- as long as we have an abstract notion of "unit paths" and a way of mapping them to files, we can always refine the possibilities later. For example, I think we could get some distance with a very simplistic design: a flat namespace with lowercase alphabetic names that are mapped 1-1 to files in the same directory with .cooltt suffix.

We can continue to discuss this topic, of course, but I think it is much more important that we start to address how the elaborator is affected by the incremental compilation, as @favonia says.

ivoysey commented 4 years ago

That sounds good to me. I agree that this is separable and not the main question of the design or a major road block to getting some code written; I mostly wanted to get folks' thoughts on it because it came up and I know it can be a real thorn in the side.

ivoysey commented 4 years ago

I have made myself a branch that will be what I eventually make a PR against master with. For now the interesting thing is that it includes a markdown file I'm working on to collect my thoughts: https://github.com/RedPRL/cooltt/blob/inc-checking/doc/ic-design.md I hope that ASCII and Unicode art will be good enough for working now, but I plan to TeX this into a nice summary once it's all figured out (and, honestly, maybe once it's implemented too).

Please understand that this is a very rough draft with big holes in it that I'm actively working on. If you skim it and think to yourself "this is idiotic madness!" -- you are probably right! I'm certainly open to comments on it if anyone feels like it, but please do not think that it's at a state where it really is something worth a ton of other people's time to edit and refine yet.

I've edited the text at the top of this issue to reflect this. We can discuss here, or in a long-life PR, or in commit comments or whatever; it's all more or less the same to me.

favonia commented 4 years ago

Let's split this issue into "import" and "serialization".

RedPRL / cooltt

incremental checking #110