I see how to do this. Proceeding in this direction might also allow us to parallelize some aspects of compilation, which would be nice because compilation is currently quite slow. Here is a rough sketch:
Currently, we have one notion of 'identityHash', where we walk a TP object graph and hash everything that we would expect the compiler to look at in that object, including the source code for all functions. This lets us state that two functions with the same identityHash will compile to identical code, since the hash covers all the information the compiler could possibly look at. The compiler uses this to build its compiler cache: when we see a function, we first compute the identityHash of it and its argument signature and look for that hash in the cache; if it's there, we know we can safely load the code we previously compiled.
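For illustration, here's a minimal toy sketch of an identityHash-keyed cache lookup (not the actual typed_python implementation; `identity_hash`, `compile_function`, and `compile_or_load` are hypothetical stand-ins, and the real hash also walks referenced globals and called functions):

```python
import hashlib

_compiled_cache = {}  # identity hash -> previously compiled artifact

def compile_function(func, arg_types):
    # Stand-in for the real LLVM compilation step.
    return f"<compiled {func.__name__}({', '.join(t.__name__ for t in arg_types)})>"

def identity_hash(func, arg_types):
    # Hash everything the compiler could look at: here just the bytecode
    # and the argument types, for brevity.
    h = hashlib.sha256()
    h.update(func.__code__.co_code)
    h.update(",".join(t.__name__ for t in arg_types).encode())
    return h.hexdigest()

def compile_or_load(func, arg_types):
    key = identity_hash(func, arg_types)
    if key not in _compiled_cache:
        _compiled_cache[key] = compile_function(func, arg_types)
    return _compiled_cache[key]  # same hash -> safe to reuse compiled code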
Now, imagine you have an entrypointed function 'f(x) -> int' and you change its body from 'return g(x)' to 'return h(x)'. The crucial point is that when we consider the fully typed graph of functions, nothing upstream needs to change. At the LLVM level, of course, the definition of 'f' has changed completely. However, if we had compiled a collection of functions that call 'f', it should be possible to recompile only 'f', and then link the upstream functions to the compiled version of 'f', which is possible because its function signature has not changed.
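Concretely, using typed_python's @Entrypoint decorator (a minimal illustration; the relinking behavior described here is the proposal, not current behavior):

```python
from typed_python import Entrypoint

@Entrypoint
def g(x: int) -> int:
    return x + 1

@Entrypoint
def h(x: int) -> int:
    return x * 2

@Entrypoint
def f(x: int) -> int:
    # Editing this body to 'return h(x)' changes f's compiled code, but not
    # its signature 'f(int) -> int'.
    return g(x)

@Entrypoint
def caller(x: int) -> int:
    # 'caller' was compiled against f's signature; in the scheme described
    # above it could be relinked to the new f without being recompiled.
    return f(x) + 1
```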
Of course, if we do that directly, it wouldn't be possible to inline 'f' into its callers. To solve this, we need to allow the system to execute in two phases. In the first phase, it needs to break the graph apart into bundles of code that could in theory be linked together if we wanted to - imagine a set of maximally split-up graph components which link together along the boundaries of functions whose type signatures don't depend on their bodies. In the second phase, we walk over the graph nodes and decide what to inline.
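A toy sketch of the first phase, with hypothetical data structures: 'call_graph' maps function names to the names they call, and 'stable' is the set of functions whose type signatures don't depend on their bodies. (Phase two, inlining, is sketched further below.)

```python
def linkable_components(call_graph, stable):
    """Group functions into components, cutting every call edge that
    crosses into a function with a stable, body-independent signature."""
    components, seen = [], set()
    for root in call_graph:
        if root in seen:
            continue
        group, stack = set(), [root]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            for callee in call_graph.get(node, ()):
                # Don't pull a stable-signature callee into this component;
                # it stays behind a link boundary and forms its own component.
                if callee not in stable:
                    stack.append(callee)
        components.append(group)
    return components

# e.g. linkable_components({"caller": ["f"], "f": ["g"], "g": []}, stable={"f"})
#      -> [{"caller"}, {"f", "g"}]
```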
To implement this, we'll need to build an alternative notion of 'identityHash' called 'signatureHash'. The point of 'signatureHash' is to provide a hash like identityHash, except that it doesn't look inside the bodies of fully-typed functions. Two objects with the same signatureHash should produce the same portion of the fully-typed compiler graph, up to any boundary where we cross into a function whose signature was explicitly specified.
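A hypothetical sketch of how signatureHash could differ from identityHash ('signature_hash' and 'fully_signed' are made-up names; the real walk would cover much more than bytecode and globals):

```python
import hashlib
import inspect

def signature_hash(func, fully_signed, _seen=None):
    """Hash a function's name and signature; only recurse into the body if
    the function is not 'fully signed'."""
    _seen = _seen if _seen is not None else set()
    h = hashlib.sha256()
    h.update(func.__qualname__.encode())
    h.update(str(inspect.signature(func)).encode())
    if func not in fully_signed:
        # Not fully signed: the body matters, so hash it and recurse into
        # the functions it references.
        h.update(func.__code__.co_code)
        for name in func.__code__.co_names:
            callee = func.__globals__.get(name)
            if callable(callee) and callee not in _seen:
                _seen.add(callee)
                h.update(signature_hash(callee, fully_signed, _seen).encode())
    return h.hexdigest()
```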
In order for a function to be considered 'fully signed', we need a way of uniquely identifying it. This unique identifier will determine how we link two portions of the graph. For instance, a named, module-level function can be identified by the name of the function and the module; similarly for a method of a named class, or something coming out of a TypeFunction. In principle, we could give unnamed lambda functions a unique id that's a function of their signature and the scope in which they were defined, in a manner similar to how we compute qualified names. The crucial point is that we need an identifier that can be used across program invocations and that is reasonably stable under editing.
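A sketch of that kind of stable identifier, under the assumption that module and qualified names are stable across edits ('stable_identifier' is a hypothetical helper, not an existing API):

```python
import inspect

def stable_identifier(func):
    if func.__name__ != "<lambda>":
        # Named module-level functions and methods of named classes:
        # module plus qualified name is stable across program invocations.
        return f"{func.__module__}.{func.__qualname__}"
    # Unnamed lambdas: fall back to the defining scope plus the signature,
    # which is reasonably (but not perfectly) stable under editing.
    scope = func.__qualname__[:-len("<lambda>")]
    return f"{func.__module__}.{scope}<lambda>{inspect.signature(func)}"
```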
Once we have a proper signature hash and stable identifiers, we can build a type-inference model that fleshes out the graph up to stable function signatures. Given a function and a collection of argument types (a signature), we can follow the graph out to the boundaries formed by the current stable signatures, and cache the result. This code will be equivalent to the definitions we already use, except that functions will be named (and therefore linked) according to their signature hashes. This means that a given linker name can refer to different code in two different codebases.
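For example, a linker symbol could be derived from the hypothetical stable_identifier and signature_hash sketched above (the naming scheme here is made up):

```python
def linker_symbol(func, fully_signed):
    # For a fully-signed function, signature_hash covers only its name and
    # signature, so callers keep linking against the same symbol when only
    # its body changes. Conversely, the same symbol can name different code
    # in two different codebases.
    return f"tp.{stable_identifier(func)}.{signature_hash(func, fully_signed)[:16]}"
```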
In our second pass, after we've identified the fully-typed bodies of all reachable nodes in the current program, we can perform an inlining calculation, gradually enlarging groups of code to include the bodies of small functions. This can also be cached as a function of the hash of the information used to make the decision, so that it's local.
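A toy sketch of caching an inlining decision on a hash of exactly the information used to make it (the heuristic and names are hypothetical):

```python
import hashlib
import json

_inline_decisions = {}

def should_inline(callee_symbol, callee_size, threshold=40):
    # Key the cached decision on exactly the inputs used to make it, so the
    # decision is local: changing unrelated code doesn't invalidate it.
    key = hashlib.sha256(
        json.dumps([callee_symbol, callee_size, threshold]).encode()
    ).hexdigest()
    if key not in _inline_decisions:
        _inline_decisions[key] = callee_size <= threshold  # toy heuristic
    return _inline_decisions[key]
```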
Now that we have a complete graph of definable nodes with full function bodies, we can proceed as we normally do. A simple implementation can use the standard linker, so that we simply load the relevant shared object libraries. This should be fine as long as we are careful to ensure that each shared object contains a single linker entry.
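The simple implementation amounts to dlopen-ing one shared object per component via the ordinary dynamic linker; for example (paths and symbol names hypothetical):

```python
import ctypes

def load_compiled_component(so_path, symbol):
    # Let the standard dynamic linker do the work: load the shared object
    # and pull out its single exported entry point.
    lib = ctypes.CDLL(so_path)
    return getattr(lib, symbol)

# e.g. entry = load_compiled_component("tp_cache/tp_fn_ab12cd34ef567890.so",
#                                       "tp_fn_ab12cd34ef567890")
```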
A better implementation would actually allow us to link functions ourselves, so that we can control how functions are linked. This would have the added benefit of giving us the ability to re-link the graph in the future as new definitions become available. For instance, we might be able to re-optimize a function with additional profile-guided optimizations.
We should be able to get the compiler cache to decompose compilation and SHA hashing in a way that allows us to modify the code of a function without having to recompile everything upstream of it.
It's not clear to me exactly how to do this, but it bears thinking about, because compilation is so expensive. Perhaps it needs to be opt-in via a decorator (this would be much easier, since we'd know to dynamically link to the function).
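Something along these lines, perhaps; 'DynamicallyLinked' is a hypothetical marker, not an existing typed_python API:

```python
def DynamicallyLinked(func):
    # Hypothetical opt-in flag: marks a function as safe to call through a
    # stable linker symbol instead of being inlined and rehashed into every
    # caller.
    func.__tp_dynamically_linked__ = True
    return func

@DynamicallyLinked
def f(x: int) -> int:
    return x + 1
```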