Antiquoted C Compilation Performance Optimisation

Description

Cogent's type system is structural, meaning that two types that are structurally the same are considered definitionally equal. Types don't have names, as in a nominal type system; type synonyms are just a syntactic shorthand for the types they represent, and have no semantic significance.

All the backend languages (Isabelle/HOL, C, Haskell) of Cogent have nominal type systems. Thus we need to generate names for these types. If we generate stable names according to the structure of the types, they will be very lengthy names (e.g. for a BilbyFs data structure the name can be ~5000 characters long). We want to have short and stable names.

Currently, the generated names are not stable. The Cogent compiler creates new names for types, monomorphised functions, etc. as it traverses the syntax tree. It means that, if the definitions get reordered, or other minor changes are made, the name mapping could potentially be different. The antiquoted C mechanism is for rectifying this discrepancy. Unfortunately, it makes the compilation process really computationally heavy and takes long time to complete. Also Isabelle and Haskell embeddings do not benefit from antiquoted C, meaning that if the names change, the proof engineer will have to modify the (manually written) proofs/code manually.

Our current solution to this problem is, we create a "name cache" for a Cogent program the first time it is compiled, and subsequent compilations will just fetch names from the cache. In this way the names are relatively stable, as long as the cache file is kept around.

Once stable names can be generated, antiquoted C can be greatly simplified to a name lookup, as opposed to a full compilation, as it's currently doing, thus greatly improves the performance of the compiler.

Progress

[x] If the --name-cache=<FILE> argument is given, every time the compiler generates code, it will use <FILE> as the name cache (create it if it doesn't exist). It will guarantee that, if the same cache file is used, then the name generation will be stable.
[ ] That alone doesn't solve the performance problem though. The next step will be to optimise the antiquoted C compilation. When possible, it firstly tries to fetch the name from the cache. If it fails for whatever reason (e.g. it's a polymorphic type) then it defaults to the old compilation process. But even in the cache hit case, it will have to first go through the surface typechecker and desugarer, which are the most heavy phases, resulting in insignificant performance gain.
[ ] A minor problem with the name cache is that, it only adds more entries to it, which means if the same cache is used over and over, it's likely that it will generate C types that are no longer needed and end up with redundancy. A clean up function should be implemented to clean up unused C types.
[ ] It may be desirable to implement an antiquoted-C compilation function, which uses the name cache to substitute in C names (without compiling a Cogent program). Since the names are stable, development can be done on top of this processed .ac file. The caveat is that, the .ac files will become a giant one comprised of all the individual files, as parsing requires C files to be preprocessed (so that the type declarations are known to the parser). The problem with this approach is that, the type antiquotes still need to go through the parsing, surface Tc and desugaring phases. The name cache itself doesn't have enough information for doing all these. They still require the expensive state passing operations.
[ ] Instead of parsing the .ac files, it'll be better to simply tokenise them, which is more lightweight. It just means that we'll have less structural information about the C templates, which may require a more clever algorithm to deal with the antiquotes.

Experiments

I managed to run profiling, and the source of the long compile time is mostly (over 90%) from the traversal of the C syntax tree while processing antiquoted C files using generic programming (SYB). It's also noted by the https://michaeldadams.org/papers/tyb/tyb-2012-haskell.pdf paper that SYB can be 10-100x slower than handwritten code. But since the language-c-quote library (https://hackage.haskell.org/package/language-c-quote-0.12.2.1/docs/Language-C-Syntax.html) has Data instances defined, I don't see an engineering cheap way moving to a different generic library. Hand crafting the traverse function is in general a bad idea, after all that's why people have invented generic programming. However if we plan to move away from parsing, and use tokenisation instead, then manually writing a traverse function will be easy, as it has only one type Token. It will require a careful redesign of the syntax and substitution semantics about the antiquotes. We might need to write our own tokeniser instead, as we define new syntax. (@zilinc 25/11/19)

Investigation

The current implementation traverses the AST multiple times, based on the antiquotes it's working on. For each different type of antiquotes, it's done in a separate traversal. This way the code is very clean and principled, but very inefficient. For better performance, we need to find a way to "fuse" these traversals. A naive fused version would be a handwritten one which handles different things all at once (will be terribly ugly though). I personally don't know if syb or other packages are capable of doing it "for free". The tokenising-only approach natively resolves it because it has only one datatype to traverse over.

Resolution

A new C AST traversal scheme has been implemented as of adf02fac31da92, and results in a 7-10x speed up and around 1/7 RAM usage. User interfaces of antiquoted C code remain the same.

Related Tickets

Antiquoted C: #318 #286

au-ts / cogent