au-ts / cogent

Cogent Project
https://trustworthy.systems/projects/TS/cogent.pml
Other
158 stars 26 forks source link

Antiquoted C Compilation Performance Optimisation #322

Open zilinc opened 4 years ago

zilinc commented 4 years ago

Description

Cogent's type system is structural, meaning that two types that are structurally the same are considered definitionally equal. Types don't have names, as in a nominal type system; type synonyms are just a syntactic shorthand for the types they represent, and have no semantic significance.

All the backend languages (Isabelle/HOL, C, Haskell) of Cogent have nominal type systems. Thus we need to generate names for these types. If we generate stable names according to the structure of the types, they will be very lengthy names (e.g. for a BilbyFs data structure the name can be ~5000 characters long). We want to have short and stable names.

Currently, the generated names are not stable. The Cogent compiler creates new names for types, monomorphised functions, etc. as it traverses the syntax tree. It means that, if the definitions get reordered, or other minor changes are made, the name mapping could potentially be different. The antiquoted C mechanism is for rectifying this discrepancy. Unfortunately, it makes the compilation process really computationally heavy and takes long time to complete. Also Isabelle and Haskell embeddings do not benefit from antiquoted C, meaning that if the names change, the proof engineer will have to modify the (manually written) proofs/code manually.

Our current solution to this problem is, we create a "name cache" for a Cogent program the first time it is compiled, and subsequent compilations will just fetch names from the cache. In this way the names are relatively stable, as long as the cache file is kept around.

Once stable names can be generated, antiquoted C can be greatly simplified to a name lookup, as opposed to a full compilation, as it's currently doing, thus greatly improves the performance of the compiler.

Progress

Experiments

I managed to run profiling, and the source of the long compile time is mostly (over 90%) from the traversal of the C syntax tree while processing antiquoted C files using generic programming (SYB). It's also noted by the https://michaeldadams.org/papers/tyb/tyb-2012-haskell.pdf paper that SYB can be 10-100x slower than handwritten code. But since the language-c-quote library (https://hackage.haskell.org/package/language-c-quote-0.12.2.1/docs/Language-C-Syntax.html) has Data instances defined, I don't see an engineering cheap way moving to a different generic library. Hand crafting the traverse function is in general a bad idea, after all that's why people have invented generic programming. However if we plan to move away from parsing, and use tokenisation instead, then manually writing a traverse function will be easy, as it has only one type Token. It will require a careful redesign of the syntax and substitution semantics about the antiquotes. We might need to write our own tokeniser instead, as we define new syntax. (@zilinc 25/11/19)

Investigation

The current implementation traverses the AST multiple times, based on the antiquotes it's working on. For each different type of antiquotes, it's done in a separate traversal. This way the code is very clean and principled, but very inefficient. For better performance, we need to find a way to "fuse" these traversals. A naive fused version would be a handwritten one which handles different things all at once (will be terribly ugly though). I personally don't know if syb or other packages are capable of doing it "for free". The tokenising-only approach natively resolves it because it has only one datatype to traverse over.

Resolution

A new C AST traversal scheme has been implemented as of adf02fac31da92, and results in a 7-10x speed up and around 1/7 RAM usage. User interfaces of antiquoted C code remain the same.

Related Tickets

Antiquoted C: #318 #286