ForNeVeR opened 2 years ago
I would suggest having the following compilation stages:
Not sure about the optimizations, though. It might be wise to add a special step between stages 2 and 3 which will fold constants, eliminate dead code, unroll loops, etc. - or it might be simpler to just do that kind of processing directly in the "emit" stage.
The visitor pattern would probably not help with this: it's a good solution for the "expression problem" when the set of nodes is finite and the set of operations is not (e.g. can be extended by the user). Since we do not intend to provide any kind of API for compiler plugins, a simple OOP-based solution will work just fine.
I agree with @impworks. There are also some points that are very important for me.

Other parts which I think are important with regard to the CST:

Basic blocks have some parts which I have not thought through (nested scopes and labels inside them), but that's not super important for now. Other complicated transformations as separate phases are also something to keep in mind and work towards. I believe the first priority should be the separation of Cecil and the AST. Explicit type resolution of C types is also very important; NamedType is an example of how intertwined things are right now.
That's the currently proposed pipeline, reusing the existing infrastructure.
```mermaid
graph TD;
    Preprocessor(Preprocess code)
    Lexems(Lexems split)
    SyntaxTree(Syntax tree building)
    BasicBlock(Basic block extraction)
    StatementLinearization(Statement linearization)
    ConstantFolding(Fold constants)
    Lowering(Convert syntax tree to low-level syntax tree)
    Preprocessor-->Lexems;
    Lexems-->SyntaxTree;
    SyntaxTree-->TypeResolution;
    subgraph SyntaxCheck ["Syntax check"]
        TypeResolution-->IdentifierResolution;
    end
    subgraph Transformations ["Transformations"]
        IdentifierResolution-->StatementLinearization-->ConstantFolding-->BasicBlock-->Lowering;
    end
    Lowering-->Emit;
```
This part gives us clean C code without any preprocessor directives. Right now this is served by CPreprocessor.
This stage is basically Yoakke, but I think we should merge the tokens for the preprocessor and for the compiler.
Creation of the syntax tree. This is handled for us mostly by Yoakke and stored in CParser.
This is a part that is not very well thought out on my part; the type resolution and identifier resolution steps are there just to be explicit that we need these processes, and the current order does not indicate that they would happen in exactly that order. @impworks seems to be in a better position to specify the order of these things.
This stage should remove all complicated statements like `if`/`while`/`switch` etc. and replace them with `goto`. That allows us to have the next step.
This allow us to have basic block over which we can perform reachability analysis for example.
Precalculate constant expressions. That's purely to show capabilities.
Here we should remove even more parts of the IR. For example, we can annotate pointers as pointing to the stack or to the heap, and remove subscript indexing from the tree. For example, I think that `ptr[10] = some_expression` is equivalent to:

```c
tmp_val = ptr + 10;
*tmp_val = some_expression;
```
That's the actual emit to IL; at this stage this should be something trivial.
The proposal looks good to me in general, but I'd like to leave some comments.
> This stage is basically Yoakke, but I think we should merge tokens for preprocessor and for compiler.
The standard keeps these tokens separate from each other. I think we should follow the standard (for easier verification, and for our pipeline to be "correct by design" as much as possible).
> I think we should follow the standard
I think we have a problem with this. Consider error reporting in the case of `#include` usage. Even now we miss important information about where each line comes from. The same goes for macro expansion: if a macro produces broken code, I would like to point to the place in the macro declaration where the error happens. That would be easier to do with tokens. Yoakke requires additional work to support our needs; it's probably up to us to contribute that support if it's not there now. I asked the author, and he agrees with me that tokens need to be tied to their source file location.
Also take a look at `#if DEBUG > 1`. Inside these `#if` constructions I can create arbitrary constant expressions. Do we really want two separate parsers for these constructions? A macro invocation is very similar to a function call.
If the shape of our AST doesn't follow the standard, then it is much harder to maintain and upgrade. So I would still stick to the standard instead of inventing something on our own, with new exciting possible parser conflicts.
I can't see how combining some parser or lexer tokens helps (or makes it harder) to provide exact source info for these tokens.
We should introduce source information for tokens and their origins, yes. But why/how does it limit us in keeping combined or separate ASTs for the preprocessor and C?
> Macro invocation is very similar to function call.
This is not a good example, in my opinion, since IIRC there are a lot of special rules about when macros are recognized as calls and when they aren't. `cpp` is a separate tool for a good reason.
I believe relying on `cpp` as evidence for how we should implement the preprocessor is bad for developers.
Consider this experiment.c file:

```c
#define MAIN printf("HELLO")
int main(void) { MAIN }
```

I deliberately omitted the `;` after `MAIN`.
CL output:

```
Cesium.IntegrationTests/experiment.c(3): error C2143: syntax error: missing ';' before '}'
```

Cesium output:

```
Unhandled exception. Cesium.Core.ParseException: Error during parsing Cesium.IntegrationTests/experiment.c. Error at position line 1, column 34. Got }.
```

As you see, both compilers provide obscure output, but CL is much better at pointing at the actual line with the error. Cesium is not only unhelpful, it is actually lying about the line number, because the preprocessor produces the following output:

```c
int main(void) { printf("HELLO") }
```
We completely lose the information about lines after preprocessing. I do not see a way to solve that without merging the parsers for the C and C preprocessor languages. Yoakke itself implements its preprocessor based on a shared token stream, even if the preprocessor implementation itself is slow-moving - for different reasons, obviously.
Your example demonstrates that we should have a way to preserve the source information after preprocessing, and I very much agree with that.
I still very strongly disagree with the idea of changing the grammar. We have a lot of hacks already, and it is very hard to reason about the parser: how does it correspond to the C standard? Does it have any parser conflicts? What should we change to migrate to C23 after it emerges?
And these questions will quickly become impossible to answer if we change the parser completely, inventing some unholy hybrid of C grammar and C preprocessor grammar. Moreover, I cannot see how it helps to achieve anything: this combined grammar won't have any source information embedded, either.
I believe that the preprocessor should generate some kind of annotated result (so we know from where each token comes). If this was (part of) your point, then I agree. Simple plain text output from the preprocessor won't work, and I agree on that, too.
Unfortunately, Yoakke isn't able to work with such token streams out-of-the-box, I believe. We may choose either to migrate to some other library (or a manual parser), or provide a bridge between the preprocessor-generated text and the source. One possible way to preserve the source information without changing Yoakke I imagine is the following:
the preprocessor returns a pair `(string, Dictionary<TextRange, SourceInformation>)`. Of course, there may be some peculiarities involved when we are trying to determine "the origin" of a C token created by the preprocessor (since the token may be glued together from several different macros), but that's in any case a question of our interpretation. Every position reported by the C parser can then be looked up in the `Dictionary<TextRange, SourceInformation>`, because at that point both text ranges (from the dictionary and from the C parser) operate on the same source: the text blob from the preprocessor output.

> If this was (part of) your point, then I agree.
That's the most important part for me. The second part which I try to express is that we have `IPeekableStream<CToken>`, and CPreprocessor is just a large `IPeekableStream<CToken> Map(IPeekableStream<CToken> source)` whose result gets fed into `CParser(IPeekableStream<CToken> source)`. But I will back this option only if it's very easy and does not add cognitive load.
> Unfortunately, Yoakke isn't able to work with such token streams out-of-the-box, I believe.
`SourceInformation` is hopefully coming to Yoakke: https://github.com/LanguageDev/Yoakke/pull/150#issuecomment-1236735148. My current understanding is that this would be enough for us. Yoakke helps us a lot, so why not contribute back? If something is missing, it's a good time to provide feedback if you think this is the proper direction.
> One possible way to preserve the source information without changing Yoakke I imagine is the following:
It seems we are on the same page, and the only difference is that I think we should fix Yoakke, while you consider workarounds around the existing Yoakke.
To me, it's not a big deal whether we do it in this repo or contribute to Yoakke. The latter is a bit more complicated because we'll need to invent abstractions useful for other people as well for ourselves. But still, doable.
So far, I am thinking about separating the stuff into approx. three levels of abstraction.
For now, we have a somewhat messed-up pipeline: the IR nodes do everything by themselves. Type checking, member resolution, lowering, codegen: we have everything everywhere. Right now, only parsing is properly isolated from everything else.
We should introduce more formal stages of compilation and maybe even create several different layers of IR. For example, we have certain constructs that have to be lowered (such as the `+=`-styled operators). We may get rid of these constructs in the "lowered IR layer" and thus decrease the amount of type checks required in the codegen.

I don't know how it should be organized, though. Just start from separating the IR layers, and everything else will click and fit into place?
Thoughts?
cc @kant2002, @impworks
When implementing, look for the number `201` in the source and try to eliminate every instance of that number.