ForNeVeR opened 2 years ago
I would suggest having the following compilation stages:
Not sure about the optimizations, though. It might be wise to add a special step between stages 2 and 3 which will fold constants, eliminate dead code, unroll loops, etc. - or it might be simpler to just do that kind of processing directly in the "emit" stage.
The visitor pattern would probably not help with this: it's a good solution for the "expression problem" when the set of nodes is finite and the set of operations is not (e.g. can be extended by the user). Since we do not intend to provide any kind of API for compiler plugins, a simple OOP-based solution will work just fine.
I agree with @impworks. There are also some points that are very important for me.

Other parts which I think are important with regard to the CST:

Basic blocks have some parts which I have not thought through (nested scopes and labels inside them), but that's not super important for now. Other complicated transformations as separate phases are also something to keep in mind and work towards. I believe the first priority should be the separation of Cecil and the AST. Explicit type resolution of C types is also very important; NamedType is an example of how intertwined things are right now.
That's the currently proposed pipeline, reusing the existing infrastructure.
```mermaid
graph TD;
    Preprocessor(Preprocess code)
    Lexems(Lexems split)
    SyntaxTree(Syntax tree building)
    BasicBlock(Basic block extraction)
    StatementLinearization(Statement linearization)
    ConstantFolding(Fold constants)
    Lowering(Convert syntax tree to low-level syntax tree)
    Preprocessor-->Lexems;
    Lexems-->SyntaxTree;
    SyntaxTree-->TypeResolution;
    subgraph SyntaxCheck ["Syntax check"]
        TypeResolution-->IdentifierResolution;
    end
    subgraph Transformations ["Transformations"]
        IdentifierResolution-->StatementLinearization-->ConstantFolding-->BasicBlock-->Lowering;
    end
    Lowering-->Emit;
```
This part gives us clean C code without any preprocessor directives. Right now this is served by CPreprocessor.
This stage is basically Yoakke, but I think we should merge the tokens for the preprocessor and for the compiler.
Creation of the syntax tree. This is handled for us mostly by Yoakke and stored in CParser.
This is a part that is not very well thought out on my part; the type resolution and identifier resolution steps are there just to be explicit that we need these processes, and the current order does not indicate that they would happen in exactly that order. @impworks seems to be in a better position to specify the order of these things.
This stage should remove all complicated statements like `if`/`while`/`switch` etc. and replace them with `goto`. That allows us to have the next step.
This allow us to have basic block over which we can perform reachability analysis for example.
Precalculate constant expressions. That's purely to show capabilities.
Here we should remove even more parts of the IR. For example, we can annotate pointers as pointing to the stack or to the heap, and remove subscript indexing from the tree. For example, I think that `ptr[10] = some_expression` is equivalent to:

```c
tmp_val = ptr + 10;
*tmp_val = some_expression;
```
That's the actual emit to IL; at this stage this should be something trivial.
The proposal looks good to me in general, but I'd like to leave some comments.
> This stage is basically Yoakke, but I think we should merge tokens for preprocessor and for compiler.
The standard keeps these tokens separate from each other. I think we should follow the standard (for easier verification, and for our pipeline to be "correct by design" as much as possible).
> I think we should follow the standard
I think we have a problem with this. Consider error reporting in the case of `#include` usage. Even now we miss important information about where each line comes from. The same goes for macro expansion: if a macro produces broken code, I would like to point to the place in the macro declaration where the error happens. That would be easier to do with tokens. Yoakke requires additional work to support our needs; it's probably up to us to contribute that support if it's not there now. I asked the author, and he agrees with me that tokens need to be tied to their source file location.
Also take a look at `#if DEBUG > 1`. Inside these `#if` constructions I can create arbitrary constant expressions. Do we really want two separate parsers for these constructions? A macro invocation is very similar to a function call.
If the shape of our AST doesn't follow the standard, then it is much harder to maintain and upgrade. So I would still stick to the standard instead of inventing something on our own, with new exciting possible parser conflicts.
I can't see how combining some parser or lexer tokens helps (or makes it harder) to provide exact source info for these tokens.
We should introduce source information for tokens and their origins, yes. But why/how does it limit us in keeping combined or separate ASTs for the preprocessor and C?
> Macro invocation is very similar to function call.
This is not a good example, in my opinion, since IIRC there are a lot of special rules about when macros are recognized as calls and when they aren't. `cpp` is a separate tool for a good reason.
I believe relying on `cpp` as evidence for how we should implement the preprocessor is bad for developers.
Consider this experiment.c file:

```c
#define MAIN printf("HELLO")
int main(void) { MAIN }
```

I deliberately omitted the `;` after `MAIN`.
CL output:

```
Cesium.IntegrationTests/experiment.c(3): error C2143: syntax error: missing ';' before '}'
```

Cesium output:

```
Unhandled exception. Cesium.Core.ParseException: Error during parsing Cesium.IntegrationTests/experiment.c. Error at position line 1, column 34. Got }.
```

As you see, both compilers provide obscure output, but CL is much better at pointing at the actual line with the error. Cesium is not only unhelpful, it is actually lying about the line number, because the preprocessor produces the following output:

```c
int main(void) { printf("HELLO") }
```
We completely lose the information about lines after preprocessing. I do not see a way to solve that without merging the parsers for the C and C preprocessor languages. Yoakke itself implements its preprocessor based on a shared token stream, even if the preprocessor implementation itself is slow-moving - for different reasons, obviously.
Your example demonstrates that we should have a way to preserve the source information after preprocessing, and I very much agree with that.
I still very strongly disagree with the idea of changing the grammar. We have a lot of hacks already, and it is very hard to reason about the parser: how does it correspond to the C standard? Does it have any parser conflicts? What should we change to migrate to C23 after it emerges?
And these questions will quickly become impossible to answer if we change the parser completely, inventing some unholy hybrid of C grammar and C preprocessor grammar. Moreover, I cannot see how it helps to achieve anything: this combined grammar won't have any source information embedded, either.
I believe that the preprocessor should generate some kind of annotated result (so we know from where each token comes). If this was (part of) your point, then I agree. Simple plain text output from the preprocessor won't work, and I agree on that, too.
Unfortunately, Yoakke isn't able to work with such token streams out-of-the-box, I believe. We may choose either to migrate to some other library (or a manual parser), or provide a bridge between the preprocessor-generated text and the source. One possible way to preserve the source information without changing Yoakke I imagine is the following:
the preprocessor returns a pair `(string, Dictionary<TextRange, SourceInformation>)`. Of course, there may be some peculiarities involved when we are trying to determine "the origin" of a C token created by the preprocessor (since the token may be glued together from several different macros), but that's in any case a question of our interpretation. Every position reported by the C parser can then be looked up in the `Dictionary<TextRange, SourceInformation>`, because at that point both text ranges (from the dictionary and from the C parser) operate on the same source: the text blob from the preprocessor output.

> If this was (part of) your point, then I agree.
That's the most important part for me. The second part which I try to express is that we have `IPeekableStream<CToken>`, and CPreprocessor is just a large `IPeekableStream<CToken> Map(IPeekableStream<CToken> source)` whose result gets fed into `CParser(IPeekableStream<CToken> source)`. But I will back this option only if it's very easy and does not add cognitive load.
> Unfortunately, Yoakke isn't able to work with such token streams out-of-the-box, I believe.
`SourceInformation` is hopefully coming to Yoakke: https://github.com/LanguageDev/Yoakke/pull/150#issuecomment-1236735148. My current understanding is that this would be enough for us. Yoakke helps us a lot, so why not contribute back? If something is missing, it's a good time to provide feedback if you think this is the proper direction.
> One possible way to preserve the source information without changing Yoakke I imagine is the following:
It seems we are on the same page, and the only difference is that I think we should fix Yoakke, while you consider workarounds around the existing Yoakke.
To me, it's not a big deal whether we do it in this repo or contribute to Yoakke. The latter is a bit more complicated because we'll need to invent abstractions useful for other people as well for ourselves. But still, doable.
So far, I am thinking about separating the stuff into approx. three levels of abstraction.
For now, we have a somewhat messed-up pipeline: the IR nodes do everything by themselves. Type checking, member resolution, lowering, codegen: we have everything everywhere. Right now, only parsing is properly isolated from everything else.
We should introduce more formal stages of compilation and maybe even create several different layers of IR. For example, we have certain constructs that have to be lowered (such as the `+=`-styled operators). We may get rid of these constructs in the "lowered IR layer" and thus decrease the amount of type checks required in the codegen.

I don't know how it should be organized, though. Just start from separating the IR layers, and everything else will click and fit into place?
Thoughts?
cc @kant2002, @impworks
When implementing, look for the number `201` in the source and try to eliminate every instance of that number.