congo-cc / congo-parser-generator

The CongoCC Parser Generator, the Next Generation of JavaCC 21, which in turn was the next generation of JavaCC
https://discuss.congocc.org/

The Transpilation Machinery really does need some reworking. #93

Open revusky opened 9 months ago

revusky commented 9 months ago

I'd like to bring your attention to this point in the current code. I rewrote the translateCodeBlock method there.

I mean, the way it was written (and there must be other cases of this) it really does give off a code smell, to put it mildly. The javaCodeBlock that is exposed to the template is actually an instance of org.congocc.parser.tree.CodeBlock. So what the code was doing was implicitly converting this to a string, and then instantiating a new parser instance to reparse it and...

Actually, I came across this because I inadvertently broke the code in my FreeMarker hacking, because the coercion of the code block variable into a string was broken for a while. (Basically the existing code worked because, if a Java method took a String parameter, the template engine would coerce the argument into a string, using the object's toString() method, and...) But anyway, then I looked at the code in question and was kind of shocked that it was getting flattened into a string and then reparsed into a tree, when it was a tree to start with! I mean, seriously, does that really make much sense?

Generally speaking, this code really does need to move more towards being a tree-to-tree transformation, rather than string-to-string. But, in particular, the congocc.jar already contains all the classes for the Python and C# ASTs. (This has been the case since early this year.) There are already all the nodes in org.congocc.parser.python.ast and org.congocc.parser.csharp.ast respectively. They are all empty classes, but you can inject functionality into them by editing the files PythonInternal.ccc and CSharpInternal.ccc respectively.

Well, I mean to say, this proposed approach, reusing the generated AST for the Python and CSharp grammars is bound to result in a much more elegant and appealing solution. I'm quite sure of this. In particular, this humongous method can surely be refactored to use the built-in tree-walking Node.Visitor API that we have. And, in any case, this would result in more overall code reuse.

Well, if you want me to have a go at it to sort of show the way, I could do that, I guess. But I think you have the wherewithal to get going on it. And if you have any questions, by all means. I think the transpilation thing is fairly promising and I have speculated that we could even make a separate standalone tool out of it, but I think we need to get things on a somewhat more solid basis, if you see what I mean...

vsajip commented 9 months ago

> I mean, the way it was written (and there must be other cases of this) it really does give off a code smell, to put it mildly.

Sure. I think the Joe Armstrong aphorism "Make it work, then make it beautiful" applies here - this is some of the earliest code I added to JavaCC21, before I really knew much about the existing code. I was working to prove the transpilation approach as quickly as possible.

> I mean, seriously, does that really make much sense?

Not really ... mea culpa. Probably I was floundering around in the code-base in those early days, and I never revisited it.

> But, in particular, the congocc.jar already contains all the classes for the Python and C# ASTs

Not sure that's relevant to transpilation - the job is to transform a Java AST in a grammar file (e.g. from an injection) into Python or C# code, which is optimally done using templates and doesn't involve parsing C# or Python code, so those ASTs aren't relevant as far as I can see.

> In particular, this humongous method can surely be refactored to use the built-in tree-walking Node.Visitor API that we have

No doubt, and I'll take a look. It was certainly quicker for me to use the top-down approach that I did at the time, as I'm not sure there are/were any non-trivial examples of using the Node.Visitor API. I'm familiar with the general technique, of course - it's just that my experience with other node visitor APIs has left me lukewarm about their productivity when you're feeling your way around, as I was with transpilation when this framework was originally put in place. Also there have been numerous changes to the Java AST over time, and it's definitely easier to see what's happening as you step through in a top-down fashion after something breaks.

vsajip commented 9 months ago

> And if you have any questions, by all means.

Given the existing signature of the visit method of Node.Visitor (having a void return type), how do you see tree-to-tree transformation working? Perhaps illustrate how you convert a MethodCall into an ASTInvocation as I do in Translator, but using the existing visitor API?

revusky commented 9 months ago

> And if you have any questions, by all means.
>
> Given the existing signature of the visit method of Node.Visitor (having a void return type), how do you see tree-to-tree transformation working? Perhaps illustrate how you convert a MethodCall into an ASTInvocation as I do in Translator, but using the existing visitor API?

Well, first of all (though it's maybe not the most important aspect of things) the individual visit(...) methods that you define actually can have a return value if you want. You see, the reflection machinery finds and invokes the handler method via the class and the parameter types. See here and you see that the return type is not used when fishing out the handler method, so your node handler can return a value. It is also true that, even if the method returns a value, that is only relevant in the cases where you explicitly call the method yourself. If you use the automatic recurse(node) sort of machinery, any return values get thrown away, since recurse(node) is effectively just syntactic sugar for:

   for (Node child : node) visit(child);

and, as you can see, any values returned by the visit(...) calls are simply ignored. This actually got me wondering whether it would be possible to make use of the return values, maybe with an internal stack where the return values are pushed and then your own code can pop them off and do something with them. But that is just some woolly-minded thinking on my part. I certainly haven't thought any of that through!
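To make the stack idea concrete, here is a minimal, self-contained sketch. It is not CongoCC's Node.Visitor: DemoNode, NumberNode, SumNode, StackingVisitor, SumVisitor and ReturnValueDemo are all hypothetical names invented for illustration, and the lookup is a simplified stand-in that only matches the node's exact class. It shows a reflective dispatcher that ignores the handler's return type when finding it, and pushes whatever the handler returns onto a stack so that a parent handler can pop its children's results after recursing.

```java
import java.lang.reflect.Method;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-ins for real node classes; not CongoCC's API.
class DemoNode {
    final List<DemoNode> children = new ArrayList<>();
    DemoNode add(DemoNode child) { children.add(child); return this; }
}
class NumberNode extends DemoNode {
    final int value;
    NumberNode(int value) { this.value = value; }
}
class SumNode extends DemoNode {}

// Handlers are looked up by name and parameter type only, so a handler
// may declare any return type. Non-null results are pushed on a stack.
abstract class StackingVisitor {
    final Deque<Object> results = new ArrayDeque<>();

    Object visit(DemoNode node) {
        try {
            Method handler = getClass().getDeclaredMethod("visit", node.getClass());
            handler.setAccessible(true);
            Object result = handler.invoke(this, node);
            if (result != null) results.push(result);
            return result;
        } catch (NoSuchMethodException e) {
            recurse(node);          // no specific handler: just walk the children
            return null;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    void recurse(DemoNode node) {
        for (DemoNode child : node.children) visit(child);
    }
}

class SumVisitor extends StackingVisitor {
    Integer visit(NumberNode n) { return n.value; }

    Integer visit(SumNode s) {
        recurse(s);                 // each child pushes exactly one result
        int total = 0;
        for (int i = 0; i < s.children.size(); i++) total += (Integer) results.pop();
        return total;
    }
}

class ReturnValueDemo {
    static int evaluate(DemoNode root) {
        SumVisitor visitor = new SumVisitor();
        visitor.visit(root);
        return (Integer) visitor.results.pop();
    }

    public static void main(String[] args) {
        DemoNode tree = new SumNode().add(new NumberNode(1))
            .add(new SumNode().add(new NumberNode(2)).add(new NumberNode(3)));
        System.out.println(evaluate(tree)); // prints 6
    }
}
```

The sketch relies on every child contributing exactly one result; a real design would need a rule for handlers that return nothing.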

So that is a vague thing that I just started thinking about today as a result of your comment. The really important point is that it is not really necessary for the visit handlers to return values. After all, your visitor subclasses Node.Visitor and can store state, like building up a tree and/or lookup tables, as the thing traverses the tree. For example, if you look at the Reaper, this rough and ready class (it has some significant warts actually) to get rid of unused fields and methods, it builds up the data it needs in the tree traversal, and then in the final stage, it whacks the elements that it can prove to itself are not being used. The various visit methods basically build up the data, like usedVarNames and so on, and then in the final step (see here) it whacks the various methods and fields that are superfluous, using the data that was built up in the tree traversal.
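The collect-then-act pattern described here can be sketched in a few lines. This is not the actual Reaper code: MiniNode, ClassBody, MethodBody, FieldDecl, NameUse and UnusedFieldReaper are hypothetical names, and a plain instanceof check stands in for the visitor dispatch. The point is only that the traversal returns nothing and the visitor's own fields carry the result.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical miniature tree; these are not CongoCC's node classes.
abstract class MiniNode {
    final List<MiniNode> children = new ArrayList<>();
}
class ClassBody extends MiniNode {}
class MethodBody extends MiniNode {}
class FieldDecl extends MiniNode {
    final String name;
    FieldDecl(String name) { this.name = name; }
}
class NameUse extends MiniNode {
    final String name;
    NameUse(String name) { this.name = name; }
}

class UnusedFieldReaper {
    // State built up during the traversal.
    private final Set<String> usedNames = new HashSet<>();
    private final List<FieldDecl> fields = new ArrayList<>();

    // Traversal phase: record declarations and uses, change nothing.
    void visit(MiniNode node) {
        if (node instanceof NameUse) usedNames.add(((NameUse) node).name);
        else if (node instanceof FieldDecl) fields.add((FieldDecl) node);
        for (MiniNode child : node.children) visit(child);
    }

    // Final phase: remove the fields that were never referenced.
    List<String> reap(ClassBody body) {
        visit(body);
        List<String> removed = new ArrayList<>();
        for (FieldDecl field : fields) {
            if (!usedNames.contains(field.name) && body.children.remove(field)) {
                removed.add(field.name);
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        ClassBody body = new ClassBody();
        body.children.add(new FieldDecl("a"));
        body.children.add(new FieldDecl("b"));
        MethodBody method = new MethodBody();
        method.children.add(new NameUse("a"));
        body.children.add(method);
        System.out.println(new UnusedFieldReaper().reap(body)); // prints [b]
    }
}
```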

Well, granted, the problem we're talking about, the tree-to-tree translation, is different, but maybe not fundamentally so. You build up the data you need from traversing the tree. Probably you build a parallel tree of Python or CSharp nodes as you traverse the Java subtree. There are various ways to skin a cat here. For instance, you could take a two-pass approach. In the first pass, just instantiate the parallel nodes for the corresponding Java nodes and build a lookup map from the Java nodes to the Python nodes, say. So you have a lookup table that maps all your Java MethodCall objects to Python MethodCall objects, and so on. Then, in a second pass, you walk the Java subtree again and set all the parent-child relationships in the parallel node tree. Maybe that's conceptually simpler than doing it all in a single pass, though that is probably quite feasible as well.
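A rough sketch of the two-pass idea follows. JNode and PyNode are hypothetical stand-ins for the generated Java and Python AST classes, and the "Py" prefix is a placeholder for whatever per-node mapping a real translator would apply. The first pass creates a counterpart for every source node and records it in an identity map, and the second pass uses that map to wire up parents and children.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical source-side node (stands in for a Java AST node).
class JNode {
    final String kind;
    final List<JNode> children = new ArrayList<>();
    JNode(String kind, JNode... kids) {
        this.kind = kind;
        for (JNode kid : kids) children.add(kid);
    }
}

// Hypothetical target-side node (stands in for a Python AST node).
class PyNode {
    final String kind;
    PyNode parent;
    final List<PyNode> children = new ArrayList<>();
    PyNode(String kind) { this.kind = kind; }
}

class TwoPassTranslator {
    // Identity map: two structurally equal source nodes stay distinct.
    private final Map<JNode, PyNode> counterpart = new IdentityHashMap<>();

    // Pass 1: create one unlinked target node per source node.
    private void createNodes(JNode source) {
        counterpart.put(source, new PyNode("Py" + source.kind));
        for (JNode child : source.children) createNodes(child);
    }

    // Pass 2: walk the source tree again and set parent/child links.
    private void linkNodes(JNode source) {
        PyNode target = counterpart.get(source);
        for (JNode child : source.children) {
            PyNode targetChild = counterpart.get(child);
            targetChild.parent = target;
            target.children.add(targetChild);
            linkNodes(child);
        }
    }

    PyNode translate(JNode root) {
        createNodes(root);
        linkNodes(root);
        return counterpart.get(root);
    }

    public static void main(String[] args) {
        JNode call = new JNode("MethodCall", new JNode("Name"), new JNode("Arguments"));
        PyNode result = new TwoPassTranslator().translate(call);
        System.out.println(result.kind + " with " + result.children.size() + " children");
    }
}
```

Because every counterpart already exists before linking starts, the second pass never needs to worry about the order in which nodes are visited.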

But anyway, the fact remains that we have (for free, really, by virtue of having the Python and CSharp grammars) an AST that represents quite precisely the parse tree for those languages. See build/generated-java/org/congocc/parser/python/ast and build/generated-java/org/congocc/parser/csharp/ast respectively. All of that is generated from the grammars, right?

And, of course, generating the properly indented flat text representation of your generated Python subtree, for example, should be child's play by comparison. Probably you just write another visitor that does that, pretty much analogous to this or this.
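As an illustration of how small such an emitting visitor could be, here is a sketch over a hypothetical OutNode type (one line of text plus a nested body), which is an assumption for the example and not one of the generated Python AST classes. The emitter keeps the current depth as visitor state and writes each node's text at that depth.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical output node: one line of text plus an indented body.
class OutNode {
    final String text;
    final List<OutNode> body = new ArrayList<>();
    OutNode(String text, OutNode... kids) {
        this.text = text;
        for (OutNode kid : kids) body.add(kid);
    }
}

class IndentingEmitter {
    private final StringBuilder out = new StringBuilder();
    private int depth = 0;   // current nesting level, kept as visitor state

    void visit(OutNode node) {
        for (int i = 0; i < depth; i++) out.append("    ");
        out.append(node.text).append('\n');
        depth++;             // children are one level deeper
        for (OutNode child : node.body) visit(child);
        depth--;
    }

    static String emit(OutNode root) {
        IndentingEmitter emitter = new IndentingEmitter();
        emitter.visit(root);
        return emitter.out.toString();
    }

    public static void main(String[] args) {
        OutNode function = new OutNode("def f(x):",
            new OutNode("if x:", new OutNode("return 1")),
            new OutNode("return 0"));
        System.out.print(emit(function));
    }
}
```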

Well, I hope that answers your question. Of course, the proof of the pudding is in the implementation. But, offhand, I don't see any particular technical obstacle in the approach I just outlined, so...

vsajip commented 9 months ago

It's all just software, so "anything is possible". However, for me the biggest problem is development speed. A bottom-up approach like node visiting is just not as productive for me as the top-down approach I've used. (I know this from experience with writing a theme for the Sphinx documentation engine, which uses node-visit functionality to style and modify markup.)

One has very limited context at the level of e.g. even a name. Is it a field, to be output in Python as self.foo? Or just a variable, to be output as foo? While there is definitely some improvement that can be achieved in the existing code (e.g. break up the transformTree method), the basic transpilation approach is to take a Java AST and convert it to either Python or C# (or anything else). Python or C# ASTs don't figure into that (as I see it), as they would if you were parsing Python or C# code.
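The field-or-variable question above comes down to what context is available when a name is reached. As a sketch only, and without taking a side on traversal style, the hypothetical NameContext class below (not part of CongoCC) shows the kind of scope information either approach would have to carry: the enclosing class's field names plus a stack of local scopes, with locals shadowing fields.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: tracks which names are fields and which are locals.
class NameContext {
    private final Set<String> fields = new HashSet<>();
    private final Deque<Set<String>> localScopes = new ArrayDeque<>();

    void declareField(String name) { fields.add(name); }
    void enterScope() { localScopes.push(new HashSet<>()); }
    void exitScope() { localScopes.pop(); }

    void declareLocal(String name) {
        if (localScopes.isEmpty()) throw new IllegalStateException("no open scope");
        localScopes.peek().add(name);
    }

    // A local in any open scope shadows a field of the same name.
    String toPython(String name) {
        for (Set<String> scope : localScopes) {
            if (scope.contains(name)) return name;
        }
        return fields.contains(name) ? "self." + name : name;
    }

    public static void main(String[] args) {
        NameContext context = new NameContext();
        context.declareField("foo");
        context.enterScope();
        context.declareLocal("bar");
        System.out.println(context.toPython("foo")); // prints self.foo
        System.out.println(context.toPython("bar")); // prints bar
        context.exitScope();
    }
}
```

In a top-down translator this object would be passed down as an argument, while a visitor would hold it as a field and update it on entering and leaving method bodies.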