congo-cc / congo-parser-generator

The CongoCC Parser Generator, the Next Generation of JavaCC 21, which in turn was the next generation of JavaCC
https://discuss.congocc.org/

I really wonder about the value of certain newer tree-building annotations #195

Open revusky opened 6 months ago

revusky commented 6 months ago

Well, as a case in point, consider this line. I mean, specifically, that writing:

  ( @importDeclarations :+= ImportDeclaration )*!

instead of just:

 (ImportDeclaration)*!

causes the CompilationUnit node to have an extra List<Node> importDeclarations field, to which the various ImportDeclaration child nodes are added. However, without that extra annotation, we can already write:

 List<ImportDeclaration> importDeclarations = compilationUnit.childrenOfType(ImportDeclaration.class);

I mean, we have the above possibility available for free, but aside from it just being automatically there, it gives us a List<ImportDeclaration>, while the field generated by the extra annotation gives us a List<Node>, so if we want to iterate over the nodes in the container, we typically have to downcast. So, instead of writing simply:

     for (ImportDeclaration idecl : compilationUnit.childrenOfType(ImportDeclaration.class)) {
               ...
     }

we would have to write:

    for (Node n : compilationUnit.getImportDeclarations()) {
              ImportDeclaration idecl = (ImportDeclaration) n;
              ...
    }

i.e. something that is more typical of pre-generics Java code. But I really just can't figure out what this is buying one, when you already have an API that gives you all the ImportDeclaration child nodes and makes better use of the type system.

Well, maybe I'm just missing something about all this. I honestly can't see where the extra feature is pulling its weight. I recently patched this class so as not to use the getNamedChild (or getNamedChildList) feature and, again, it's very hard to see what the real gain was.

Well, okay, there is a problem at times with the fact that a grammar production either creates a new node or just passes up a single node that is already on the stack. For example, if you wrote a production:

      AnnotatedIdentifier : (Annotation)* <IDENTIFIER> ;

This production, by default, will just put an IDENTIFIER token on the stack if there is no annotation, but otherwise rolls up an AnnotatedIdentifier node. So it is true that if you had something like:

     AnnotatedIdentifier  /annotatedIdentifier/

then getNamedChild("annotatedIdentifier") could return either a bare IDENTIFIER token or an AnnotatedIdentifier node. And this means that it has to return the base Node type, since we don't know which node type we have.
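
To make that concrete, here is a rough sketch of the client code this forces (the parent variable is just for illustration, and the token is assumed to show up as a Token instance):

     Node child = parent.getNamedChild("annotatedIdentifier");
     if (child instanceof AnnotatedIdentifier) {
           // (Annotation)* matched at least once, so an AnnotatedIdentifier node was rolled up
           AnnotatedIdentifier aid = (AnnotatedIdentifier) child;
           // ...
     } else {
           // no annotations, so the bare IDENTIFIER token was passed up
           Token identifier = (Token) child;
           // ...
     }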

But the other possibility would be just to make this an unconditionally built node, so we could have something like:

     PossiblyAnnotatedIdentifier# : (Annotation)* <IDENTIFIER>;

In that case, we would always build a node even if there are no annotations and the only thing the node contains is a single IDENTIFIER child. And actually, you can see that I addressed the issue with EnumConstant that way. I changed EnumConstant to an unconditional node.
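
With PossiblyAnnotatedIdentifier built unconditionally, a client-code sketch (variable names are purely illustrative) no longer needs any runtime type check:

     // PossiblyAnnotatedIdentifier is always built, so its type is known up front.
     for (PossiblyAnnotatedIdentifier pai : parent.childrenOfType(PossiblyAnnotatedIdentifier.class)) {
           List<Annotation> annotations = pai.childrenOfType(Annotation.class); // possibly empty
           // ... the single IDENTIFIER token is also a child of pai ...
     }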

But anyway, I would at least like to get rid of these tree-building annotations, like the aforementioned @importDeclarations :+= ImportDeclaration, from the Java grammar. I mean, aside from the fact that this is not really documented anywhere, I think it is usually better to use the built-in API that makes better use of the type system.

adMartem commented 5 months ago

Sorry to take so long to respond to this. I've been off in another world for a while and it took me a couple of days to remember why I put this capability in.

I think it is mainly useful (i.e., used at least once by me for a reason other than testing) in the scenario of the X_JTB_PARSE_TREE option. In that case, the tree generated is probably most accurately described as a parse tree, with lots of nodes generated that would not be there in a normal CongoCC-generated tree, especially if smart nodes are enabled. As a result, it is rather convoluted to assemble a single list of nonterminal nodes that have connectives (such as item, item, ... , item) even with the tools provided by the Node interface. Specifically, in JTB mode the expansion Foo ("," Foo)* cannot simply use childrenOfType(Foo.class) because only the first Foo is a direct child of the expansion Node. Even descendantsOfType(Foo.class) is problematic if there are other sub-components of the expansion that include Foo. Basically, with a JTB tree, every nonterminal always generates a node and every ExpansionUnit always generates a node, so it becomes potentially useful to be able to "collect" a list of nodes in this manner.
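
To sketch it concretely (the names here are hypothetical, following the getImportDeclarations pattern from the Java grammar): if every Foo occurrence in Foo ("," Foo)* were collected into a @foos list, the generated node would expose something like getFoos(), which flattens the nesting that defeats childrenOfType:

     // In JTB mode only the first Foo is a direct child of the expansion's node;
     // the rest are nested under the node built for ("," Foo)*.
     List<Foo> direct = expansionNode.childrenOfType(Foo.class);   // just the first Foo
     // A collected @foos list (hypothetical) would gather all of them at parse time:
     for (Node n : expansionNode.getFoos()) {
           Foo foo = (Foo) n;   // List<Node>, so a downcast is still needed
           // ...
     }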

But probably the seminal reason I added the += annotation was because I was trying to harmonize the property assignment annotations with the namedChild/namedChildList feature Vinay added. The instances in the Java grammar I added were probably intended to be removed and I forgot they were there. I do remember that it was very difficult to get them working correctly with the Translator logic, which was a fountainhead of test cases for the implementation along the way. I even learned how to do some things in Java that I had never done and didn't think could be done, such as List<ImportDeclaration> importDecls = new ArrayList<>((List<ImportDeclaration>)(List)jcu.getImportDeclarations());.

Having said all that, I've only used the += once in the COBOL parser so far (the other new assignment annotations I've used all over the place). I certainly agree the cases they were used for in Java were contrived and didn't contribute to making the code better or clearer.

I really must find time to document this stuff. I had convinced myself that I needed to do so only after getting more certainty as to the value of some of them, and when the actual implementation fully stabilized. I'll try and revisit this sooner rather than later.

revusky commented 5 months ago

I certainly agree the cases they were used for in Java were contrived and didn't contribute to making the code better or clearer.

Well, the main problem I see with those things is that they give a mistaken impression to anybody coming across that stuff. I mean, if you have things like:

#CompilationUnit# :
  [ @packageDeclaration := PackageDeclaration! ]
  ( @importDeclarations :+= ImportDeclaration )*!
  ...

somebody who eyeballs that might easily think that these notations are necessary in order to get at the child nodes, like the package declaration or the import declarations. And that is a completely mistaken idea, of course. If you just write the same thing without those notations, i.e.

   #CompilationUnit# :
        [PackageDeclaration!]
        (ImportDeclaration)*!
        ...

there is already an API for getting whatever nodes you want, generated automatically. You can write:

  CompilationUnit root = parser.CompilationUnit();
  List<ImportDeclaration> imports = root.childrenOfType(ImportDeclaration.class);

In fact, the code on the Java side that uses this is cleaner, since the regular API returns a parameterized container. You iterate over it and you don't have to downcast as you would with a List<Node>. Well, you know all that, I'm sure.... But, again, the problem I see is that it would be perfectly reasonable for anybody reading this to conclude (though mistakenly) that these tree-building annotations are necessary!

In general, I think we should really try to keep the grammars in examples/* as clean as possible and have them be pretty good usage examples. So, we really shouldn't be using the Java grammar (or the Python grammar, say) as a kind of sandbox to test experimental features... though I have to admit I've surely done that a bit myself!

I mean, actually, come to think of it, there is nothing preventing you from creating a sandbox/* or maybe experimental/* directory and putting some experimental things in there.

I even learned how to do some things in Java that I had never done and didn't think could be done, such as List<ImportDeclaration> importDecls = new ArrayList<>((List<ImportDeclaration>)(List)jcu.getImportDeclarations());.

Well, I guess you can do that, but the compiler doesn't like it. I mean, it surely issues some warnings about unsafe type conversions, though I guess you can also use the appropriate @SuppressWarnings annotation, which I guess exists for people who have a fetish about their code compiling with no warnings. (Though it seems to me that if you have a fetish about your code compiling with no warnings, you should also have a fetish about not using the @SuppressWarnings annotation!)
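
For what it's worth, here is a minimal standalone version of that cast pattern, with stand-in types, just to show where javac complains and where @SuppressWarnings comes in:

     import java.util.ArrayList;
     import java.util.List;

     public class UncheckedCastDemo {
           // Same shape as the jcu.getImportDeclarations() cast above, with stand-in types.
           // javac accepts it but reports an unchecked conversion on the double cast
           // unless the annotation below is present.
           @SuppressWarnings("unchecked")
           static List<String> narrow(List<Object> mixed) {
                 return new ArrayList<>((List<String>) (List) mixed);
           }

           public static void main(String[] args) {
                 List<String> strings = narrow(new ArrayList<>(List.of("a", "b", "c")));
                 System.out.println(strings);
           }
     }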

Well, anyway, I'm not really so against the X_JTB_PARSE_TREE option if you're doing the work to implement and maintain it and you find it useful in your own work. But I don't think we should be adding these things to the Java grammar (or any other examples/* grammar) when they are not even needed, because, like I said, it gives anybody who is trying to study these examples the wrong impression. It's quite reasonable for them to conclude that those various annotations are necessary when, in fact, they're mostly a result of playing around with experimental features that, frankly, most people don't really need. Or not very often, anyway.

Well, to be clear, I can see the use case for this sort of thing, like in some cases where you don't know the type of the node or nodes on the stack at a given spot. This tends to happen as a result of the SMART_NODE_CREATION feature, which is on by default: if a production only puts one node on the stack, then by default it doesn't roll up a new node. But without that, you get this kind of Matryoshka doll situation where you're drilling down so many levels deep to get the information you want because, well, you generate an AST full of Matryoshka dolls. I guess that's what JTB typically does, though, to be honest, I don't really know anything about JTB. I guess some people like JTB though, so...
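
As a caricature of that Matryoshka effect (the node names below are invented for illustration; only childrenOfType is real API here), reaching the interesting node can mean unwrapping one single-child wrapper after another:

     // Hypothetical expression-grammar nesting, with every production always building a node:
     AdditiveExpression add = expr.childrenOfType(AdditiveExpression.class).get(0);
     MultiplicativeExpression mul = add.childrenOfType(MultiplicativeExpression.class).get(0);
     UnaryExpression unary = mul.childrenOfType(UnaryExpression.class).get(0);
     PrimaryExpression primary = unary.childrenOfType(PrimaryExpression.class).get(0);
     // With SMART_NODE_CREATION, those single-child wrappers are simply never built.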

revusky commented 5 months ago

As a result, it is rather convoluted to assemble a single list of nonterminal nodes that have connectives (such as item, item, ... , item) even with the tools provided by the Node interface. Specifically, in JTB mode the expansion Foo ("," Foo)* cannot simply use childrenOfType(Foo.class) because only the first Foo is a direct child of the expansion Node.

Well, okay, but another option is just to write getters by hand that return the list of nodes you're interested in, i.e.:

  INJECT SomeParentNode :
  {
        public List<SomeNodeType> getNodesOfInterest() {
              // ... whatever logic to fish out the nodes and return them as a list ...
        }
  }

I understand that this is more verbose than the += notation and such, but the injected getter is rarely more than a few lines. Oh, and I should also point out that there is a form of childrenOfType and descendantsOfType that takes a predicate as an argument, so you can use a lambda, say:

    public List<SomeNodeType> getWhateverNodes() {
        return descendantsOfType(SomeNodeType.class, n -> { /* whatever condition for inclusion */ return true; });
    }

I mean, if you can express the condition for inclusion in the list tersely in a lambda, you can have fairly terse code. Or if the lambda is unwieldy, you can break out the logic into a method and then use a method reference.
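
For instance, the method-reference variant might look something like this (a sketch only; the names are made up, and it assumes the predicate receives the typed node):

     public List<SomeNodeType> getWhateverNodes() {
           return descendantsOfType(SomeNodeType.class, this::isNodeOfInterest);
     }

     private boolean isNodeOfInterest(SomeNodeType n) {
           // the inclusion logic, pulled out of the lambda
           return true;
     }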

But, I mean, JTB is prehistoric; there were no lambdas or functional interfaces or any of that back then. Actually, I think when they wrote JTB there weren't even the collections classes, like java.util.List and such! Actually, that is all true of JavaCC itself!

adMartem commented 3 months ago

As you can see, I'm having trouble defending the += annotation. I really think I did it just to replicate Vinay's named-child notation functionality in a way consistent with the LHS variants, and then extended it to also have a property-generating form. In other words, first came /myChildName/ += MyProduction, then came @myChildNameList += MyProduction, and finally the most elaborate form, @myChildNameList :+= MyProduction. Perhaps all three are a mistake for the reasons you cite. Almost certainly they do not exemplify best practice for CongoCC for most use cases. I plan to revisit this area when I can get to documenting it, so I will scrutinize the utility of that (+=) notation at that time. In the meantime, it should not be used in any of the examples, for sure.