beanshell / beanshell

Beanshell scripting language
Apache License 2.0
815 stars 183 forks source link

Problem handling multi-line comments #753

Open opeongo opened 4 months ago

opeongo commented 4 months ago

There seems to be a problem in the way that multi-line comments are handled in the grammar. The regular expression token for the multi-line comment also catches formal comments, so in fact formal comments are never produced: they are always all matched and skipped by the MULTI_LINE_COMMENT token.

Here is the line in question from the .jjt file: https://github.com/beanshell/beanshell/blame/b35aac408f9c86ec20f1201a5ae9582ed99e2ae2/src/main/jjtree/bsh.jjt#L312

This problem could be fixed with the following regular expression that would not gobble formal comments:

    ("/***" (["*"])* | "/*" ~["*"]) (~["*"])* "*" ("*" | (~["*","/"] (~["*"])* "*"))* "/">

However, I don't think that this is the right fix. Javadoc accepts multiline comments that start with two or more asterisks. The existing regular expression will skip comments that start with three or more asterisks, which is different behaviour than Java. So parsing Java and looking for formal comments is broken w.r.t. Java for syntactically correct Java code. BeanShell claims that it can parse Java files, so this is a bit of a misconfiguration.

I think that the correct fix is to change the MULTI_LINE_COMMENT regular expression to only take /* comments as multi-line comments. Any comment that begins with two or more asterisks should be passed on so that it can be parsed as a FORMAL_COMMENT. In that case the FORMAL_COMMENT regular expression should become:

 "/**" "*"* (~["*"])* "*" ("*" | (~["*","/"] (~["*"])* "*"))* "/"
opeongo commented 4 months ago

When you pull on the string more issues crop up.

Since FORMAL_COMMENT is defined as a regular token then it has to be included in to the grammar. Currently a FORMAL_COMMENT may only appear where a block statement is expected. This means that in BeanShell formal comments may not appear within statements. So the following which is legal in Java is not allowed in BeanShell:

if (1 == 2)
    assert(false);
/**
This comment is OK in java, not allowed in BeanShell
*/
else if (2 == 2)
    assert(true);
/**
Same with this one.
*/
else
    assert(false);

Maybe this seems like a corner case, but it will throw a parsing error.

Perhaps there needs to be some trick with lexical states to ignore formal comments within statements?

opeongo commented 4 months ago

Maybe the best and easiest solution to this problem is to not have formal comments be a proper token, and instead they should just be a SPECIAL_TOKEN like the other comments. That would completely solve the issue about the types and locations of comments and compatibility with Java.

The drawback is that there would no longer be a BSHFormalComment node. The only use that I see for formal comments is the scripts/bshdoc.bsh script which automatically generates documentation from beanshell scripts (kind of like how javadoc does it). Does anything else use BSHFormalComment?

The bshdoc.bsh script could be changed to extract the formal comments from the special tokens, not too difficult (see the JavaCC docs). That would require that both the SimpleNode class and the SimpleNode.firstToken field had public access (see #730), or providing accessor functions.

Any thoughts on this?

nickl- commented 4 months ago

Maybe the best and easiest solution to this problem is to not have formal comments be a proper token, and instead they should just be a SPECIAL_TOKEN like the other comments. That would completely solve the issue about the types and locations of comments and compatibility with Java.

The drawback is that there would no longer be a BSHFormalComment node. The only use that I see for formal comments is the scripts/bshdoc.bsh script which automatically generates documentation from beanshell scripts (kind of like how javadoc does it). Does anything else use BSHFormalComment?

Any thoughts on this?

Yes it is used for documentation.

The bshdoc.bsh script could be changed to extract the formal comments from the special tokens, not too difficult (see the JavaCC docs). That would require that both the SimpleNode class and the SimpleNode.firstToken field had public access (see #730), or providing accessor functions.

Will have to look into this, not too keen on public exposure which has other drawbacks.

nickl- commented 4 months ago

754 merged into master

opeongo commented 4 months ago

Yes it is used for documentation.

With this type of change formal comments can still be used by tools that want to process documentation, but they will have to use a different accessor (call a method instead of matching on a node). The trade-off is between keeping a troublesome parsing bug or changing an obscure internal part of the API.

Will have to look into this, not too keen on public exposure which has other drawbacks.

From within the scripting environment there is no exposure of the current parser, that is all 'back stage'. The exposure would be to the creator of the parser, who has full control anyways.

revusky commented 4 months ago

You know, you guys would really be doing yourselves a huge favour if you just made a big push to get the CongoCC based grammar integrated. It's just been sitting there for 9 months at this point.

Tom, did you ever take a look at it? https://github.com/beanshell/beanshell/blob/congocc/src/main/congo/Beanshell.ccc In particular, CongoCC has some significant improvements to the tokenization side. In particular, the new lazy tokenization feature (that was not even present last April) allows you to specify these sorts of comment tokens much more cleanly. Also, the ability to turn on/off token types (a not so new feature) also allows one to deal with these things more cleanly without defining extra lexical states. See: https://parsers.org/javacc21/activating-de-activating-tokens/ So you could, for example, activate/deactivate the FORMAL_COMMENT token type on an as-needed basis.

nickl- commented 1 month ago

With this type of change formal comments can still be used by tools that want to process documentation, but they will have to use a different accessor (call a method instead of matching on a node). The trade-off is between keeping a troublesome parsing bug or changing an obscure internal part of the API.

Will have to look into this, not too keen on public exposure which has other drawbacks.

From within the scripting environment there is no exposure of the current parser, that is all 'back stage'. The exposure would be to the creator of the parser, who has full control anyways.

By increasing the visibility we increase the exposure, meaning it is not just for internal use but users have access too. This brings in itself other drawbacks like having to maintain depredations etc.