congo-cc / congo-parser-generator

The CongoCC Parser Generator, the Next Generation of JavaCC 21, which in turn was the next generation of JavaCC
https://discuss.congocc.org/

C# grammar allows random # characters in source code without flagging errors #158

Closed vsajip closed 3 months ago

vsajip commented 4 months ago

For example,

#pragma warn disable 999
#
#
#

####

#
namespace foo.bar### {
#if true
    /* This is a comment. */
#else
    // This is another.
#endif
}

is parsed without errors.
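To make the expected behaviour concrete, here is a minimal sketch of a line-level check that would flag the stray `#` characters in the example above. It is not CongoCC's lexer logic. The function name `stray_hash_lines` is hypothetical, and the check assumes that a `#` is only legitimate as the first non-whitespace character on a line, followed by a known C# directive name. It ignores `#` inside strings and comments, and it does not look at what follows the directive name.

```python
import re

# C# preprocessor directive names.
DIRECTIVES = {
    "if", "elif", "else", "endif", "define", "undef", "warning",
    "error", "line", "region", "endregion", "pragma", "nullable",
}

# A directive line: optional whitespace, '#', optional whitespace, a name.
DIRECTIVE_RE = re.compile(r"^\s*#\s*([A-Za-z]+)\b")


def stray_hash_lines(source):
    """Return 1-based numbers of lines with a '#' that does not start a known directive.

    Simplification (assumption): '#' inside strings or comments is not handled,
    and text after the directive name is not checked.
    """
    bad = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if "#" not in line:
            continue
        match = DIRECTIVE_RE.match(line)
        if match is None or match.group(1) not in DIRECTIVES:
            bad.append(lineno)
    return bad


if __name__ == "__main__":
    sample = "#pragma warn disable 999\n#\n####\nnamespace foo.bar### {\n#if true\n#endif\n}\n"
    for lineno in stray_hash_lines(sample):
        print(f"line {lineno}: unexpected '#'")
```

A real fix would live in the lexer rather than in a line scan like this one, but the sketch shows which lines of the example ought to produce errors.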

revusky commented 4 months ago

Well, I guess that needs to be caught in the relevant TOKEN_HOOK routine. I'm surprised it isn't.

vsajip commented 4 months ago

I think it's related to setting the token.type to INVALID vs. returning a new InvalidToken.

revusky commented 4 months ago

> I think it's related to setting the token.type to INVALID vs. returning a new InvalidToken.

Yeah, well, this is all kind of coming back to the fact that all the fault-tolerant stuff is kind of green, and really has been for several years. It's time to really get all that stuff properly nailed down. I'm thinking that something to do over the coming while is for us to have all of the parsers that come with Congo buildable as fault-tolerant and be sure that this is actually working. Well, FreeMarker as well. I think that the FreeMarker parser being able to parse in fault-tolerant mode would be a significant improvement because we could get a list of all the errors, as opposed to just the stack trace on the first error that is hit in parsing.
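The difference between stopping at the first error and collecting a list of all errors can be sketched with a toy parser. This is only an illustration of the general idea, not how CongoCC or FreeMarker implement it: the names `ParseError` and `parse_assignments`, the `fault_tolerant` flag, and the `name = value` input format are all assumptions made up for the example.

```python
class ParseError(Exception):
    """Error raised (or recorded) when a line cannot be parsed."""

    def __init__(self, lineno, message):
        super().__init__(f"line {lineno}: {message}")
        self.lineno = lineno


def parse_assignments(text, fault_tolerant=False):
    """Parse 'name = value' lines and return (result dict, list of errors).

    In strict mode the first bad line raises ParseError.
    In fault-tolerant mode the error is recorded and parsing
    resumes at the next line.
    """
    result, errors = {}, []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        name, sep, value = line.partition("=")
        if not sep or not name.strip().isidentifier():
            err = ParseError(lineno, f"bad assignment: {line!r}")
            if not fault_tolerant:
                raise err
            errors.append(err)  # record the error and keep going
            continue
        result[name.strip()] = value.strip()
    return result, errors


if __name__ == "__main__":
    text = "a = 1\n= 2\nb = 3\nc 4\n"
    result, errors = parse_assignments(text, fault_tolerant=True)
    print(result)
    for err in errors:
        print(err)
```

In strict mode the same input fails on line 2 and the problem on line 4 is never reported, which is the limitation described above.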

vsajip commented 4 months ago

> this is all kind of coming back to the fact that all the fault-tolerant stuff

I thought the fault-tolerant stuff was parser-level, but this stuff is all lexer-level, isn't it?

revusky commented 4 months ago

Well, repeat-resync is parser-level, but there is also lexer-level logic. I guess the most detailed explanation of how the overall thing works (or is supposed to work) is here: https://parsers.org/javacc21/the-promised-land-fault-tolerant-parsing/

But I am really puzzled that there is a difference between returning InvalidToken and tok.setType(INVALID). I really have to get my head back into all that stuff.
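One generic way such a difference can arise is sketched below. This is purely hypothetical and is not a diagnosis of the CongoCC templates: the classes `TokenType`, `Token` and `InvalidToken` here are stand-ins written for the example, as are the two counting helpers. The point is only that code which recognises invalid tokens by class will miss a token whose type was changed in place, while a token-level check based on the type sees both.

```python
from enum import Enum, auto


class TokenType(Enum):  # hypothetical token types
    IDENTIFIER = auto()
    HASH = auto()
    INVALID = auto()


class Token:
    def __init__(self, type, image):
        self.type = type
        self.image = image

    @property
    def is_invalid(self):
        # Token-level predicate that depends only on the current type.
        return self.type is TokenType.INVALID


class InvalidToken(Token):
    def __init__(self, image):
        super().__init__(TokenType.INVALID, image)


def count_invalid_by_class(tokens):
    """Count tokens that are instances of InvalidToken."""
    return sum(isinstance(t, InvalidToken) for t in tokens)


def count_invalid_by_type(tokens):
    """Count tokens whose type is INVALID, however they were created."""
    return sum(t.is_invalid for t in tokens)


if __name__ == "__main__":
    retagged = Token(TokenType.HASH, "#")
    retagged.type = TokenType.INVALID   # analogous to tok.setType(INVALID)
    replaced = InvalidToken("#")        # analogous to returning a new InvalidToken
    tokens = [retagged, replaced]
    print(count_invalid_by_class(tokens), count_invalid_by_type(tokens))
```

If any downstream step uses the class-based style of check, the two approaches would behave differently even though both tokens report the same type.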

vsajip commented 4 months ago

I'll leave it with you, then. Without a proper exposition of how the skipped/ignore/caching stuff works, I am likely to be just floundering around in the dark - I've tried a few things and none of them seem to work, because I'm effectively just tinkering!

revusky commented 4 months ago

> I'll leave it with you, then.

Well, I had assumed that it was up to me to fix this stuff. But I was also delighted to see you having a go at it. Well, in any case, you're pretty obviously advancing in your overall understanding of how everything works. (Or is supposed to work!)

vsajip commented 4 months ago

It also seems a bit odd to have isInvalid(), isUndefined(), isEOF() on the token type - ISTM one generally only cares about this at the Token level.

vsajip commented 4 months ago

I've added a test for unparsed content. This is the current expected output for parsing a simple C# file: https://github.com/congo-cc/congo-parser-generator/blob/c99ca73a2746be3a8c5536238c9f785002b8ba07/misc_tests.py#L380-L394

Once the bugs around unparsed content are fixed, the expected output will need to change for the test to keep passing: the # characters that appear around the comment in the output (which correspond to the #if/#else directives in the source) will have to be removed or otherwise altered.

It would also be really helpful if all the template changes appear, once merged, as a single squashed commit - that would make syncing Java templates with the other languages much easier.