Scala.g4 fails to parse public spark code

talwgx commented 1 year ago

Attempting to parse Spark code (https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala) into an AST using Scala.g4 (latest version on master) results in the errors below. Multi-line comments were removed during pre-processing (also double-checked with https://tools.knowledgewalls.com/remove-comments-from-any-program-online). An older version of this file (https://gist.github.com/talwgx/c53e57377a92558168819e3a8f23d93b) also failed, in a different location.

742:10 mismatched input 'def' expecting {'implicit', 'lazy', 'case', 'override', 'abstract', 'final', 'sealed', 'private', 'protected', 'class', 'object', 'trait'}
757:11 mismatched input 'def' expecting {'implicit', 'lazy', 'case', 'override', 'abstract', 'final', 'sealed', 'private', 'protected', 'class', 'object', 'trait'}
761:11 mismatched input 'def' expecting {'implicit', 'lazy', 'case', 'override', 'abstract', 'final', 'sealed', 'private', 'protected', 'class', 'object', 'trait'}
766:11 mismatched input 'def' expecting {'implicit', 'lazy', 'case', 'override', 'abstract', 'final', 'sealed', 'private', 'protected', 'class', 'object', 'trait'}
772:11 mismatched input 'def' expecting {'implicit', 'lazy', 'case', 'override', 'abstract', 'final', 'sealed', 'private', 'protected', 'class', 'object', 'trait'}
785:11 mismatched input 'def' expecting {'implicit', 'lazy', 'case', 'override', 'abstract', 'final', 'sealed', 'private', 'protected', 'class', 'object', 'trait'}
801:21 mismatched input 'def' expecting {'implicit', 'lazy', 'case', 'override', 'abstract', 'final', 'sealed', 'private', 'protected', 'class', 'object', 'trait'}
805:10 mismatched input 'def' expecting {'implicit', 'lazy', 'case', 'override', 'abstract', 'final', 'sealed', 'private', 'protected', 'class', 'object', 'trait'}

The picture is from a locally run antlr-lab (the online version aborts because of the grammar warning).

antlr

Any help would be much appreciated!

@Marti2203

Marti2203 commented 1 year ago

Hi, can you show me how you parse the file? Do you use the special token stream?

kaby76 commented 1 year ago

Note: because of the new "discovery" mode of testing (trgen parses all grammars and extracts the start rule and grammar name, so you never need to declare them), testing of this grammar will fail: compilationUnit does not have an EOF. Consequently, if one makes a change for this issue, compilationUnit should also be fixed by adding EOF at the end of the rule. All targets "work", but PHP is not tested because it's too damn slow. (I'm doing a rewrite of significant parts of the Antlr PHP runtime.)

There's also a warning from the Antlr tool:

warning(131): Scala.g4:1349:25: greedy block ()* contains wildcard; the non-greedy syntax ()*? may be preferred

This should be fixed because at some point warnings will be flagged as errors in the build.

kaby76 commented 1 year ago

Look at the tokens for the input https://github.com/apache/spark/blob/3dd629629ab151688b82a3aa66e1b5fa568afbfa/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala. Multi-line comments don't work. Preferably, all those "-> skip" should be "-> channel(HIDDEN)" so one can see the tokens when using lab.antlr.org. MesosCoarseGrainedSchedulerBackend.scala.txt

kaby76 commented 1 year ago

A possible fix for multi-line comments.

COMMENT : MLC -> channel(HIDDEN) ;
fragment MLC : '/*' (MLC | .)*? '*/' ;

It would need to be tested with nested /* ... */, which I gather Scala supports. (Yuck.)

Fails now in other places, so there are other problems with the grammar.
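To make the nesting concern concrete, here is a minimal Python sketch of what "nested /* ... */" handling has to do: track a depth counter instead of stopping at the first "*/". It is only an illustration, not part of the grammar or of any ANTLR runtime. The function name strip_nested_comments is hypothetical, and the sketch deliberately ignores string literals and // line comments, which a real lexer would have to respect.

```python
def strip_nested_comments(src: str) -> str:
    """Remove /* ... */ comments, honouring nesting.

    Illustrative only: string literals and // comments are NOT handled.
    Newlines inside comments are kept so line numbers stay stable.
    """
    out = []
    depth = 0  # how many /* are currently open
    i = 0
    while i < len(src):
        pair = src[i:i + 2]
        if pair == "/*":
            depth += 1
            i += 2
        elif pair == "*/" and depth > 0:
            depth -= 1
            i += 2
        elif depth > 0:
            if src[i] == "\n":
                out.append("\n")  # preserve line structure
            i += 1
        else:
            out.append(src[i])
            i += 1
    if depth != 0:
        raise ValueError("unterminated multi-line comment")
    return "".join(out)


if __name__ == "__main__":
    print(repr(strip_nested_comments("a /* x /* y */ z */ b")))
```

A matcher that stops at the first "*/" would leave " z */" behind on that input, which is the failure mode the recursive MLC fragment is meant to avoid.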

kaby76 commented 1 year ago

The doc/grammar for string literals (here) hasn't kept up with newer features such as interpolation. Why does that not surprise me.

The way the csharp grammar models string interpolation (here) might be a way to proceed. If done that way, lab.antlr.org cannot be used because there would be action code.

Marti2203 commented 1 year ago

@kaby76 I am pretty sure I took inspiration from the string interpolation in C# when I made this 2 years ago. The Scala grammar is quite complicated, and at the time I made it parse Spark and other projects; I do not know if things have changed drastically. Are you using the custom token stream I made? Without it, newlines are pretty hard to track and things will break (the grammar also does not support Scala 3). I wanted to get more work done but uni took up a lot of time.

talwgx commented 1 year ago

@Marti2203 I'm using https://github.com/antlr/antlr4-lab/blob/master/src/org/antlr/v4/server/ANTLRHttpServer.java to parse Scala.g4 from master. There used to be a special token stream but I don't see it on master anymore, so let me know if you want me to check with it. The antlr-lab code aborts if there's a warning in the grammar (and there is one), so I let it run even with the warning, which is the screenshot you see at the top of the thread. Thanks!

Marti2203 commented 1 year ago

Never mind, I forgot that it never got merged in...

kaby76 commented 1 year ago

@Marti2203 OK, got it. https://github.com/Marti2203/grammars-v4/tree/master/scala. https://github.com/antlr/grammars-v4/pull/2228

Marti2203 commented 1 year ago

Yeah, sorry for the stupid situation, it has been too long... the things I have work but were never pushed to this repo :disappointed:

kaby76 commented 1 year ago

@Marti2203 Not a problem. I'm going to refactor your code so that it fits into "target agnostic format", i.e., the code for the custom token stream is placed in the lexer base class. We could use the custom token stream as is, but then the driver would have to be custom code. trgen could handle that (place a custom driver to create the special token stream in scala/Java/Test.java), but I prefer not to do that as I update the driver code occasionally. (Or, I could have "transformGrammar.py", which is called to mutate the .g4's to target-specific format, also modify the driver, but I prefer not to do that either.)

Marti2203 commented 1 year ago

Whatever is the best way to make it fit into the framework for the repo and thanks so much for taking it over :pray:

kaby76 commented 1 year ago

I rolled the Java ScalaTokenStream code into a C# version of the ScalaLexerBase.NextToken() method. To support LT() and LB() in a token stream, I wrote a simple token buffer class with those methods. In NextToken(), the entire input is read, tokenized, and placed in a list of tokens. Then, it goes through the token list to mutate the token channel to HIDDEN, depending on what canEmitNLToken() says. Finally, it resets to the beginning of the buffered token list and lets NextToken() return the tokens in order. StepBack() is not needed because the list of tokens is entirely buffered. The parse handles the example/ test cases, but does not work yet for many .scala files in the Spark code.
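The buffering approach described above can be sketched in a few lines of Python. This is not the actual ScalaLexerBase code: the Token and TokenBuffer classes, the buffer_and_mark helper, and especially the can_emit_nl rule are hypothetical stand-ins. The real canEmitNLToken() logic is not shown in this thread, so the rule below is a made-up placeholder that only demonstrates where the lookback and lookahead are used.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

DEFAULT, HIDDEN = 0, 1


@dataclass
class Token:
    type: str
    text: str
    channel: int = DEFAULT


class TokenBuffer:
    """Fully buffered token list with LT()/LB()-style access."""

    def __init__(self, tokens: List[Token]):
        self.tokens = list(tokens)
        self.pos = 0

    def lt(self, k: int) -> Optional[Token]:
        # lt(1) is the current token, lt(2) the one after it, and so on.
        i = self.pos + k - 1
        return self.tokens[i] if 0 <= i < len(self.tokens) else None

    def lb(self, k: int) -> Optional[Token]:
        # lb(1) is the token just before the current one.
        i = self.pos - k
        return self.tokens[i] if 0 <= i < len(self.tokens) else None

    def next_token(self) -> Optional[Token]:
        tok = self.lt(1)
        if tok is not None:
            self.pos += 1
        return tok


def can_emit_nl(buf: TokenBuffer) -> bool:
    # PLACEHOLDER rule (assumption): a newline is significant unless the
    # previous token leaves an expression open or the next token is ".".
    prev, nxt = buf.lb(1), buf.lt(2)
    if prev is None or prev.text in {"(", "[", ",", "=", "."}:
        return False
    if nxt is not None and nxt.text == ".":
        return False
    return True


def buffer_and_mark(tokens: List[Token],
                    can_emit: Callable[[TokenBuffer], bool]) -> TokenBuffer:
    buf = TokenBuffer(tokens)
    for i, tok in enumerate(buf.tokens):
        buf.pos = i
        if tok.type == "NL" and not can_emit(buf):
            tok.channel = HIDDEN  # the parser will not see this newline
    buf.pos = 0  # reset so next_token() replays from the start
    return buf
```

Because every token is already in the list, looking backward is just an index computation, which is why a StepBack()-style operation is unnecessary in this design.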

antlr / grammars-v4

Scala.g4 fails to parse public spark code #3260