Open codinuum opened 6 months ago
Hi! well spotted! Can you attach the test files? Thanks!
Here they are. a0.java.txt a1.java.txt
Great!
Ok with a quick gumtree parse a0.java
I get
CompilationUnit [0,114]
TypeDeclaration [0,113]
TYPE_DECLARATION_KIND: class [0,5]
SimpleName: Foo [6,9]
MethodDeclaration [14,111]
Modifier: public [14,20]
Modifier: static [21,27]
Modifier: final [28,33]
Modifier: public [49,55]
Modifier: static [56,62]
PrimitiveType: void [63,67]
SimpleName: main [68,72]
Block [75,111]
ExpressionStatement [81,107]
MethodInvocation [81,106]
METHOD_INVOCATION_RECEIVER [81,91]
QualifiedName: System.out [81,91]
SimpleName: println [92,99]
METHOD_INVOCATION_ARGUMENTS [100,105]
StringLiteral: "bar" [100,105]
The field declaration is strangely completely ignored... This is strange. I have to look in the jdt generator code.
OK by opening the file I realize that there is no return type for the attribute which is a syntax error. If you add the type String to both files it works. I guess the big question is why there is no SyntaxError raised by JDT in this case ;-)
Working on my side, no?
module [0,113]
expression_statement [0,9]
assignment [0,9]
identifier: s [0,1]
=: = [2,3]
string: 'bar' [4,9]
function_definition [12,112]
identifier: match [16,21]
parameters [21,30]
identifier: command [22,29]
block [36,112]
return_statement [36,112]
parenthesized_expression [43,112]
boolean_operator [44,111]
comparison_operator [44,74]
string: 'push' [44,50]
attribute [54,74]
identifier: command [54,61]
identifier: script_parts [62,74]
and: and [75,78]
comparison_operator [79,111]
string: 'set-upstream' [79,93]
attribute [97,111]
identifier: command [97,104]
identifier: output [105,111]⏎
Parsing results in the following in my environment...
module [0,113]
expression_statement [0,9]
assignment [0,9]
identifier: s [0,1]
=: = [2,3]
string [4,9]
": ' [4,5]
": ' [8,9]
function_definition [12,112]
def: def [12,15]
identifier: match [16,21]
parameters [21,30]
(: ( [21,22]
identifier: command [22,29]
): ) [29,30]
:: : [30,31]
block [36,112]
return_statement [36,112]
return: return [36,42]
parenthesized_expression [43,112]
(: ( [43,44]
boolean_operator [44,111]
comparison_operator [44,74]
string [44,50]
": ' [44,45]
": ' [49,50]
in: in [51,53]
attribute [54,74]
identifier: command [54,61]
.: . [61,62]
identifier: script_parts [62,74]
and: and [75,78]
comparison_operator [79,111]
string [79,93]
": " [79,80]
": " [92,93]
in: in [94,96]
attribute [97,111]
identifier: command [97,104]
.: . [104,105]
identifier: output [105,111]
): ) [111,112]
@jrfaller I can confirm that JDT Parser skips parts of the AST that are syntactically invalid. Missing type in a field declaration makes it syntactically invalid. JDT Parser will continue parsing the rest of the file.
@tsantalis yes just what I realized. On top of that, neither the flag MALFORMED nor the flag RECOVERED seems to work in these cases. The only thing I get is that I have one "problem" in the problem list when casting the node to a CompilationUnit. I guess it could work better to detect syntax errors but I fear that it will report more than just syntax errors 😢 WDYT?
@jrfaller Updating tree-sitter-parser resolved the problem. Thanks a lot!
@tsantalis yes just what I realized. On top of that, neither the flag MALFORMED nor the flag RECOVERED seems to work in these cases. The only thing I get is that I have one "problem" in the problem list when casting the node to a CompilationUnit. I guess it could work better to detect syntax errors but I fear that it will report more than just syntax errors 😢 WDYT?
For commit analysis, which cannot be guaranteed to be syntactically valid and compilable, I think what JDT Parser is doing is fine. Instead of failing the whole file, it just skips the syntactically invalid sub-trees. There should be a way to isolate syntax errors from the problem list. I will look into it.
That reminds me of something.
I have seen conflict markers >>>>>>>
, =======
, and <<<<<<<
left even in release versions of some projects.
@jrfaller
We discovered that JDT ASTParser recommends a version to setup the parser, when it finds a parsing problem. For example, when parsing a record declaration with 1.8 compatibility, it generates a problem suggesting the use of Java 16 compatibility to parse record declarations.
We basically process the reported problems and re-parse the file with the recommended compatibility version. Here is our implementation in RefactoringMiner.
Hi @tsantalis and thanks a lot for the tips. In the last commit I switched to private final static String JAVA_VERSION = JavaCore.latestSupportedJavaVersion();
for compiler settings and ASTParser parser = ASTParser.newParser(AST.getJLSLatest());
for the parser initialization. Do you think anything can go wrong like this or parsers are backward compatible? Cheers!
@jrfaller
The parser initialization ASTParser parser = ASTParser.newParser(AST.getJLSLatest());
is totally fine and creates no problems.
The problem is with the compiler settings.
For example, since Java 16 record
is a keyword reserved for record type declarations.
But before Java 16, you could use record
as a variable name in your code.
(The same thing happened many years ago when enum
was introduced as a keyword in Java 1.5,
but now it is very rare to find a commit using enum
as a variable. The code should be older than 2004.)
If you put the latest Java version in the compiler settings, and parse a code using record
as a variable
the parser will not generate a complete AST and will skip the parts using record as a variable name.
The best strategy is to start with a lower version compiler setting.
We start from JavaCore.VERSION_14
because we found that since this version the parser generates recommendations about what version to use to parse some code, which was not successfully parsed.
If we get such a recommendation in the problems, we re-parse the file using the recommended version in the compiler settings.
@tsantalis thanks a lot for the detailed explanation (and for pointing me the fix!)
String literals seem to be ignored in some cases. (tested with build 6584227)