GumTreeDiff / gumtree

An awesome code differencing tool
https://github.com/GumTreeDiff/gumtree/wiki
GNU Lesser General Public License v3.0
893 stars 170 forks source link

No syntax exception raised in some cases for JDT #325

Open codinuum opened 6 months ago

codinuum commented 6 months ago

String literals seem to be ignored in some cases. (tested with build 6584227)

01-17-2024 22 45 17
jrfaller commented 6 months ago

Hi! well spotted! Can you attach the test files? Thanks!

codinuum commented 6 months ago

Here they are. a0.java.txt a1.java.txt

jrfaller commented 6 months ago

Great!

Ok with a quick gumtree parse a0.java

I get

CompilationUnit [0,114]
    TypeDeclaration [0,113]
        TYPE_DECLARATION_KIND: class [0,5]
        SimpleName: Foo [6,9]
        MethodDeclaration [14,111]
            Modifier: public [14,20]
            Modifier: static [21,27]
            Modifier: final [28,33]
            Modifier: public [49,55]
            Modifier: static [56,62]
            PrimitiveType: void [63,67]
            SimpleName: main [68,72]
            Block [75,111]
                ExpressionStatement [81,107]
                    MethodInvocation [81,106]
                        METHOD_INVOCATION_RECEIVER [81,91]
                            QualifiedName: System.out [81,91]
                        SimpleName: println [92,99]
                        METHOD_INVOCATION_ARGUMENTS [100,105]
                            StringLiteral: "bar" [100,105]

The field declaration is strangely completely ignored... This is strange. I have to look in the jdt generator code.

jrfaller commented 6 months ago

OK by opening the file I realize that there is no return type for the attribute which is a syntax error. If you add the type String to both files it works. I guess the big question is why there is no SyntaxError raised by JDT in this case ;-)

codinuum commented 6 months ago

Confirmed. Thanks! How about comparing another pair of files with python-treesitter? a0.py.txt a1.py.txt String literals are missing in the parse trees...

jrfaller commented 6 months ago

Working on my side, no?

module [0,113]
    expression_statement [0,9]
        assignment [0,9]
            identifier: s [0,1]
            =: = [2,3]
            string: 'bar' [4,9]
    function_definition [12,112]
        identifier: match [16,21]
        parameters [21,30]
            identifier: command [22,29]
        block [36,112]
            return_statement [36,112]
                parenthesized_expression [43,112]
                    boolean_operator [44,111]
                        comparison_operator [44,74]
                            string: 'push' [44,50]
                            attribute [54,74]
                                identifier: command [54,61]
                                identifier: script_parts [62,74]
                        and: and [75,78]
                        comparison_operator [79,111]
                            string: 'set-upstream' [79,93]
                            attribute [97,111]
                                identifier: command [97,104]
                                identifier: output [105,111]⏎
codinuum commented 6 months ago

Parsing results in the following in my environment...

module [0,113]
    expression_statement [0,9]
        assignment [0,9]
            identifier: s [0,1]
            =: = [2,3]
            string [4,9]
                ": ' [4,5]
                ": ' [8,9]
    function_definition [12,112]
        def: def [12,15]
        identifier: match [16,21]
        parameters [21,30]
            (: ( [21,22]
            identifier: command [22,29]
            ): ) [29,30]
        :: : [30,31]
        block [36,112]
            return_statement [36,112]
                return: return [36,42]
                parenthesized_expression [43,112]
                    (: ( [43,44]
                    boolean_operator [44,111]
                        comparison_operator [44,74]
                            string [44,50]
                                ": ' [44,45]
                                ": ' [49,50]
                            in: in [51,53]
                            attribute [54,74]
                                identifier: command [54,61]
                                .: . [61,62]
                                identifier: script_parts [62,74]
                        and: and [75,78]
                        comparison_operator [79,111]
                            string [79,93]
                                ": " [79,80]
                                ": " [92,93]
                            in: in [94,96]
                            attribute [97,111]
                                identifier: command [97,104]
                                .: . [104,105]
                                identifier: output [105,111]
                    ): ) [111,112]
tsantalis commented 6 months ago

@jrfaller I can confirm that JDT Parser skips parts of the AST that are syntactically invalid. Missing type in a field declaration makes it syntactically invalid. JDT Parser will continue parsing the rest of the file.

jrfaller commented 6 months ago

@tsantalis yes just what I realized. On top of that, neither the flag MALFORMED nor the flag RECOVERED seems to work in these cases. The only thing I get is that I have one "problem" in the problem list when casting the node to a CompilationUnit. I guess it could work better to detect syntax errors but I fear that it will report more than just syntax errors 😢 WDYT?

codinuum commented 6 months ago

@jrfaller Updating tree-sitter-parser resolved the problem. Thanks a lot!

tsantalis commented 6 months ago

@tsantalis yes just what I realized. On top of that, neither the flag MALFORMED nor the flag RECOVERED seems to work in these cases. The only thing I get is that I have one "problem" in the problem list when casting the node to a CompilationUnit. I guess it could work better to detect syntax errors but I fear that it will report more than just syntax errors 😢 WDYT?

For commit analysis, which cannot be guaranteed to be syntactically valid and compilable, I think what JDT Parser is doing is fine. Instead of failing the whole file, it just skips the syntactically invalid sub-trees. There should be a way to isolate syntax errors from the problem list. I will look into it.

codinuum commented 6 months ago

That reminds me of something. I have seen conflict markers >>>>>>>, =======, and <<<<<<< left even in release versions of some projects.

tsantalis commented 1 month ago

@jrfaller

We discovered that JDT ASTParser recommends a version to setup the parser, when it finds a parsing problem. For example, when parsing a record declaration with 1.8 compatibility, it generates a problem suggesting the use of Java 16 compatibility to parse record declarations.

We basically process the reported problems and re-parse the file with the recommended compatibility version. Here is our implementation in RefactoringMiner.

https://github.com/tsantalis/RefactoringMiner/blob/master/src/main/java/gr/uom/java/xmi/UMLModelASTReader.java#L160-L177

jrfaller commented 1 month ago

Hi @tsantalis and thanks a lot for the tips. In the last commit I switched to private final static String JAVA_VERSION = JavaCore.latestSupportedJavaVersion(); for compiler settings and ASTParser parser = ASTParser.newParser(AST.getJLSLatest()); for the parser initialization. Do you think anything can go wrong like this or parsers are backward compatible? Cheers!

tsantalis commented 1 month ago

@jrfaller The parser initialization ASTParser parser = ASTParser.newParser(AST.getJLSLatest()); is totally fine and creates no problems.

The problem is with the compiler settings. For example, since Java 16 record is a keyword reserved for record type declarations. But before Java 16, you could use record as a variable name in your code.

(The same thing happened many years ago when enum was introduced as a keyword in Java 1.5, but now it is very rare to find a commit using enum as a variable. The code should be older than 2004.)

If you put the latest Java version in the compiler settings, and parse a code using record as a variable the parser will not generate a complete AST and will skip the parts using record as a variable name.

The best strategy is to start with a lower version compiler setting. We start from JavaCore.VERSION_14 because we found that since this version the parser generates recommendations about what version to use to parse some code, which was not successfully parsed.

If we get such a recommendation in the problems, we re-parse the file using the recommended version in the compiler settings.

jrfaller commented 1 month ago

@tsantalis thanks a lot for the detailed explanation (and for pointing me the fix!)