GumTreeDiff / gumtree

An awesome code differencing tool
https://github.com/GumTreeDiff/gumtree/wiki
GNU Lesser General Public License v3.0

Incorrect Line Numbers When Parsing C# #200

glGarg closed this issue 4 years ago

glGarg commented 4 years ago

I am getting incorrect positions for nodes in the parse tree when parsing C# code. Here is a simple test case that fails:

class A
{
    private int i;
}

Output:

"root": {
    "type": "unit",
    "pos": "0",
    "length": "0",
    "children": [
        {
            "type": "class",
            "pos": "0",
            "length": "29",
            "children": [
                {
                    "type": "name",
                    "label": "A",
                    "pos": "6",
                    "length": "1",
                    "children": []
                },
                {
                    "type": "block",
                    "pos": "7",
                    "length": "22",
                    "children": [
                        {
                            "type": "decl_stmt",
                            "pos": "13",
                            "length": "14",
                            "children": [
                                {
                                    "type": "decl",
                                    "pos": "13",
                                    "length": "13",
                                    "children": [
                                        {
                                            "type": "type",
                                            "pos": "13",
                                            "length": "11",
                                            "children": [
                                                {
                                                    "type": "specifier",
                                                    "label": "private",
                                                    "pos": "13",
                                                    "length": "7",
                                                    "children": []
                                                },
                                                {
                                                    "type": "name",
                                                    "label": "int",
                                                    "pos": "21",
                                                    "length": "3",
                                                    "children": []
                                                }
                                            ]
                                        },
                                        {
                                            "type": "name",
                                            "label": "i",
                                            "pos": "25",
                                            "length": "1",
                                            "children": []
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}

The declaration statement private int i; starts at position 14, but the parser outputs 13. Please let me know if I'm doing something wrong.
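As a sanity check on the expected values, the offsets can be computed directly from the source string. This is only a minimal sketch: it assumes the file uses "\n" line endings and four spaces of indentation, and the class OffsetCheck is a made-up helper, not part of GumTree.

```java
// Hypothetical helper, not part of GumTree: computes 0-based character
// offsets in the first test case, assuming "\n" line endings.
class OffsetCheck {
    static final String SOURCE = "class A\n{\n    private int i;\n}";

    // 0-based offset of the first occurrence of token in SOURCE.
    static int offsetOf(String token) {
        return SOURCE.indexOf(token);
    }

    public static void main(String[] args) {
        System.out.println("{       -> " + offsetOf("{"));
        System.out.println("private -> " + offsetOf("private"));
        System.out.println("int     -> " + offsetOf("int"));
    }
}
```

Under these assumptions the block starts at 8, the declaration statement at 14, and the int type name at 22, each one more than the reported 7, 13, and 21, while the name A on the first line is reported correctly at 6.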

glGarg commented 4 years ago

Here is another test case:

class A
{
    private int a;
    private int b;
}

Output:

unit [0,0]
    class [0,48]
        name: A [6,7]
        block [7,48]
            decl_stmt [13,27]
                decl [13,26]
                    type [13,24]
                        specifier: private [13,20]
                        name: int [21,24]
                    name: a [25,26]
            decl_stmt [32,46]
                decl [32,45]
                    type [32,43]
                        specifier: private [32,39]
                        name: int [40,43]
                    name: b [44,45]

The above test case is exactly 50 characters long, so the class node should end at index 49, i.e. class [0,49]. The correct output should therefore be:

unit [0,0]
    class [0,49]
        name: A [6,7]
        block [8,49]
            decl_stmt [14,28]
                decl [14,27]
                    type [14,25]
                        specifier: private [14,21]
                        name: int [22,25]
                    name: a [26,27]
            decl_stmt [33,47]
                decl [33,46]
                    type [33,44]
                        specifier: private [33,40]
                        name: int [41,44]
                    name: b [45,46]

If we compare the outputs, we see that the bug only affects nodes that start after the first line (block, decl_stmt, etc.). Having looked at the code, I think I know where this is coming from. It looks like an off-by-one bug in the positionFor() function in LineReader.java: https://github.com/GumTreeDiff/gumtree/blob/8ad77b5f882aa9e102401f71ac5a42133a29867d/core/src/main/java/com/github/gumtreediff/io/LineReader.java#L50-L56

I understand that the column - 1 in the return statement is intended to shift the column indices coming from srcML, since they start counting at 1. But this -1 is not actually needed: the lines array already stores the positions of the newlines, and column acts as the distance from the preceding newline to where the node begins. Besides dropping the -1, all we need is to initialize lines with [-1] instead of [0], because there is no newline at index 0; the first line behaves as if it were preceded by a newline at index -1. The 0 initialization is what makes the -1 necessary in the first place.
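The reasoning above can be illustrated with a simplified model of the line index. This is a sketch under assumptions, not the actual LineReader source: the class LineIndexSketch and its method names are invented for illustration, the text is scanned from a String rather than a Reader, and the bounds check is only there to keep the sketch safe.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the line-index logic discussed above.
class LineIndexSketch {
    // Returns the offsets of all '\n' characters, preceded by a sentinel
    // entry that stands in for the "newline" before line 1.
    static List<Integer> newlineOffsets(String text, int sentinel) {
        List<Integer> lines = new ArrayList<>();
        lines.add(sentinel);
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                lines.add(i);
            }
        }
        return lines;
    }

    // Behaviour described in the issue: sentinel 0 plus "column - 1".
    // Correct on line 1, one too small on every later line.
    static int buggyPositionFor(String text, int line, int column) {
        List<Integer> lines = newlineOffsets(text, 0);
        if (line < 1 || lines.size() < line) return -1;
        return lines.get(line - 1) + column - 1;
    }

    // Proposed fix: sentinel -1 and no extra "- 1".
    static int fixedPositionFor(String text, int line, int column) {
        List<Integer> lines = newlineOffsets(text, -1);
        if (line < 1 || lines.size() < line) return -1;
        return lines.get(line - 1) + column;
    }

    public static void main(String[] args) {
        String src = "class A\n{\n    private int i;\n}";
        // Line and column are 1-based, as described for srcML.
        System.out.println("line 1, col 7: buggy=" + buggyPositionFor(src, 1, 7)
                + " fixed=" + fixedPositionFor(src, 1, 7));
        System.out.println("line 3, col 5: buggy=" + buggyPositionFor(src, 3, 5)
                + " fixed=" + fixedPositionFor(src, 3, 5));
    }
}
```

With 1-based line and column values, both variants agree on the first line (the name A at line 1, column 7 maps to 6), while for private at line 3, column 5 the buggy variant yields 13 and the fixed one 14, which matches the discrepancy reported above.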

I'll submit a PR for this fix.

jrfaller commented 4 years ago

I believe you're right :D Thanks a lot!

glGarg commented 4 years ago

No worries, thanks!