danipen / TextMateSharp

A port of tm4e to bring TextMate grammars to dotnet ecosystem
MIT License

Unexpected behavior in indexes and length. #63

Closed thempen closed 3 months ago

thempen commented 3 months ago

Hello,

I am confused about the StartIndex, EndIndex and Length properties of IToken. I was writing a demo that parses proto (ProtoBuf) messages. While parsing a comment line, the string had 26 chars, but EndIndex and Length both came out as 27.

This is the snippet of the textmate json:

    "comments": {
      "patterns": [
        {
          "name": "comment.block.proto",
          "begin": "/\\*",
          "end": "\\*/"
        },
        {
          "name": "comment.line.double-slash.proto",
          "begin": "//",
          "end": "$\\n?"
        }
      ]
    }

To me this looks like a bug. However, the example in the readme also contains code that corrects the index values:

    int startIndex = (token.StartIndex > line.Length) ? line.Length : token.StartIndex;
    int endIndex = (token.EndIndex > line.Length) ? line.Length : token.EndIndex;

Is this a nasty fix, or is it intended behavior? Why do I need this behavior? I would like to use the length values to detect possible parsing errors, but without knowing the behavior, this is not possible.

danipen commented 3 months ago

The API is probably a bit confusing. Just to clarify: in a "real world" application, the length and endIndex properties are not used. In that example, the endIndex is calculated from the next token's startIndex, or from the line length if there is no next token.
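The calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the example mentioned: `TokenSpans` and `FromStarts` are hypothetical names, and the tokens are reduced to a plain list of start indexes so the snippet does not depend on TextMateSharp.

```csharp
using System;
using System.Collections.Generic;

public static class TokenSpans
{
    // Hypothetical helper: builds [start, end) spans for one line.
    // The end of each token is the next token's start index, or the
    // line length when there is no next token. Every value is clamped
    // to the line length so a span never points past the text.
    public static List<(int Start, int End)> FromStarts(
        string line, IReadOnlyList<int> startIndexes)
    {
        var spans = new List<(int Start, int End)>(startIndexes.Count);
        for (int i = 0; i < startIndexes.Count; i++)
        {
            int start = Math.Min(startIndexes[i], line.Length);
            int end = i + 1 < startIndexes.Count
                ? Math.Min(startIndexes[i + 1], line.Length)
                : line.Length;
            spans.Add((start, end));
        }
        return spans;
    }
}
```

With this approach the reported end index of a token is never read, so a value that exceeds the line length (such as the 27 reported for a 26-char line in the question) cannot leak into the result.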

Right now it works like this. Imagine the following line:

    // comment

These are the indexes:

    0 1 2 3 4 5 6 7 8 9
    / /   c o m m e n t

Right now, TextMate is returning the following tokens:

    tokens[0]: startIndex=0; endIndex=2; length=2;
    tokens[1]: startIndex=2; endIndex=10; length=8;

The startIndex is inclusive, and the endIndex is not inclusive, so it works like an interval [startIndex, endIndex). The length is a calculated property.
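As a small sketch of the half-open convention, the snippet below derives the length and the covered text from a start and end index. `HalfOpen`, `Length` and `Slice` are hypothetical names used only for illustration; they are not part of the TextMateSharp API.

```csharp
using System;

public static class HalfOpen
{
    // Length of the interval [startIndex, endIndex): the end is exclusive,
    // so no "+ 1" is needed.
    public static int Length(int startIndex, int endIndex) => endIndex - startIndex;

    // Text covered by [startIndex, endIndex). Assumes both indexes lie
    // within the line; Substring throws otherwise.
    public static string Slice(string line, int startIndex, int endIndex) =>
        line.Substring(startIndex, Length(startIndex, endIndex));
}
```

For the line `// comment`, `Slice(line, 0, 2)` gives `//` and `Slice(line, 2, 10)` gives ` comment`, matching the two tokens listed above.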

You can see the implementation details here.

I agree that probably the API would be more intuitive using the following values:

    tokens[0]: startIndex=0; endIndex=1; length=2;
    tokens[1]: startIndex=2; endIndex=9; length=8;
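If the inclusive form is preferred in calling code, converting between the two conventions is a one-line adjustment. The helper below is a hypothetical sketch (`EndIndexConvention` and its methods are illustrative names, not library members).

```csharp
public static class EndIndexConvention
{
    // Exclusive end (as currently returned) -> inclusive end (index of the last char).
    public static int ToInclusiveEnd(int exclusiveEnd) => exclusiveEnd - 1;

    // Inclusive end -> exclusive end.
    public static int ToExclusiveEnd(int inclusiveEnd) => inclusiveEnd + 1;

    // With an inclusive end, the length needs the extra "+ 1".
    public static int LengthFromInclusive(int startIndex, int inclusiveEnd) =>
        inclusiveEnd - startIndex + 1;
}
```

One trade-off of the inclusive form is that an empty token would have an end index smaller than its start index, which the half-open form avoids.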

This implementation is a port of tm4e, so I just copied the behavior from that repo. I didn't want to change it, so that it keeps matching the implementation in the upstream repository.

thempen commented 3 months ago

Thank you very much for the quick response! Now it is clear to me.