case_statement do not include first comment line in scope_statement_list:

FabArd commented 1 year ago

Hi,

In the code below, the comment is not included in the "scope_statement_list:" of "case_statement:" which makes indentation incorrect for this line.

override bool handleAction(const Action a) {
    switch (a.id) {
        case IDEActions.EditPreferences:
            //showPreferences();
            return true;
        default:
            return super.handleAction(a);
    }
    return false;
}

                    case_statement:
                      argument_list:
                        postfix_expression:
                          primary_expression:
                            identifier:
                          identifier:
                      line_comment:
                      scope_statement_list:
                        statement_list_no_case_no_default:
                          return_statement:
                            expression:
                              primary_expression:

All lines after a "case" should be included in the scope of this "case" even if it is a comment.

CyberShadow commented 1 year ago

Hmm, both interpretations are correct... What about:

case 1: // comment
case 2:

Here, there is no StatementList, only a LineComment.

What do you think?

How do other grammars solve this problem?

WebFreak001 commented 1 year ago

in libdparse

case 1: // comment
case 2:
    foo();
    break;
case 3:
    bar();
    goto default;
default:
    baz();

would be

CaseStatement 1 []
CaseStatement 2 [foo(), BreakStatement]
CaseStatement 3 [bar(), GotoStatement]
DefaultStatement [baz()]

a bunch of fluff around each ast node. Basically libdparse differentiates between parseStatements and parseStatementsNoCaseNoDefault. The latter is used inside cases and defaults to stop at case/default statements.

CyberShadow commented 1 year ago

Well, that's not the problem - the question is what to do with the comment node.

Could this be handled by an indentation rule instead of changing the tree-sitter grammar?

WebFreak001 commented 1 year ago

we changed to what Roslyn does, they describe it in a bunch of detail. Basically the lexer first lexes all the tokens, then all comment and whitespace tokens, called "trivia", are moved into adjacent "source tokens". AST nodes then contain a slice of their source tokens, which in turn reference the trivia tokens, so you can reconstruct source code exactly as it was input, mostly without caring for whitespace or comments when moving stuff around.

I don't think it's a good idea to include comments inside the AST as dedicated nodes because they can basically appear anywhere and would make AST processing a whole lot more painful.

CyberShadow commented 1 year ago

Thank you for the insight. This repository hosts (and this issue discusses) an implementation of a D grammar for tree-sitter. As such, we need to have comments as separate nodes in order to be able to syntax-highlight (or, here, indent) them. I may have misunderstood and I apologize if I did, but your suggestion is inapplicable to the issue at hand.

The D grammar (and therefore this tree-sitter grammar, which is generated from the D grammar) does allow comments to appear anywhere. In the D grammar this is done by describing this informally, but in tree-sitter a special feature is used ("extras"). The feature still does cause comments to appear as nodes in the tree for the reasons above.

FabArd commented 1 year ago

To define the context of the problem, we have two entities:

the d language and its syntax which is used by its compiler
tree-sitter

The objectives of each are different:

the syntax of the D language is to allow its compiler to generate machine code
tree-sitter allows the programmer to write programs to be compiled by the compiler

Both objectives are as essential as each other.

Indeed a comment is totally useless for a compiler: it will not generate any code. On the other hand, for tree-sitter a comment is very important because it must be highlighted by the coloring and indented so that the programmer can take it into account.

It seems to me that the problem is not the comment but where it should be placed in the AST :

As WebFreak001 mentioned each "case" should have a block even if it is empty:

CaseStatement 1 []
CaseStatement 2 [foo(), BreakStatement]
CaseStatement 3 [bar(), GotoStatement]
DefaultStatement [baz()]

So in my opinion (correct me if I'm wrong), the solution to this problem would be to systematically have a "scope_statement_list:" node after each "case".

For the following code :

switch (X) {
    case test1: // Comment1
    case test2:
        // Comment2
    case test3:
        // Comment3;
        return true;
    default:
        return false;
}

We should have :

switch_statement:
  expression:
    primary_expression:
      identifier:
  block_statement:
    statement_list:
      case_statement:  --> case test1:
        argument_list:
          primary_expression:
            identifier:
        scope_statement_list:
          line_comment:  -- > //comment 1
      case_statement:  --> case test2:
        argument_list:
          primary_expression:
            identifier:
        scope_statement_list:
          line_comment:  -- > //comment 2
      case_statement:  --> case test3:
        argument_list:
          primary_expression:
            identifier:
        scope_statement_list:
          line_comment: -- > //comment 3
            return_statement:
              expression:
                primary_expression:
      default_statement:  --> default case
        scope_statement_list:
          return_statement:
            expression:
              primary_expression:

We have a similar problem with the 'if' without block. But I'll open a new issue to not complicate this one.

CyberShadow commented 1 year ago

So in my opinion (correct me if I'm wrong), the solution to this problem would be to systematically have a "scope_statement_list:" node after each "case".

Indeed, but I am uneasy about this approach because 1) it deviates from the formal language grammar, and 2) it changes the meaning of the code. The presence of a statement list is significant, due to things like fallthrough and case ranges. Consider e.g.:

switch (expr)
{
// Indent this where?
case 1:
// Indent this where?
..
// Indent this where?
case 3:
// Indent this where?
case 4:
// Indent this where?
    break;

// Indent this where? This could be a description of case 5 below.
case 5:

I think an indentation rule would make more sense here, would it be possible to do that instead?

FabArd commented 1 year ago

Why did you say : "it deviates from the formal language grammar" :

11.12.1 No Implicit Fall-Through

A ScopeStatementList must either be empty, or be ended with a ContinueStatement, BreakStatement,
ReturnStatement, GotoStatement, ThrowExpression or assert(0) expression unless this is the last case.

It is clearly written : "must either be empty" so it does not changes the meaning of the code.

Logically everything that is written after the sign ":" of a case means that they belong to the scope of this case. In other words, the scope of a "case" starts at ":" and ends at the next case, default or "}" of the switch.

In you example the indentation should be (whith an indentation length of 4):

switch (expr)
{
    // this comment belong to the switch scope
    case 1:
        // this comment belong to case 1 scope
        ..
        // this comment belong to case 1 scope also
    case 3:
        // this comment belong to case 3 scope
    case 4:
        // this comment belong to case 4 scope
        break;

        // this comment belong to case 4 scope also
    case 5:
...

You can also have :

switch (expr)
{
    case 1: //this comment belong to case 1 scope
        //this comment belong to case 1 scope also
    case 2:
    ...
}

CyberShadow commented 1 year ago

Because, depending on how you look at it, an empty statement list is not the same as the absence of a statement list. Otherwise, the grammar is redundant / ambiguous / erroneous in allowing both an empty ScopeStatementList and the absence of a ScopeStatementList, and we should fix it there.

CyberShadow commented 1 year ago

Looking at this again, the above can't work because tree-sitter does not support potentially-empty rules (rules that match the empty string) and the D grammar doesn't have any such rules either.

Honestly I don't see how it's even possible to fix it in the tree-sitter (or D) grammar. The comment rules are in extras. We can't make ScopeStatementList non-optional. We don't have a mechanism of telling tree-sitter where to put the comment nodes. The only way to fix this would be using an indentation rule. If we had a mechanism that would allow moving extras nodes like comments around after the tree is built, such a fix could go there, but there is no such mechanism.

As such, as far as I can see, this is not possible to fix by changing anything in this repository. If someone would like to contribute an initial set of indentation rules, we can look at fixing it there, but currently there is nothing to fix... so I'm going to close this issue for now.

CyberShadow commented 1 year ago

@FabArd Would you like to submit the work you've done for integrating this grammar into editors so far? We can then iterate to improve it here.

CyberShadow / tree-sitter-d

case_statement do not include first comment line in scope_statement_list: #10