antlr / grammars-v4

Grammars written for ANTLR v4; expectation that the grammars are free of actions.
MIT License

Parsing Lua long comment as short comment #3741

Open bendrissou opened 12 months ago

bendrissou commented 12 months ago

Hi,

The following valid Lua input cannot be successfully parsed by the current antlr grammar:

local function name_s ( ) --[===[]===]  end    

The error output:

line 1:26 mismatched input '--[===[]===]  end    ' expecting 'end'

After some inspection, I found that the lexer treats the input --[===[]===] end as a line comment (short comment), so the parser then complains about a missing end token.

I think the solution would be to set parsing preference for long comment over short comment.

msagca commented 12 months ago

Hi @bendrissou

Updating the line comment as follows resolves this issue but I'm not sure if this change complies with the spec.

LINE_COMMENT
    : '--' ( ~[\r\n\u005b\u0085\u2028\u2029] SingleLineInputCharacter* )? -> channel(HIDDEN)
    ;

msagca commented 12 months ago

From the spec:

A comment starts with a double hyphen (--) anywhere outside a string. If the text immediately after -- is not an opening long bracket, the comment is a short comment, which runs until the end of the line. Otherwise, it is a long comment, which runs until the corresponding closing long bracket.

bendrissou commented 11 months ago

Hi @msagca

That does resolve the issue. But now we can't have line comments that start with the symbol [.

msagca commented 11 months ago

@bendrissou

I'm not sure what you mean by line comments starting with [, can you give an example?

bendrissou commented 11 months ago

Here is a short example:

local function name_s ( ) --[aaa      
end

This is rejected by the new grammar. But accepted by the Lua compiler.
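
To see why the patched rule stops covering --[aaa, the rule can be emulated with a regular expression. This is only a rough sketch: it assumes that SingleLineInputCharacter (not shown in this thread) means "any character that is not a line terminator", and the names PATCHED_LINE_COMMENT and matched_text are made up for the illustration.

```python
import re

# Assumption: SingleLineInputCharacter is any character except a line terminator.
SINGLE_LINE_CHAR = r"[^\r\n\u0085\u2028\u2029]"

# Emulation of the patched LINE_COMMENT rule: after "--", the first character
# may not be "[" (\u005b) or a line terminator; the rest of the line follows.
PATCHED_LINE_COMMENT = re.compile(
    r"--(?:[^\r\n\u005b\u0085\u2028\u2029]" + SINGLE_LINE_CHAR + r"*)?"
)


def matched_text(line):
    """Return the prefix of `line` that the emulated rule would consume."""
    m = PATCHED_LINE_COMMENT.match(line)
    return m.group(0) if m else None


print(repr(matched_text("-- plain comment")))  # whole line is consumed
print(repr(matched_text("---[aaa")))           # whole line is consumed
print(repr(matched_text("--[aaa")))            # only "--" is consumed
```

With this emulation, --[aaa yields only the two hyphens as a comment, so [aaa would presumably be handed to the parser as ordinary tokens, which is consistent with the rejection reported above.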

msagca commented 11 months ago

@bendrissou

The paragraph I quoted earlier from the spec suggests that --[aaa shall be treated as a long comment. In this case, you could prepend an additional hyphen to make it ---[aaa, then it should be recognized as a short comment. Am I getting it wrong?

bendrissou commented 11 months ago

Hi @msagca

Yes, you are right. This conforms to the spec.

Though the official implementation seems to treat --[aaa as a short comment, which should not be the case. Instead it should be an incomplete long comment.

kaby76 commented 11 months ago

> Though the official implementation seems to treat --[aaa as a short comment, which should not be the case. Instead it should be an incomplete long comment.

Lua 5.4.6 treats --[aaa as a line comment, not a multi-line comment. The doc states that an opening long bracket is an opening square bracket followed by zero or more equal signs followed by another opening square bracket. Trying many different examples and reading the lexer source code confirms this. If the text after -- is not a well-formed opening long bracket, the fall-through is a line comment.

I have a PR for Lua. I intend to add code to lex comments properly. Unfortunately, the code must count the number of '=' signs (a long comment of one level may contain long brackets of other levels), which means the grammar must now be split and written with target-specific base class code.
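
The behaviour described in this comment can be sketched as a small hand-written scanner. This is a standalone illustration only; it is neither the code from the PR nor a port of the Lua lexer, and the function names (long_bracket_level, scan_comment) are hypothetical. It checks for an opening long bracket after --, counts the equal signs, and falls through to a line comment otherwise.

```python
def long_bracket_level(src, pos):
    """Return the level n if src[pos:] starts with '[' + n * '=' + '[', else -1."""
    if pos >= len(src) or src[pos] != "[":
        return -1
    i = pos + 1
    while i < len(src) and src[i] == "=":
        i += 1
    if i < len(src) and src[i] == "[":
        return i - pos - 1
    return -1


def scan_comment(src, pos):
    """Scan the comment that starts with '--' at index pos.

    Returns (kind, end), where end is the index just past the comment.
    """
    assert src.startswith("--", pos)
    body = pos + 2
    level = long_bracket_level(src, body)
    if level >= 0:
        # Long comment: the closer must carry the same number of '=' signs.
        closer = "]" + "=" * level + "]"
        close_at = src.find(closer, body + level + 2)
        if close_at < 0:
            raise ValueError("unfinished long comment")
        return "long", close_at + len(closer)
    # Fall-through: a short comment running to the end of the line.
    end = body
    while end < len(src) and src[end] not in "\r\n":
        end += 1
    return "short", end


src = "local function name_s ( ) --[===[]===]  end"
kind, end = scan_comment(src, src.index("--"))
print(kind, repr(src[end:]))
print(scan_comment("--[aaa\nend", 0))
```

On the input from the original report, the sketch classifies the comment as long and leaves the trailing end outside it, while --[aaa falls through to a short comment that stops at the newline.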

Dongyang0810 commented 11 months ago

3652

kaby76 commented 11 months ago

3652

@Dongyang0810 PR https://github.com/antlr/grammars-v4/pull/3752 handles this completely. You give two inputs: one before the picture in the initial comment, and a second one shown in the picture. I'll go through both.

Input 1:

--[[some comment ]] local A = 10

The parse tree is: [image 1]

This is correct. The long comment ends on line 1, column 19, and the non-commented code begins on column 20.

Input 2:

--[[aaa]]AA=1
--[a]=11
a=1

The parse tree is: [image 2]

This is correct. The long comment on line 1 ends on column 9, and the statement begins on line 1, column 10. The second line is entirely a single-line comment, because a long comment must open with two square brackets, with zero or more equal signs between them, and the closing bracket must contain the same number of equal signs. If it's not a long comment, it is a single-line comment, which here terminates at the end of line 2. Line 3 is a new assignment statement.
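
The column numbers for both inputs can be checked with a short script. This is a rough sketch, not the grammar's implementation: it uses a regular-expression backreference to force the closing bracket to carry the same number of equal signs as the opening one. It ignores string literals, and it treats an unterminated long comment as a short comment, which is a simplification. The names LONG_COMMENT, SHORT_COMMENT and strip_comments are invented for the example.

```python
import re

# "--[" + n * "=" + "[" ... "]" + n * "=" + "]"; \1 repeats the captured '=' run.
LONG_COMMENT = re.compile(r"--\[(=*)\[.*?\]\1\]", re.DOTALL)
# Anything else after "--" runs to the end of the line.
SHORT_COMMENT = re.compile(r"--[^\r\n]*")


def strip_comments(src):
    """Return src with comments removed (string literals are not handled)."""
    out = []
    pos = 0
    while pos < len(src):
        if src.startswith("--", pos):
            # Try the long form first, then fall through to the short form.
            m = LONG_COMMENT.match(src, pos) or SHORT_COMMENT.match(src, pos)
            pos = m.end()
        else:
            out.append(src[pos])
            pos += 1
    return "".join(out)


input1 = "--[[some comment ]] local A = 10"
input2 = "--[[aaa]]AA=1\n--[a]=11\na=1"

print(LONG_COMMENT.match(input1).end())  # length of the long comment in Input 1
print(LONG_COMMENT.match(input2).end())  # length of the long comment in Input 2
print(repr(strip_comments(input2)))
```

For Input 2, the long-comment pattern does not match --[a]=11, so the whole second line is consumed as a short comment and only the two assignments remain.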

Discussion

Long comments cannot be lexed without semantic predicates. Matching the closing long bracket must include counting the number of equal signs, just as the lexer in the Lua source code does. In fact, the PR defines functions with the same names and almost the same code as in the Lua source code, which should make it easier to maintain if the source code itself changes. I.e., just do what the Lua interpreter does; don't think. (In fact, I was debugging the ANTLR-generated parser side by side with a debugger on the C code for Lua.) That said, I am disappointed that the grammar given in the manual is not exactly the grammar implemented in the Lua parser source code.