antlr / grammars-v4

Grammars written for ANTLR v4; expectation that the grammars are free of actions.
MIT License

[swift]improve multiline nested comment lexer rule for swift2&swift3&swift5 #4184

Open hollowrider opened 1 month ago

hollowrider commented 1 month ago

The Block_comment lexer rule can't handle input such as /*/**/ and produces an unexpected result. The current Block_comment rule is:

Block_comment: '/*' (Block_comment | .)*? '*/' -> channel(HIDDEN);

To fix that, I made a small change to the Block_comment rule:

Block_comment: '/*' (Block_comment | '/' ~'*'|~'/')*? '*/' -> channel(HIDDEN);

This rule rejects a /* sequence inside Block_comment unless it opens a properly terminated nested comment, so nested comments are matched correctly. I found this defect in the swift2, swift3, and swift5 lexer files, and it may also exist in other grammars that allow nested multiline comments.
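A rough way to compare the two rules is to list, for each one, every index at which a Block_comment starting at a given offset could end. The Python sketch below does that for a shortened version of the failing input. The names comment_ends, strict, and sample are hypothetical, and the sketch treats each rule as a plain pattern; it does not model how ANTLR resolves the non-greedy loop.

```python
from functools import lru_cache


def comment_ends(text, start, strict):
    """List every index where a Block_comment opening at `start` could end.

    strict=False models: '/*' (Block_comment | .)*? '*/'
    strict=True  models: '/*' (Block_comment | '/' ~'*' | ~'/')*? '*/'
    This is an illustration only, not ANTLR's matching algorithm.
    """

    @lru_cache(maxsize=None)
    def ends(pos):
        if not text.startswith("/*", pos):
            return frozenset()
        results, seen, work = set(), set(), [pos + 2]
        while work:
            i = work.pop()
            if i in seen:
                continue
            seen.add(i)
            if text.startswith("*/", i):
                results.add(i + 2)          # closing delimiter found here
            work.extend(ends(i))            # alternative: nested Block_comment
            if i < len(text):
                if not strict or text[i] != "/":
                    work.append(i + 1)      # '.' or ~'/'
                elif i + 1 < len(text) and text[i + 1] != "*":
                    work.append(i + 2)      # '/' ~'*'
        return frozenset(results)

    return sorted(ends(start))


sample = "/*/**/\nlet x = 1\n/** c */\nif true {}"
print(comment_ends(sample, 0, strict=False))  # [6, 25]
print(comment_ends(sample, 0, strict=True))   # []
print(comment_ends(sample, 2, strict=True))   # [6]
```

With the original rule, one candidate end is the close of the later /** ... */ comment, which is consistent with a single long token. With the proposed rule, a comment opened at offset 0 can never be closed in this input, so the lexer has to fall back to other tokens there.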

The Swift code that triggers the problem is below:

/*/**/
let _: [Any] = [
    0, 1, 1.0, 1.0e+1, 1e+1, true,
    "Hello, world!", "Hello, \(1)!", "Hello, \(1.0e+1)!", "Hello, \(Int.max)!",
    (nil == nil)
]
/** 
another comment
*/
if 10 < 20{
    if 10 < 20{
    }
}

With the original Block_comment rule, it is tokenized like this:

[@0,0:228='/*/**/\r\n\r\nlet _: [Int] = []\r\nlet _ = [1, 2, 3]\r\nlet _: [Any] = [\r\n    0, 1, 1.0, 1.0e+1, 1e+1, true,\r\n    "Hello, world!", "Hello, \(1)!", "Hello, \(1.0e+1)!", "Hello, \(Int.max)!",\r\n    (nil == nil)\r\n]\r\n/** \r\nanother comment\r\n*/',<Block_comment>,channel=1,1:0]

After fixing this defect, the lexer output looks like this, and the parser then throws an exception as expected.

[@0,0:0='/',<'/'>,1:0]
[@1,1:1='*',<'*'>,1:1]
[@2,2:5='/**/',<Block_comment>,channel=1,1:2]

msagca commented 1 month ago

Hi @hollowrider,

Formal syntax rules associated with comments in the documentation are as follows:

comment → // comment-text line-break
multiline-comment → /* multiline-comment-text */
comment-text → comment-text-item comment-text?
comment-text-item → Any Unicode scalar value except U+000A or U+000D
multiline-comment-text → multiline-comment-text-item multiline-comment-text?
multiline-comment-text-item → multiline-comment
multiline-comment-text-item → comment-text-item
multiline-comment-text-item → Any Unicode scalar value except /* or */

Doesn't your input violate these rules since it contains an unmatched /*? It is not a nested comment because it's not a comment since it's not terminated by */. Maybe I'm interpreting the syntax rules wrong.
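One way to read these rules is as a depth counter: every /* opens a level, every */ closes one, and the comment is complete only when the depth returns to zero. The Python sketch below is an illustration under that reading, not code from the grammar or from the Swift compiler, and scan_block_comment is a hypothetical name.

```python
def scan_block_comment(text, start=0):
    """Return the index just past the comment opening at `start`,
    or None if no comment starts there or it is never terminated."""
    if not text.startswith("/*", start):
        return None
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("/*", i):
            depth += 1              # opening delimiter, one level deeper
            i += 2
        elif text.startswith("*/", i):
            depth -= 1              # closing delimiter, one level up
            i += 2
            if depth == 0:
                return i            # outermost comment is closed
        else:
            i += 1                  # ordinary comment text
    return None                     # ran out of input while still nested


print(scan_block_comment("/**/"))               # 4
print(scan_block_comment("/* a /* b */ c */"))  # 17
print(scan_block_comment("/*/**/"))             # None
```

Under this reading, /*/**/ opens two levels and closes only one, so it is not a complete multiline-comment.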

hollowrider commented 1 month ago

@msagca Thanks for your comment. Exactly, /*/**/ violates these rules. However, the problem I ran into is that when a user supplies a Swift file with a syntax mistake like this, the parser gives unexpected output. The Swift input below contains /*/**/, and it should definitely raise an error because it violates the rules you listed. However, when I parse this file, no errors are reported.

/*/**/
let _: [Any] = [
    0, 1, 1.0, 1.0e+1, 1e+1, true,
    "Hello, world!", "Hello, \(1)!", "Hello, \(1.0e+1)!", "Hello, \(Int.max)!",
    (nil == nil)
]
/** 
another comment
*/
if 10 < 20{
    if 10 < 20{
    }
}

If you use grun with the -tokens option to inspect this file, you will see why. The lexer recognizes everything from line 1 to line 9 as a single Block_comment (called multiline-comment in the Swift book). The lexer token output is below:

[@0,0:188='/*/**/\r\nlet _: [Any] = [\r\n    0, 1, 1.0, 1.0e+1, 1e+1, true,\r\n    "Hello, world!", "Hello, \(1)!", "Hello, \(1.0e+1)!", "Hello, \(Int.max)!",\r\n    (nil == nil)\r\n]\r\n/** \r\nanother comment\r\n*/',<Block_comment>,channel=1,1:0]

This isn't what I expect. To fix that, I suggest changing the Block_comment rule as shown below. The changed lexer tokenizes the leading / and * separately from the multiline comment that follows, and the parser then raises an error.

Block_comment: '/*' (Block_comment | '/' ~'*'|~'/')*? '*/' -> channel(HIDDEN);

Here is the lexer output after changing the rule:

[@0,0:0='/',<'/'>,1:0]
[@1,1:1='*',<'*'>,1:1]
[@2,2:5='/**/',<Block_comment>,channel=1,1:2]
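The fallback to separate / and * tokens can be imitated with a small hand-written tokenizer: at each offset, try to match a properly terminated nested comment, and emit a single-character token if that fails. This is a sketch using simple depth counting, not ANTLR's algorithm, and the names comment_end and tokenize are hypothetical.

```python
def comment_end(text, start):
    """Index just past a terminated nested comment at `start`, else None."""
    if not text.startswith("/*", start):
        return None
    depth, i = 0, start
    while i < len(text):
        if text.startswith("/*", i):
            depth, i = depth + 1, i + 2
        elif text.startswith("*/", i):
            depth, i = depth - 1, i + 2
            if depth == 0:
                return i
        else:
            i += 1
    return None


def tokenize(text):
    """Split text into (type, start, stop, lexeme) tuples, stop inclusive."""
    tokens, i = [], 0
    while i < len(text):
        end = comment_end(text, i)
        if end is not None:
            tokens.append(("Block_comment", i, end - 1, text[i:end]))
            i = end
        else:
            # No terminated comment here: emit the character on its own.
            tokens.append(("char", i, i, text[i]))
            i += 1
    return tokens


for token in tokenize("/*/**/"):
    print(token)
```

For the input /*/**/ this yields a / token at offset 0, a * token at offset 1, and a Block_comment covering offsets 2 to 5, the same split as the lexer output above.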

To be honest, I'm not an experienced ANTLR grammar writer, but I wanted to share the problem I ran into and help improve the .g4 files. Do you think this could work?