antlr / grammars-v4

Grammars written for ANTLR v4; expectation that the grammars are free of actions.
MIT License

[bug]antlr4 deal with http protocol #1407

Open 0x9k opened 5 years ago

0x9k commented 5 years ago

Hi, I am using ANTLR4 to parse the HTTP protocol according to the RFC. When I define this grammar:


   OWS
    :   (SP | HTAB)*
    ;

  SP
    :   ' '
    ;

  HTAB
    :   '\t'
    ;

I get the error "mismatched input ' ' expecting OWS" in the IDEA plugin.

Does anyone have a good idea how to solve this problem? Thanks a lot.

The full grammar is as follows:

grammar test;

/*
    HTTP-message =
    start‑line
    *( header‑field  CRLF )
    CRLF
    [ message‑body ]
*/
http_message
    :   start_line (header_field CRLF)* CRLF message_body?
    ;

/*
    start-line =
    request‑line /  status‑line
*/
start_line
    :   request_line
    ;

/*
    request-line =
    method  SP  request‑target  SP  HTTP‑version  CRLF
*/
request_line
    :   method SP request_target SP http_version CRLF
    ;

/*
    method =
    token
    ; "GET"
    ; → RFC 7231 – Section 4.3.1
    ; "HEAD"
    ; → RFC 7231 – Section 4.3.2
    ; "POST"
    ; → RFC 7231 – Section 4.3.3
    ; "PUT"
    ; → RFC 7231 – Section 4.3.4
    ; "DELETE"
    ; → RFC 7231 – Section 4.3.5
    ; "CONNECT"
    ; → RFC 7231 – Section 4.3.6
    ; "OPTIONS"
    ; → RFC 7231 – Section 4.3.7
    ; "TRACE"
    ; → RFC 7231 – Section 4.3.8
*/
method
    :   'GET'
    |   'HEAD'
    |   'POST'
    |   'PUT'
    |   'DELETE'
    |   'CONNECT'
    |   'OPTIONS'
    |   'TRACE'
    ;

/*
    SP =
    %x20
    ; space
*/
SP
    :   ' '
    ;

/*
    request-target =
    origin-form /  absolute-form /  authority-form /  asterisk-form
*/
request_target
    :   origin_form
    ;

/*
    origin-form =
    absolute-path  [ "?"  query ]
*/
origin_form
    :   absolute_path ('?' query)?
    ;

/*
    absolute-path =
    1*(  "/"  segment )
*/
absolute_path
    :   ('/' segment)+
    ;

/*
    segment =
    *pchar
*/
segment
    :   pchar*
    ;

/*
    pchar =
    unreserved /  pct‑encoded /  sub‑delims /  ":" /  "@"
*/
pchar
    :   unreserved  |   pct_encoded |   sub_delims  |   ':' |   '@'
    ;

/*
    unreserved =
    ALPHA /  DIGIT /  "-" /  "." /  "_" /  "~"
*/
unreserved
    :   ALPHA   |   DIGIT   |   '-' |   '.' |   '_' |   '~'
    ;

/*
    ALPHA =
    %x41‑5A /  %x61‑7A
    ; A‑Z  /  a‑z
*/
ALPHA
    :   [A-Za-z]
    ;

/*
    DIGIT =
    %x30‑39
    ; 0-9
*/
DIGIT
    :   [0-9]
    ;

/*
    pct-encoded =
    "%"  HEXDIG  HEXDIG
*/
pct_encoded
    :   '%' HEXDIG HEXDIG
    ;

/*
    HEXDIG =
    DIGIT /  "A" /  "B" /  "C" /  "D" /  "E" /  "F"
*/
HEXDIG
    :   DIGIT   |   'A' |   'B' |   'C' |   'D' |   'E' |   'F'
    ;

/*
    sub-delims =
    "!" /  "$" /  "&" /  "'" /  "(" /  ")" /  "*" /  "+" /  "," /  ";" /  "="
*/
sub_delims
    :   '!' |   '$' |   '&' |   '\''    |   '(' |   ')' |   '*' |   '+' |   ',' |   ';' |   '='
    ;

/*
    query =
    *(  pchar /  "/" /  "?" )
*/
query
    :   (pchar | '/' | '?')*
    ;

/*
    HTTP-version =
    HTTP-name '/'  DIGIT  "."  DIGIT
*/
http_version
    :   http_name DIGIT '.' DIGIT
    ;

/*
    HTTP-name =
    %x48.54.54.50
    ; "HTTP", case-sensitive
*/
http_name
    :   'HTTP/'
    ;

/*
    CRLF =
    CR  LF
    ; Internet standard newline
*/
CRLF
    :   '\n'
    ;

/*
    header-field =
    field-name  ":"  OWS  field-value  OWS 
*/
header_field
    :   field_name ':' OWS field_value OWS
    ;

/*
    field-name =
    token
*/
field_name
    :   token
    ;

/*
    token
*/
token
    :   tchar+
    ;

/*
    tchar =
    "!" /  "#" /  "$" /  "%" /  "&" /  "'" /  "*" /  "+" /  "-" /  "." /  "^" /  "_" /  "`" /  "|" /  "~" /  DIGIT /  ALPHA
*/
tchar
    :   '!' |   '#' |   '$' |   '%' |   '&' |   '\''    |   '*' |   '+' |   '-' |   '.' |   '^' |   '_' |   '`' |   '|' |   '~' |   DIGIT   |   ALPHA
    ;

/*
    OWS =
    *( SP /  HTAB )
    ; optional whitespace
*/
OWS
    :   (SP | HTAB)*
    ;

/*
    HTAB =
    %x09
    ; horizontal tab
*/
HTAB
    :   '\t'
    ;

/*
    field-value =
    *( field-content /  obs-fold )
*/
field_value
    :   (field_content | obs_fold)*
    ;

/*
    field-content =
    field-vchar  [ 1*( SP  /  HTAB )  field-vchar ]
*/
field_content
    :   field_vchar ((SP | HTAB)+ field_vchar)?
    ;

/*
    field-vchar =
    VCHAR /  obs-text
*/
field_vchar
    :   VCHAR
    |   obs_text
    ;

/*
    VCHAR =
    %x21-7E
    ; visible (printing) characters
*/
VCHAR
    :   [\u0021-\u007e]
    ;

/*
    obs-text =
    %x80-FF
*/
obs_text
    :   OBS_TEXT
    ;
OBS_TEXT
    :   [\u0080-\u00ff]
    ;

/*
    obs-fold =
    CRLF  1*( SP /  HTAB )     ; see  RFC 7230 – Section 3.2.4
*/
obs_fold
    :   CRLF (SP | HTAB)+
    ;

/*
    message-body =
    *OCTET
*/
message_body
    :   OCTET*
    ;

/*
    OCTET =
    %x00-FF
    ; 8 bits of data
*/
OCTET
    :   [\u0000-0x00ff]
    ;

Marti2203 commented 5 years ago

Hello. Could you give me some test data, which both of us can make tests on?

0x9k commented 5 years ago

> Hello. Could you give me some test data, which both of us can make tests on?

Sure. For example

POST /url?sa=t&source=web&rct=j&url=https://zh.wikipedia.org/zh-hans/111&ved=2ahUKEwjhwLuRtbjiAhUPRK0KHRSjDpwQFjAKegQIAxAB HTTP/1.1
Host: www.google.com.hk
Connection: close
Content-Length: 4
Ping-From: https://www.google.com.hk/search?safe=strict&ei=gx3qXOKuJ4a8tgX-ypWIDA&q=111&oq=111&gs_l=psy-ab.3..0l10.15337.16373..16590...0.0..0.783.890.0j1j6-1......0....1..gws-wiz.....0.hUqCCrrBI9s
Origin: https://www.google.com.hk
Cache-Control: max-age=0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
Ping-To: https://zh.wikipedia.org/zh-hans/111
Content-Type: text/ping
Accept: */*
X-Client-Data: CIi2yQEIorbJAQjBtskBCKmdygEIqKPKAQjwpMoBCLGnygEI4qjKAQjxqcoBCK+sygEYz6rKAQ==
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
Cookie: NID=184=VqX86iUz6p-H_b2qbuogwjkmsk096DB-48jilOI9Pquzq8WT-aRbKsaH8UnMfvF9uHtuUtHhnJ7Z3F74bcpMNstJ5ADYV_tv09sXOJiwf3Yu-xsZ1E588v2tX6zA-J4K6c1t6t_PQP3jvtbVSdqw_YJqgU1elwvqkjzj0kBbk0I; 1P_JAR=2019-05-26-05; DV=42xzl48Lt5gpEFuauBIUhN0LQjoor5YtIbbBr4x5AQIAAAA

PING

OWS is optional whitespace

Marti2203 commented 5 years ago

Perfect. I will begin work on this tomorrow as I am away from my computer today.

0x9k commented 5 years ago

> Perfect. I will begin work on this tomorrow as I am away from my computer today.

Cool. ;)

Marti2203 commented 5 years ago

The first issue I see is the OCTET definition, which I reworked as:

OCTET
    :   '\u0000' .. '\u00ff'
    ;

Marti2203 commented 5 years ago

The second issue is that OWS can match the empty string, which triggers a warning that must be taken care of.
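
To illustrate why a token that can match the empty string is a problem, here is a minimal Python sketch. It uses regular expressions as stand-ins for the two possible OWS definitions; the names OWS_STAR, OWS_PLUS and match_length are made up for this example, and this is not how ANTLR itself is implemented.

```python
import re

# Stand-in for the lexer rule  OWS : (SP | HTAB)* ;  (assumption for illustration).
OWS_STAR = re.compile(r"[ \t]*")
# Stand-in for the non-empty variant  OWS : (SP | HTAB)+ ;
OWS_PLUS = re.compile(r"[ \t]+")


def match_length(pattern, text, pos):
    """Return how many characters the pattern consumes at pos, or None on failure."""
    m = pattern.match(text, pos)
    return None if m is None else m.end() - pos


text = "Host:value"
# The star form "succeeds" at every position while consuming nothing, so a
# lexer could keep producing empty OWS tokens without ever advancing.
star_lengths = [match_length(OWS_STAR, text, i) for i in range(len(text))]
# The plus form simply fails wherever there is no whitespace.
plus_lengths = [match_length(OWS_PLUS, text, i) for i in range(len(text))]
print(star_lengths)
print(plus_lengths)
```

With the plus form, the "optional" part of OWS can then be expressed in the parser rule (for example as OWS?) instead of inside the token.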

Marti2203 commented 5 years ago

The third is the warning 'rule http_message contains an optional block with at least one alternative that can match an empty string'. The culprit is message_body, which is optional and can also contain no octets. I see two possibilities: 1) drop the ? on message_body, or 2) define message_body as OCTET+. That is what I see so far. If everything is okay, I will look at the grammar and make a PR to add it to the repository.

Marti2203 commented 5 years ago

It is not easy to display the tree in grun because every character is a node, which I think is not cool, so I will deviate from the RFC. The OWS issue is still present...

0x9k commented 5 years ago

> It is not easy to display the tree in grun because every character is a node, which I think is not cool, so I will deviate from the RFC. The OWS issue is still present...

Thank you for the reply. I forgot to say that a grammar for the whole HTTP protocol should also cover the following points.

1. First, start-line in the RFC also includes status-line, which is used for responses.

2. Second, request-target also includes absolute-form (for requests to a proxy, other than a CONNECT or server-wide OPTIONS request), authority-form (only used for CONNECT requests), and asterisk-form (only used for a server-wide OPTIONS request).

RFC addr

0x9k commented 5 years ago

> The second issue is that OWS can match the empty string, which triggers a warning that must be taken care of.

Yeah, OWS is defined as optional whitespace.

RFC addr1

RFC add2

Marti2203 commented 5 years ago

I am sorry, but I do not have time today to fully implement the HTTP protocol grammar, as I see it is big, but I will fix the current issue. One of the biggest subproblems is that the lexer tokens overlap, so lexer modes may need to be added.
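
As a rough illustration of the overlap, the following Python sketch lists, for a few single characters, every lexer rule from the grammar above that could match them. The RULES table and the candidates helper are hypothetical names for this example. It only looks at one character at a time and leaves out the literals (such as ':' or '/') that appear in parser rules. It relies on the ANTLR behaviour that, when several rules match the same length, the rule defined first is chosen.

```python
# Single-character sets of the lexer rules from the grammar above, in the
# order in which they are defined there.
RULES = [
    ("SP", lambda c: c == " "),
    ("ALPHA", lambda c: "A" <= c <= "Z" or "a" <= c <= "z"),
    ("DIGIT", lambda c: "0" <= c <= "9"),
    ("HEXDIG", lambda c: "0" <= c <= "9" or "A" <= c <= "F"),
    ("CRLF", lambda c: c == "\n"),
    ("OWS", lambda c: c in (" ", "\t")),
    ("HTAB", lambda c: c == "\t"),
    ("VCHAR", lambda c: "\u0021" <= c <= "\u007e"),
    ("OBS_TEXT", lambda c: "\u0080" <= c <= "\u00ff"),
    ("OCTET", lambda c: "\u0000" <= c <= "\u00ff"),
]


def candidates(char):
    """Names of all rules that can match this one character, in rule order."""
    return [name for name, accepts in RULES if accepts(char)]


for char in [" ", "\t", "1", "A", "\n"]:
    names = candidates(char)
    # With equal match lengths, the first rule in the list is the one picked.
    print(repr(char), "->", names, "winner:", names[0])
```

In this sketch a single space is claimed by SP before OWS is considered, which is consistent with the "mismatched input ' ' expecting OWS" error reported at the top of the thread.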

Marti2203 commented 5 years ago

> > The second issue is that OWS can match the empty string, which triggers a warning that must be taken care of.
>
> Yeah, OWS is defined as optional whitespace.
>
> RFC addr1
>
> RFC add2

I understand that, but I do not think empty strings are the way to solve this requirement. Making OWS an optional element that can match any number of whitespace characters is better.

0x9k commented 5 years ago

> The third is the warning 'rule http_message contains an optional block with at least one alternative that can match an empty string'. The culprit is message_body, which is optional and can also contain no octets. I see two possibilities: 1) drop the ? on message_body, or 2) define message_body as OCTET+. That is what I see so far. If everything is okay, I will look at the grammar and make a PR to add it to the repository.

Thank you for pointing out the problems. I'm testing it again.

Marti2203 commented 5 years ago

No problem! :) Glad to help!

teverett commented 5 years ago

@Marti2203 Could you consider submitting a PR to add your HTTP grammar to grammars-v4?

Marti2203 commented 5 years ago

If @0x9k is sure that this is the full grammar in his example, then I will be glad to.

0x9k commented 5 years ago

> If @0x9k is sure that this is the full grammar in his example, then I will be glad to.

Thank you for your work. I'm pretty sure, so let's submit a PR and add this grammar to grammars-v4. ;)

hoshsadiq commented 4 years ago

Hello all, I'm trying to use this HTTP grammar, but I'm struggling with the same issue. I'm using the grammar that was committed to this repo in #1446 together with the test data in the repo, and I get these errors:

line 2:5 extraneous input ' ' expecting {ALPHA, DIGIT, '\n', OWS, VCHAR, OBS_TEXT}
line 2:9 missing '\n' at '.'
line 2:20 missing ':' at '\n'
line 3:0 missing {' ', '\t'} at 'C'
line 3:10 mismatched input ':' expecting {' ', ALPHA, DIGIT, '\n', OWS, '\t', VCHAR, OBS_TEXT}

Any ideas what I'm doing wrong?

teverett commented 4 years ago

Well, it's possible it's a grammar bug. Can you provide your input file?

hoshsadiq commented 4 years ago

The input is:

POST / HTTP/1.1
Host: www.google.com
Connection: close
Content-Length: 4
Ping-From: https://www.google.com.hk/search?safe=strict&ei=gx3qXOKuJ4a8tgX-ypWIDA&q=111&oq=111&gs_l=psy-ab.3..0l10.15337.16373..16590...0.0..0.783.890.0j1j6-1......0....1..gws-wiz.....0.hUqCCrrBI9s
Origin: https://www.google.com.hk
Cache-Control: max-age=0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
Ping-To: https://zh.wikipedia.org/zh-hans/111
Content-Type: text/ping
Accept: */*
X-Client-Data: CIi2yQEIorbJAQjBtskBCKmdygEIqKPKAQjwpMoBCLGnygEI4qjKAQjxqcoBCK+sygEYz6rKAQ==
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
Cookie: NID=184=VqX86iUz6p-H_b2qbuogwjkmsk096DB-48jilOI9Pquzq8WT-aRbKsaH8UnMfvF9uHtuUtHhnJ7Z3F74bcpMNstJ5ADYV_tv09sXOJiwf3Yu-xsZ1E588v2tX6zA-J4K6c1t6t_PQP3jvtbVSdqw_YJqgU1elwvqkjzj0kBbk0I; 1P_JAR=2019-05-26-05; DV=42xzl48Lt5gpEFuauBIUhN0LQjoor5YtIbbBr4x5AQIAAAA

PING

Also, I just realised, the message_body is commented out inside the repo, which I've uncommented, so just for good measure, this is the grammar I've got:

grammar http;

/*
 HTTP-message = start‑line ( header‑field  CRLF ) CRLF message‑body
 */
http_message: start_line (header_field CRLF)* CRLF message_body ;

/*
 start-line = request‑line / status‑line
 */
start_line: request_line;

/*
 request-line = method  SP  request‑target  SP  HTTP‑version  CRLF
 */
request_line: method SP request_target SP http_version CRLF;

/*
 method =
    token
    ; "GET"     → RFC 7231 – Section 4.3.1
    ; "HEAD"    → RFC 7231 – Section 4.3.2
    ; "POST"    → RFC 7231 – Section 4.3.3
    ; "PUT"     → RFC 7231 – Section 4.3.4
    ; "DELETE"  → RFC 7231 – Section 4.3.5
    ; "CONNECT" → RFC 7231 – Section 4.3.6
    ; "OPTIONS" → RFC 7231 – Section 4.3.7
    ; "TRACE"   → RFC 7231 – Section 4.3.8
 */
method:
    'GET'
    | 'HEAD'
    | 'POST'
    | 'PUT'
    | 'DELETE'
    | 'CONNECT'
    | 'OPTIONS'
    | 'TRACE';

/*
 request-target = origin-form / absolute-form / authority-form / asterisk-form
 */
request_target: origin_form;

/*
 origin-form = absolute-path  [ "?"  query ]
 */
origin_form: absolute_path (QuestionMark query)?;

/*
 absolute-path = 1*( "/"  segment )
 */
absolute_path: (Slash segment)+;

/*
 segment = pchar
 */
segment: pchar*;

/*
 query = ( pchar /  "/" /  "?" )
 */
query: (pchar | Slash | QuestionMark)*;

/*
 HTTP-version = HTTP-name '/' DIGIT  "."  DIGIT
 */
http_version: http_name DIGIT Dot DIGIT;

/*
 HTTP-name = %x48.54.54.50 ; "HTTP", case-sensitive
 */
http_name: 'HTTP/';

/*
 header-field = field-name  ":"  OWS  field-value  OWS 
 */
header_field: field_name Colon OWS* field_value OWS*;

/*
 field-name = token
 */
field_name: token;

/*
 token
 */
token: tchar+;
/*
 field-value = ( field-content / obs-fold )
 */
field_value: (field_content | obs_fold)+;

/*
 field-content = field-vchar [ 1*( SP / HTAB )  field-vchar ]
 */
field_content: field_vchar ((SP | HTAB)+ field_vchar)?;

/*
 field-vchar = VCHAR / obs-text
 */
field_vchar: vCHAR | obs_text;
/*
 obs-text = %x80-FF
 */
obs_text: OBS_TEXT;
/*
 obs-fold = CRLF  1*( SP / HTAB ) ; see RFC 7230 – Section 3.2.4
 */
obs_fold: CRLF (SP | HTAB)+;

/*
 message-body = OCTET
 */
message_body: OCTET*;

/*
 SP = %x20 ; space
 */
SP: ' ';
/*
 pchar = unreserved / pct‑encoded / sub‑delims / ":" / "@"
 */
pchar: unreserved | Pct_encoded | sub_delims | Colon | At;

/*
 unreserved = ALPHA /  DIGIT /  "-" /  "." /  "_" /  "~"
 */
unreserved: ALPHA | DIGIT | Minus | Dot | Underscore | Tilde;

/*
 ALPHA = %x41‑5A /  %x61‑7A ; A‑Z / a‑z
 */
ALPHA: [A-Za-z];

/*
 DIGIT = %x30‑39 ; 0-9
 */
DIGIT: [0-9];

/*
 pct-encoded = "%"  HEXDIG  HEXDIG
 */
Pct_encoded: Percent HEXDIG HEXDIG;

/*
 HEXDIG = DIGIT /  "A" /  "B" /  "C" /  "D" /  "E" /  "F"
 */
HEXDIG: DIGIT | 'A' | 'B' | 'C' | 'D' | 'E' | 'F';

/*
 sub-delims = "!" /  "$" /  "&" /  "'" /  "(" /  ")" /  "*" /  "+" /  "," /  ";" /  "="
 */
sub_delims:
    ExclamationMark
    | DollarSign
    | Ampersand
    | SQuote
    | LColumn
    | RColumn
    | Star
    | Plus
    | SemiColon
    | Period
    | Equals;

LColumn     : '(';
RColumn     : ')';
SemiColon   : ';';
Equals      : '=';
Period      : ',';

/*
 CRLF = CR  LF ; Internet standard newline
 */
CRLF: '\n';

/*
 tchar = "!" /  "#" /  "$" /  "%" /  "&" /  "'" /  "*" /  "+" /  "-" /  "." /  "^" /  "_" /  "`" / 
 "|" /  "~" /  DIGIT /  ALPHA
 */
tchar:
      ExclamationMark
    | DollarSign
    | Hashtag
    | Percent
    | Ampersand
    | SQuote
    | Star
    | Plus
    | Minus
    | Dot
    | Caret
    | Underscore
    | BackQuote
    | VBar
    | Tilde
    | DIGIT
    | ALPHA;

Minus           : '-';
Dot             : '.';
Underscore      : '_';
Tilde           : '~';
QuestionMark    : '?';
Slash           : '/';
ExclamationMark : '!';
Colon           : ':';
At              : '@';
DollarSign      : '$';
Hashtag         : '#';
Ampersand       : '&';
Percent         : '%';
SQuote          : '\'';
Star            : '*';
Plus            : '+';
Caret           : '^';
BackQuote       : '`';
VBar            : '|';

/*
 OWS = ( SP / HTAB ) ; optional whitespace
 */
OWS: SP | HTAB;

/*
 HTAB = %x09 ; horizontal tab
 */
HTAB: '\t';

/*
 VCHAR = %x21-7E ; visible (printing) characters
 */
vCHAR: ALPHA | DIGIT | VCHAR;

VCHAR:
      ExclamationMark
    | '"'
    | Hashtag
    | DollarSign
    | Percent
    | Ampersand
    | SQuote
    | LColumn
    | RColumn
    | RColumn
    | Star
    | Plus
    | Period
    | Minus
    | Dot
    | Slash
    | Colon
    | SemiColon
    | '<'
    | Equals
    | '>'
    | QuestionMark
    | At
    | '['
    | '\\'
    | Caret
    | Underscore
    | ']'
    | BackQuote
    | '{'
    | '}'
    | VBar
    | Tilde;

OBS_TEXT: '\u0080' .. '\u00ff' ;

/*
 OCTET = %x00-FF ; 8 bits of data
 */
OCTET: '\u0000' .. '\u00ff' ;

Preview: image

hoshsadiq commented 4 years ago

Is anyone able to help at all?

hoshsadiq commented 4 years ago

So I've narrowed it down to the following:

grammar http;
header_field: 'H:' OWS* 'a' OWS*;
SP: ' ';
OWS: SP | HTAB;
HTAB: '\t';

with the following input:

H: a

which gives me the following error:

line 1:2 extraneous input ' ' expecting {'a', OWS}

And the following parse tree: image

Oddly, when I move the definition of SP to below the definition of OWS, it works fine. Though doing the same in the real http.g4 grammar file makes things worse. Removing the space in the input also works, but this is not ideal.

Any help would be much appreciated.
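
The effect of the rule order can be reproduced with a small Python sketch of the lexer's selection rule: the longest match wins, and on a tie the rule listed first wins. The tokenize function and the rule tables are hypothetical and only model the reduced grammar above. The sketch assumes that the literals 'H:' and 'a' from the parser rule act as implicit tokens that come before the explicit lexer rules.

```python
def tokenize(rules, text):
    """Longest match wins; on a tie, the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, literals in rules:
            for literal in literals:
                # Strict '>' keeps the earlier rule when lengths are equal.
                if text.startswith(literal, pos) and len(literal) > best_len:
                    best_name, best_len = name, len(literal)
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# Implicit literal tokens first, then the lexer rules in the posted order.
original = [("'H:'", ["H:"]), ("'a'", ["a"]),
            ("SP", [" "]), ("OWS", [" ", "\t"]), ("HTAB", ["\t"])]
# Same rules, but with SP moved below OWS.
swapped = [("'H:'", ["H:"]), ("'a'", ["a"]),
           ("OWS", [" ", "\t"]), ("SP", [" "]), ("HTAB", ["\t"])]

print(tokenize(original, "H: a"))  # the space becomes an SP token
print(tokenize(swapped, "H: a"))   # the space becomes an OWS token
```

Since header_field only accepts OWS between 'H:' and 'a', the SP token produced by the original order is reported as extraneous input. In the full grammar, rules such as request_line ask for SP explicitly, which is probably why moving SP below OWS makes things worse there.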

teverett commented 4 years ago

I've been working on this here if anyone wants to jump in.

https://github.com/teverett/grammars-v4/tree/http

hoshsadiq commented 4 years ago

I don't know what I'm doing wrong, but that branch gives me the following errors:

line 1:5 mismatched input '/url?sa=t&source=web&rct=j&url=https://zh.wikipedia.org/zh-hans/111&ved=2ahUKEwjhwLuRtbjiAhUPRK0KHRSjDpwQFjAKegQIAxAB' expecting '/'
line 1:123 mismatched input 'HTTP/1.1' expecting 'HTTP/'
line 2:0 mismatched input 'Host:' expecting {'\n', TCHAR}

I've not changed anything and using the test case in that branch as well.

teverett commented 4 years ago

@hoshsadiq it's a work in progress.

hoshsadiq commented 4 years ago

@teverett I noticed the branch is gone without having been merged. Were you ever able to finish it?

Jacopobracaloni commented 7 months ago

Hello @teverett, first, thank you for the effort in providing a grammar for HTTP protocol requests. It seems that I am finding the same bugs as the other users in this thread, so I was wondering whether you have had the chance to address them or whether the project is on standby. Thank you for your time!