Closed MegaManSec closed 6 months ago
I meet a similar problem. Is there a solutions?
I'm stuck here too (this is so long time callback.XD
based on the problems in my own scenario and reviewing #14, #28, and #36, I think that the current recursive grammar provided by Grammar-Mutator easily increases the backtracking cost during syntax tree parsing during mutation, and eventually gets stuck in the grammar parsing process.
I make a minimal case to reproduce the above problem;
(because json can be regarded as a one-to-one correspondence with g4, for the convenience of description, I will use the g4 syntax of antlr4 for expression below.)
I make g4 grammar test.g4
:
grammar test;
entry: stmt1 '\n' EOF
;
stmt1: 'I' SPACE stmt2 SPACE 'like' SPACE 'C++'
;
stmt2: NODE
| NODE SPACE
| NODE SPACE NODE
| NODE SPACE NODE SPACE
| NODE SPACE NODE SPACE NODE
| NODE SPACE stmt2
;
NODE : 'very'
| 'little'
;
SPACE: ' '
;
and I make input 1024_very.txt
:
I very very ...(*1020)... very very like C++
use antlr4-parse
to parse as follows:
The above is a special case combination that can construct similar stack calls. In fact, we only need to write simple recursion normally, but reproducing the problem may require more
very
after analysis, it can be found that when antlr4 analyzes the above grammar, each round of recursion will check prefix matching, and its consumption of performance increases exponentially with the increase of recursion depth; it may eventually cause three phenomena:
going further, I found a way to mitigate;
based on the above issues, we create simpler test cases, test.json
:
{
"<entry>": [["I ", "<stmt1>", "like C++\n"]],
"<stmt1>": [["<NODE>", "<stmt1>"], []],
"<NODE>": [["very "]]
}
tanslate to test.g4
:
grammar test;
entry: 'I ' stmt1 'like C++\n' EOF
;
stmt1:
| NODE stmt1
;
NODE : 'very '
;
and input 40960_very.txt
:
I very very ...(*40956)... very very like C++
running with antlr4-parse
:
from the perspective of antlr4, we can use the +
syntax to describe test.g4
, and ignore this prefix matching, as follows test.g4
:
grammar test;
entry: 'I ' stmt1 'like C++\n' EOF
;
stmt1:
| (NODE)+
;
NODE : 'very '
;
running again with antlr4-parse
:
so I made a patch to implement the above ideas, please refer to https://github.com/0x7Fancy/Grammar-Mutator/commit/6eae7d14f579b4d3d1196fc1a288c61044a1afe1;
I have only implemented the optimization of head recursion and tail recursion here, which is simple and easy to understand. for intermediate recursion, I think it can be rewritten as head/tail recursion in json
of course, this is just a mitigation measure. When the mutation generates a sufficiently complex syntax tree, it may still cause antlr4 to get stuck in syntax parsing.
@h1994st if you think it's ok, I can send PR
@0x7Fancy Thanks for digging into this issue! Such an old issue lol
The fix looks good to me. Please go ahead and create a PR. I will merge it then.
I don't have the ability to test at the moment, but is there any chance somebody could check whether the HTTP grammar I created works with the patch?
{
"<A>": [["<START_LINE>", "\r\n", "<HEADERS>", "\r\n"]],
"<START_LINE>": [["<METHOD>", " ", "<URI>", " ", "<VERSION>"]],
"<METHOD>": [["NONE"], ["GET"], ["POST"], ["PUT"], ["HEAD"], ["CONNECT"], ["TRACE"], ["OPTIONS"], ["DELETE"], ["LINK"], ["UNLINK"], ["CHECKOUT"], ["CHECKIN"], ["UNCHECKOUT"], ["MKWORKSPACE"], ["VERSION"], ["REPORT"], ["UPDATE"], ["LABEL"], ["MERGE"], ["BASELINE"], ["MKACTIVITY"], ["ORDERPATCH"], ["ACL"], ["MKREDIRECTREF"], ["UPDATEREDIRECTREF"], ["MKCALENDAR"], ["PROPFIND"], ["PROPPATCH"], ["MKCOL"], ["COPY"], ["MOVE"], ["LOCK"], ["UNLOCK"], ["SEARCH"], ["PATCH"], ["BIND"], ["REBIND"], ["UNBIND"], ["PRI"], ["PURGE"], ["OTHER"], ["ENUM"]],
"<URI>": [["<SCHEME>" , ":", "<HIER>", "<QUERY>", "<FRAGMENT>"]],
"<SCHEME>": [["NONE"], ["cache_object"], ["HTTP"], ["FTP"], ["HTTPS"], ["COAP"], ["COAPS"], ["GOPHER"], ["WAIS"], ["CACHE"], ["ICP"], ["HTCP"], ["URN"], ["WHOIS"], ["ICY"], ["TLS"], ["SSL"], ["AUTHORITY"], ["UNKNOWN"], ["MAX"]],
"<HIER>": [["//", "<AUTHORITY>", "<PATH>"]],
"<AUTHORITY>": [["<USERINFO>", "<HOST>"]],
"<PATH>": [["/", "<DIR>"]],
"<DIR>": [[], ["<CHAR>", "/", "<DIR>"]],
"<USERINFO>": [[], ["<CHAR>", ":", "<CHAR>", "@"]],
"<HOST>": [["127.0.0.1:8080"]],
"<QUERY>": [[], ["?", "<CHAR>" , "=", "<CHAR>"]],
"<FRAGMENT>": [[], ["#", "<CHAR>"]],
"<VERSION>": [["HTTP/0.9"], ["HTTP/1.0"], ["HTTP/1.1"], ["HTTP/2.0"], ["ICY"]],
"<HEADERS>": [[], ["<HEADER>", "\r\n", "<HEADERS>"]],
"<HEADER>": [["<HEADER_FIELD>", ": ", "<ANY>"]],
"<HEADER_FIELD>": [["Accept"],["Accept-Charset"],["Accept-Encoding"],["Accept-Language"],["Accept-Ranges"],["Age"],["Allow"],["Alternate-Protocol"],["Authentication-Info"],["Authorization"],["Cache-Control"],["Cache-Status"],["CDN-Loop"],["Connection"],["Content-Base"],["Content-Disposition"],["Content-Encoding"],["Content-Language"],["Content-Length"],["Content-Location"],["Content-MD5"],["Content-Range"],["Content-Type"],["Cookie"],["Cookie2"],["Date"],["ETag"],["Expect"],["Expires"],["Forwarded"],["From"],["Host"],["HTTP2-Settings"],["If-Match"],["If-Modified-Since"],["If-None-Match"],["If-Range"],["If-Unmodified-Since"],["Keep-Alive"],["Key"],["Last-Modified"],["Link"],["Location"],["Max-Forwards"],["Mime-Version"],["Negotiate"],["Origin"],["Pragma"],["Proxy-Authenticate"],["Proxy-Authentication-Info"],["Proxy-Authorization"],["Proxy-Connection"],["Proxy-support"],["Public"],["Range"],["Referer"],["Request-Range"],["Retry-After"],["Server"],["Set-Cookie"],["Set-Cookie2"],["TE"],["Title"],["Trailer"],["Transfer-Encoding"],["Translate"],["Unless-Modified-Since"],["Upgrade"],["User-Agent"],["Vary"],["Via"],["Warning"],["WWW-Authenticate"],["X-Forwarded-For"],["X-Request-URI"],["X-Squid-Error"],["X-Accelerator-Vary"],["X-Next-Services"],["Surrogate-Capability"],["Surrogate-Control"],["Front-End-Https"],["FTP-Command"],["FTP-Arguments"],["FTP-Pre"],["FTP-Status"],["FTP-Reason"],["Other"]],
"<BODY>": [[], ["<CHAR>"]],
"<ANY>": [[], ["<DATE>"], ["<CHAR>"], ["<HOST>"], ["<URI>"], ["<SPECIAL>"]],
"<SPECIAL>": [["identity"], ["gzip"], ["deflate"], ["compress"], ["chunked"], ["bytes"], ["<SPECIAL>", ",", "<SPECIAL>"], [",", "<SPECIAL>"], ["\"", "<SPECIAL>", "\""], ["<SPECIAL>", "=", "<SPECIAL>"]],
"<DATE>": [["Sat, 29 Oct 1994 19:43:31 GMT"], ["Sat, 29 Oct 2021 19:43:31 GMT"]],
"<CHAR>": [[], ["<TCHAR>", "<CHAR>"]],
"<TCHAR>": [["0"], ["1"], ["2"], ["3"], ["4"], ["5"], ["6"], ["7"], ["8"], ["9"], ["a"], ["b"], ["c"], ["d"], ["e"], ["f"], ["g"], ["h"], ["i"], ["j"], ["k"], ["l"], ["m"], ["n"], ["o"], ["p"], ["q"], ["r"], ["s"], ["t"], ["u"], ["v"], ["w"], ["x"], ["y"], ["z"], ["A"], ["B"], ["C"], ["D"], ["E"], ["F"], ["G"], ["H"], ["I"], ["J"], ["K"], ["L"], ["M"], ["N"], ["O"], ["P"], ["Q"], ["R"], ["S"], ["T"], ["U"], ["V"], ["W"], ["X"], ["Y"], ["Z"], ["!"], ["\""], ["#"], ["$"], ["%"], ["&"], ["'"], ["("], [")"], ["*"], ["+"], [","], ["-"], ["."], ["/"], [":"], [";"], ["<"], ["="], [">"], ["?"], ["@"], ["["], ["\\"], ["]"]]
}
emmmmm, wait a moment, this patch seems to still have problems, I will continue to follow up
@MegaManSec hi, I reproduced your original problem("causes a segfault from AFL (maybe only on startup)")
I don’t know much about LL(*) grammar (antlr4). Based on my understanding of LL(1) grammar, I think there are some problems in your grammar file, causing antlr4 to fall into the backtracking parsing process.
based on the json grammar you provided, I found the minimal test case (g4):
grammar test;
entry
: node_DIR 'a' node_CHAR '=' '\n' EOF
;
node_DIR
:
| node_CHAR '/' node_DIR
;
node_CHAR
:
| node_TCHAR node_CHAR
;
node_TCHAR
: '='
;
and the input data poc is:
///=
running with antlr4-parse
:
in the test case, node_DIR
and node_CHAR
both can be ε
,this is not allowed for LL(1) grammars, which may lead to infinite backtracking parsing, optimize it:
grammar test;
entry
: node_DIR 'a' node_CHAR '=' '\n' EOF
;
node_DIR
: node_CHAR
| '/' node_DIR
;
node_CHAR
:
| node_TCHAR node_CHAR
;
node_TCHAR
: '='
;
it works!!!
ok, optimize json syntax like this:
- "<DIR>": [[], ["<CHAR>", "/", "<DIR>"]],
+ "<DIR>": [["<CHAR>"], ["/", "<DIR>"]],
- "<USERINFO>": [[], ["<CHAR>", ":", "<CHAR>", "@"]],
+ "<USERINFO>": [[], ["ui", "<CHAR>", ":", "<CHAR>", "@"]],
- "<BODY>": [[], ["<CHAR>"]],
+ "<BODY>": [["body"], ["<CHAR>"]],
- "<ANY>": [[], ["<DATE>"], ["<CHAR>"], ["<HOST>"], ["<URI>"], ["<SPECIAL>"]],
+ "<ANY>": [["any"], ["<DATE>"], ["<CHAR>"], ["<HOST>"], ["<URI>"], ["<SPECIAL>"]],
it works!!!
PS: please using https://github.com/AFLplusplus/Grammar-Mutator/commit/ff4e5a265daf5d88c4a636fb6a2c22b1d733db09, above patch maybe a mistake, I will continue to follow up
As discussed in #14 the following grammar causes a segfault from AFL (maybe only on startup?): https://paste.pr0.tips/rm
This is due to really long recursion: