digitalwave / msc_pyparser

A ModSecurity rules parser
GNU General Public License v3.0
27 stars 10 forks source link

parsing error in crs_read.py #9

Closed MirkoDziadzka closed 4 years ago

MirkoDziadzka commented 4 years ago

how to reproduce

SecRule ARGS "@contains :" "id:42,phase:2,block,t:none"

expected result

actual result

$ python3 ./crs_read.py test .
Parsing CRS config: test/test.conf
Parser error: syntax error in line 0 at pos 25, column 25
SecRule ARGS "@contains :" "id:42,phase:2,block,t:none"
~~~~~~~~~~~~~~~~~~~~~~~~~^

remarks

On a bigger modsec ruleset, I see around 1 parsing errors per 1000 rules. Here are simplified examples:

Parser error: syntax error in line 0 at pos 28, column 28
SecRule ARGS "@rx ^\\\\" "id:42,phase:2,block,t:none"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
Parser error: syntax error in line 0 at pos 19, column 19
SecRule ARGS "@rx :\/" "id:42,phase:2,block,t:none"
~~~~~~~~~~~~~~~~~~~^
Parser error: syntax error in line 0 at pos 19, column 19
SecRule ARGS "@pm |" "id:42,phase:2,block,t:none"
~~~~~~~~~~~~~~~~~~~^
Parser error: syntax error in line 0 at pos 21, column 21
SecRule ARGS:'/^(foo|bar)$/' "@rx 42" "id:42,phase:2,block,t:none"
~~~~~~~~~~~~~~~~~~~~~^

And 2 errors which where new to me (that this is allowed in modsec)

Lexer error: illegal token found in line 0 at pos 12, column 12
SecRule ARGS,REQUEST_HEADERS "@rx 42" "id:42,phase:2,block,t:none"
~~~~~~~~~~~~^
Lexer error: illegal token found in line 0 at pos 13, column 13
SecRule ARGS foobar "id:42,phase:2,block,t:none"
~~~~~~~~~~~~~^

Detail

$ pip freeze
aniso8601==1.2.1
appdirs==1.4.3
APScheduler==3.5.0
babelfish==0.5.5
beautifulsoup4==4.6.0
bs4==0.0.1
certifi==2017.4.17
chardet==3.0.3
cheroot==8.2.1
CherryPy==18.4.0
click==6.7
colorclass==2.2.0
cssselect==1.1.0
fake-useragent==0.1.11
feedparser==5.2.1
Flask==1.0.2
Flask-Compress==1.4.0
Flask-Cors==3.0.2
Flask-Login==0.4.0
Flask-RESTful==0.3.6
flask-restplus==0.10.1
FlexGet==3.1.6
guessit==3.1.0
html5lib==0.999999999
idna==2.5
itsdangerous==0.24
jaraco.functools==2.0
Jinja2==2.10.1
jsonschema==2.6.0
loguru==0.4.0
lxml==4.4.2
MarkupSafe==1.0
more-itertools==7.2.0
msc-pyparser==0.2
netaddr==0.7.19
parse==1.14.0
plumbum==1.6.3
ply==3.11
portend==2.6
progressbar==2.5
pyee==6.0.0
pynzb==0.1.0
pyparsing==2.2.0
pyppeteer==0.0.25
pyquery==1.4.1
PyRSS2Gen==1.1
python-dateutil==2.6.1
pytz==2017.2
PyYAML==5.1.2
rebulk==2.0.0
requests==2.21.0
requests-html==0.10.0
rpyc==4.0.1
six==1.13.0
SQLAlchemy==1.3.11
tempora==1.8
terminaltables==3.1.0
tqdm==4.41.1
tzlocal==1.4
urllib3==1.24.2
w3lib==1.21.0
webencodings==0.5.1
websockets==8.1
Werkzeug==0.15.6
xmltodict==0.12.0
zc.lockfile==2.0
zxcvbn-python==4.4.15
airween commented 4 years ago

Hi @MirkoDziadzka,

many thanks for your detailed feedback!

After a quick review, I can confirm your remarks, and I'll try to fix them soon.

Anyway, please let me show you some tip, how can you debug the rules. The package contains (and the documentation mentions it), there are two "helper" script, which show its own results. If you run those tools with arguments (eg. you create a file with only one line, which contains the affected rule), you can see the parsed symbols (test_lexer.py), or the flow of building of AST (test_parser.py), just add a 2nd argument with name debug, eg:

./examples/test_lexer.py test.conf debug

In case of these rules:

SecRule ARGS "@contains :" "id:42,phase:2,block,t:none"
SecRule ARGS "@rx :\/" "id:42,phase:2,block,t:none"
SecRule ARGS "@pm |" "id:42,phase:2,block,t:none"

the problem is that the special symbols in operator argument evaluated as its meaning (eg. COLON or PIPE), not a regular character. I think this is a valid request to fix it. :).

In this case:

SecRule ARGS:'/^(foo|bar)$/' "@rx 42" "id:42,phase:2,block,t:none"

the | is a separator, similar than above - it's a PIPE, not a regular character, and the next expected token should be an another variable. Also a valid remark.

See this:

SecRule ARGS,REQUEST_HEADERS "@rx 42" "id:42,phase:2,block,t:none"

As I know, the separator character in SecLang or ModSecurity Language is the |, this is the first case where I met with this form. I changed the , by a |, and the script parsed the rule. Anyway, I'll try to modify the tool for this expression. After a quick view, I think it's easy to extend the parser that it can capable to parse successfully this form.

Finally, this rule

SecRule ARGS foobar "id:42,phase:2,block,t:none"

is a "classic" form :). This parser follows the CRS syntax, therefore the operator and its arguments need to quoted by a double quote mark. I changed the syntax with this:

SecRule ARGS "foobar" "id:42,phase:2,block,t:none"

and the parse also could be produced a YAML result. I think this will be the last on the list of fixes :).

Sorry for the inconvenience, I'll back soon with fixes.

MirkoDziadzka commented 4 years ago

Hi @airween

Thanks for the feedback and the hints for debugging. I know that these scripts where mainly written from CRS, but I think the would be useful for generic modsec rules. I'm currently analyse a 15'000 rules database and it really helps to see the data in JSON ;-)

airween commented 4 years ago

Hi @MirkoDziadzka,

sorry for the late reply.

In this branch, you can check the fixes for items of your list: 1, 3, 4, 6, 7. The 2nd and the 5th issues are left, that's not so trivials :).

Please check that if you can, and let me know the results - thank you.

airween commented 4 years ago

The 2nd example also fixed:

SecRule ARGS "@rx ^\\\\" "id:42,phase:2,block,t:none"

Some notes for the fixes: please note, that when you write back the parsed data to SecLang format (if you want), there are two cases where the result will mismatch from the origin content. The parsed structure doesn't contains the separator of SecRule arguments, it assumes that you use |. If you write back, then the | will be used. So, from this:

SecRule ARGS,REQUEST_HEADERS "@rx 42" "id:42,phase:2,block,t:none"

will write back like this:

SecRule ARGS|REQUEST_HEADERS "@rx 42" "id:42,phase:2,block,t:none"

The another case it the operatorless rule:

SecRule ARGS foobar "id:42,phase:2,block,t:none"

Here the parsed structure also doesn't contain any "metadata", eg. the quoted state of source, it assumes that the operator and its argument were quoted with a ". When you write back, then the output will contains the quoted argument (but as you didn't used explicit @rx operator, you will not see that). In this case, you will get:

SecRule ARGS "foobar" "id:42,phase:2,block,t:none"
mirkodziadzka-avi commented 4 years ago

Hey @airween

Thanks for the fix, looks good so far.

However, I found a new problem.

In summary: The following is working fine

SecRule ARGS "@rx ^\\\\" "id:42,phase:2,block,t:none"
SecRule ARGS "@contains :" "id:42,phase:2,block,t:none"
SecRule ARGS "@pm |" "id:42,phase:2,block,t:none"
SecRule ARGS "@rx :\/" "id:42,phase:2,block,t:none"
SecRule ARGS,REQUEST_HEADERS "@rx 42" "id:42,phase:2,block,t:none"
SecRule ARGS foobar "id:42,phase:2,block,t:none"
SecRule REQUEST_URI|REQUEST_HEADERS "@rx 42" "id:42,phase:2,block,t:none"

The following is not. The first of the 2 is new

SecRule REQUEST_URI,REQUEST_HEADERS "@rx 42" "id:42,phase:2,block,t:none"
SecRule ARGS:'/^(foo|bar)$/' "@rx 42" "id:42,phase:2,block,t:none"

I currently do not the the write back functionality, and I think normalisation is a good thing.

airween commented 4 years ago

Hi @MirkoDziadzka,

thanks again, the 47f939 fixes the new COMMA related bug:

SecRule REQUEST_URI,REQUEST_HEADERS "@rx 42" "id:42,phase:2,block,t:none"

The other will fixed soon. Check this if you can - thanks.

airween commented 4 years ago

Hi @MirkoDziadzka,

now the issue_09 branch contains all fixes of your mentioned examples.

Please update your local copy, and run the tests again - let me know the results :).

mirkodziadzka-avi commented 4 years ago

@airween, I added some comments to https://github.com/digitalwave/msc_pyparser/commit/a66a4525c18e5cd91d22f671b1a3f1716f77a7b9

With the current issue_09 and the following change, I can parse all of my ~ 15'000 rules.

diff --git a/msc_pyparser.py b/msc_pyparser.py
index 32c5404..e2411d0 100644
--- a/msc_pyparser.py
+++ b/msc_pyparser.py
@@ -140,7 +140,7 @@ class MSCLexer(object):
         return t

     def t_secrulesecvararg_SECRULE_VARIABLE_ARG(self, t):
-        r'([^ \s\t\n\|\'|]{1,})|(\'.*\')'
+        r"([^\s|']+)|('(\.|[^'\\])*')"
         t.lexer.pop_state()
         return t
airween commented 4 years ago

Hi @MirkoDziadzka,

I merged all commits into the master branch.

Referring to your note, that everything works as well, I close this issue.

Many thanks for your report and help!

airween commented 4 years ago

Just one note for testing: you wrote that you have 15000 parsed rule. If you want to be sure that your result is correct, please convert back your json/yaml structure to SecLang format, and make a diff. Here is how I use that:

mkdir testexp/
mkdir testimp/
./crs_read.py /home/airween/src/owasp-modsecurity-crs/rules testexp
./crs_write.py testexp testimp
for f in `ls -1 /home/airween/src/owasp-modsecurity-crs/rules/*.conf`; do b=`basename ${f}`; echo "diff ${f} testimp/${b}"; diff ${f} testimp/${b}; done

As I wrote above, the crs_write.py follow the strict format of CRS, I assume you will have some difference between the original source and generated output - but you can compare them.

mirkodziadzka-avi commented 4 years ago

Hi, thanks for the fix.

2 observations

$ time ./crs_write.py yyy zzz
Parsing CRS structure: yyy/Combined.yaml
Exception catched -  (<class 'KeyError'>, KeyError('oplineno'), <traceback object at 0x10375b230>)

real    0m20.290s
user    0m19.916s
sys 0m0.261s

with full stack trace

Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pdb.py", line 1699, in main
    pdb._runscript(mainpyfile)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pdb.py", line 1568, in _runscript
    self.run(statement)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/bdb.py", line 585, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/Users/mirko/Trustwave/crs_write.py", line 3, in <module>
    import sys
  File "/Users/mirko/.local/share/virtualenvs/mirko-jYbmoHav/lib/python3.7/site-packages/msc_pyparser-0.2.3-py3.7.egg/msc_pyparser.py", line 721, in generate
    self.make_secrule(i)
  File "/Users/mirko/.local/share/virtualenvs/mirko-jYbmoHav/lib/python3.7/site-packages/msc_pyparser-0.2.3-py3.7.egg/msc_pyparser.py", line 699, in make_secrule
    if r['lineno'] < r['oplineno']:
KeyError: 'oplineno'
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /Users/mirko/.local/share/virtualenvs/mirko-jYbmoHav/lib/python3.7/site-packages/msc_pyparser-0.2.3-py3.7.egg/msc_pyparser.py(699)make_secrule()
-> if r['lineno'] < r['oplineno']:
(Pdb) p r
{'actions': [{'act_arg': 'none', 'act_name': 't', 'act_quote': 'no_quote', 'lineno': 28305}], 'chained': False, 'lineno': 28305, 'operator': '', 'operator_argument': '.*', 'type': 'SecRule', 'variables': ['ARGS:c']}

I will extract a simplified example and will open another issue

airween commented 4 years ago

Hi,

just a quick note for the parsing speed,

* parsing feels slower as before, but I do not have numbers to prove this. I now see 150 seconds for 15'000 rules, so 100 rules/second.

I have a bit worse value:

$ time ./crs_read.py /home/airween/src/owasp-modsecurity-crs/rules/ testexp
...

real    0m8.658s
user    0m8.612s
sys 0m0.012s

CRS contains about 600 SecRule objects (plus chained rules, they are about 170), so the total value is 770. It's about 90 rules/second, but my test environment runs on a very old machine (4x X3440 CPU with SATA2 disks). May be we can optimize the code, but I'm not worried about the speed, I think this isn't an operation which performed too often. But you're right, the 2.5 minutes is a bit painful :).

Note, that actually I'm working on an another project: I'ld like to implement this parser in C, I'm curious how fast will be that (oh, sure, Python isn't a performance champion :))

mirkodziadzka-avi commented 4 years ago

Ok, I have a faster machine here. When I talk about 15'000 rules, I mean 15'000 chains. At the end, this are more than 30'000 SecRule commands.

Btw, It is not the parser which is this slow, it is yaml.

When I change crs_read from yaml to json, I get 20 seconds instead of 200.

Therefore my impression that it got slower, I worked with JSON before and switched to yaml to test the crs_write part.