decalage2 / ViperMonkey

A VBA parser and emulation engine to analyze malicious macros.

Use a better parser #19

Open CapacitorSet opened 7 years ago

CapacitorSet commented 7 years ago

I think that a significant issue with ViperMonkey is that its parser doesn't support many VB features (see #2, #6, #12, #16, and I just ran into an issue myself). Rather than writing a parser "by hand", I suggest using an existing grammar, e.g. this ANTLR4 grammar for VB6, and working from there: the parser will simply accept all valid constructs, and it will be up to ViperMonkey to implement them.

decalage2 commented 7 years ago

You are right. When I started ViperMonkey, I looked at ANTLR and the GOLD parser. I tried ANTLR with a VB6 grammar, but ended up with 1 MB of auto-generated Python code... Then I switched to pyparsing, in order to start simple and implement the grammar piece by piece instead. It worked very well for small VBA macros, but fails on more complex ones.

I started implementing an alternate parser, still using pyparsing, but working line by line. It's more robust, and you can already try it using the "-a" option with the latest vmonkey code.
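To make the line-by-line idea concrete, here is a minimal Python 3 sketch. It is not how the "-a" option is actually implemented: the helper names (`logical_lines`, `parse_line_by_line`, `toy_parse_assignment`) are hypothetical, and the toy statement parser stands in for a real pyparsing grammar. It only shows why parsing each logical line separately is more robust: one unsupported statement is recorded as a failure instead of aborting the whole parse.

```python
def logical_lines(source):
    """Join physical lines ending in ' _' (VBA line continuation) into logical lines."""
    buffer = ""
    for line in source.splitlines():
        stripped = line.rstrip()
        if stripped.endswith(" _"):
            buffer += stripped[:-1]  # drop the underscore, keep the space before it
        else:
            yield buffer + line
            buffer = ""
    if buffer:
        yield buffer


def parse_line_by_line(source, parse_statement):
    """Parse each logical line on its own; collect failures instead of aborting."""
    parsed, failed = [], []
    for line in logical_lines(source):
        if not line.strip():
            continue  # skip blank lines
        try:
            parsed.append(parse_statement(line))
        except ValueError:
            failed.append(line)
    return parsed, failed


def toy_parse_assignment(line):
    """Hypothetical stand-in for a real statement parser: accepts 'name = value' only."""
    if "=" not in line:
        raise ValueError(line)
    name, value = line.split("=", 1)
    return name.strip(), value.strip()


parsed, failed = parse_line_by_line("x = 1 + _\n    2\n???\ny = 3", toy_parse_assignment)
```

With a real grammar, the `except` clause would catch the parser's own exception type rather than `ValueError`.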

It would be great to try ANTLR, I am just lacking time to learn a new framework for now. I know some people already did something similar. Are you volunteering to help? ;-)

CapacitorSet commented 7 years ago

Sure! It actually seems rather easy, since the grammar is already written down; I'll try and make something with it.

decalage2 commented 7 years ago

@retailcoder has published a VBA grammar for ANTLR, based on the VB6 grammar with some VBA-specific features:

https://github.com/antlr/grammars-v4/tree/master/vba

CapacitorSet commented 7 years ago

I had some trouble dealing with the transformation of ANTLR grammar elements into the ViperMonkey representation: I can successfully find the functions and subs, but I can't understand how to transform them into Sub and Function classes. Can you help?

decalage2 commented 7 years ago

Sure, can you push your code into your fork of ViperMonkey, so that I can have a look? Or send it by e-mail?

CapacitorSet commented 7 years ago

Sure! I just pushed the code to my fork.

The build process is rather weird, because I set it up "on the spot"; surely it can be cleaned up. For now, you must download the antlr4.6 jar and run pip2 install antlr4-python2-runtime to get antlr4 support. Then run:

cd vipermonkey/core
java -Xmx500M -cp <path to antlr4 jar> org.antlr.v4.Tool -Dlanguage=Python2 VisualBasic6.g4

This generates the parser file (835 KB).

Running the analysis is awkward, because I couldn't figure out how to get the antlr4 lexer to accept a string, so it takes input from stdin. I ran it like this:

cat ./sample | python vipermonkey/vmonkey.py -l debug sample

(where sample contains plaintext VBA code)

decalage2 commented 7 years ago

Thanks a lot. On my side, I started using the grammar vba.g4 instead of VisualBasic6.g4, and it works quite well. I had to fix a few things to make it work with Python; I'll publish it soon.

For now I am using PyCharm with its ANTLR plugin for g4 files. It can parse a VBA file or just text and show the parsed tree in the GUI without writing any code, which is very convenient for debugging the g4 grammar. I have not yet started to use the grammar from Python in vmonkey, will do it soon using your code as inspiration. Stay tuned! :-)

decalage2 commented 7 years ago

Well, it looks like I'll have to modify quite a lot of things in the grammar vba.g4 to make it usable with ViperMonkey. It will take time and effort.

For example, the item "literal" can be a string, a boolean, an integer, etc. (see https://github.com/antlr/grammars-v4/blob/master/vba/vba.g4#L675). There is no way to tell which literal type it is from the parse tree, so I would have to re-parse the resulting string to figure out how to convert it to a Python object. A better approach is to define each literal type with a separate parser rule.
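To illustrate the re-parsing step described here, a minimal Python 3 sketch follows. It is not ViperMonkey code: the function name `vba_literal_to_python` is hypothetical, and it handles only strings, booleans, hex/octal/decimal integers and simple floats, ignoring date literals and type-hint suffixes.

```python
import re


def vba_literal_to_python(text):
    """Guess the type of a raw VBA literal from its text and convert it (hypothetical helper)."""
    s = text.strip()
    # String literal: delimited by double quotes, with "" as an escaped quote.
    if len(s) >= 2 and s.startswith('"') and s.endswith('"'):
        return s[1:-1].replace('""', '"')
    lowered = s.lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    # Hexadecimal (&H...) and octal (&O...) integer literals.
    if re.fullmatch(r"&h[0-9a-f]+", lowered):
        return int(lowered[2:], 16)
    if re.fullmatch(r"&o[0-7]+", lowered):
        return int(lowered[2:], 8)
    # Plain decimal integer.
    if re.fullmatch(r"\d+", s):
        return int(s)
    # Simple floating-point forms: 1.5, .5, 1., 1e3, 1.5e-3
    if re.fullmatch(r"\d*\.\d+(e[+-]?\d+)?|\d+\.\d*(e[+-]?\d+)?|\d+e[+-]?\d+", lowered):
        return float(lowered)
    raise ValueError("unsupported literal: %r" % text)
```

Every branch repeats a decision the lexer has already made once, which is the duplication that a separate parser rule per literal type would avoid.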

retailcoder commented 7 years ago

FYI, Rubberduck now uses a much more robust grammar, but there are limitations that we deal with using additional grammars (and some C# code, too).

I'm pretty sure Rubberduck can parse just about anything you throw at it, as long as it's valid, compilable VBA (or VB6) code.

The reason I haven't PR'd it back into the ANTLR grammars repo is because, well, it's more of a complete rewrite/reimplementation than a mere refactoring or bugfix... and some Rubberduck-specifics have been embedded into it, too (e.g. the @Annotation comment syntax).

Literals, for example, are much more robust now:

// 5.6.5 Literal Expressions
literalExpression :
    numberLiteral
    | DATELITERAL
    | STRINGLITERAL
    | literalIdentifier typeHint?
;

numberLiteral : HEXLITERAL | OCTLITERAL | FLOATLITERAL | INTEGERLITERAL;

literalIdentifier : booleanLiteralIdentifier | objectLiteralIdentifier | variantLiteralIdentifier;
booleanLiteralIdentifier : TRUE | FALSE;
objectLiteralIdentifier : NOTHING;
variantLiteralIdentifier : EMPTY | NULL;

The grammar doesn't handle everything: line numbers blow it up, for example. But Rubberduck strips line numbers from the input (replaces with whitespace actually) before feeding it to the parser. The parse trees are quite populous, too, because WS tokens aren't ignored; OTOH this allows us to handle every possible form of legal line continuations.
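The line-number preprocessing mentioned here could look roughly like the following Python 3 sketch. It is not Rubberduck's code (which is C#): the function name `blank_line_numbers` is hypothetical, and treating a physical line that follows a ` _` continuation as never carrying a line number is an assumption of this sketch. Replacing digits with spaces of the same width, rather than deleting them, keeps every column position unchanged for error reporting.

```python
import re

# Digits at the start of a line (after optional indentation),
# followed by whitespace or the end of the line.
_LINE_NUMBER = re.compile(r"^([ \t]*)(\d+)(?=\s|$)")


def blank_line_numbers(source):
    """Replace leading VBA line numbers with the same number of spaces."""
    out = []
    continued = False  # True when the previous physical line ended with " _"
    for line in source.split("\n"):
        if not continued:
            line = _LINE_NUMBER.sub(
                lambda m: m.group(1) + " " * len(m.group(2)), line
            )
        continued = line.rstrip().endswith(" _")
        out.append(line)
    return "\n".join(out)
```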

decalage2 commented 7 years ago

I ran some tests with real-life malicious macros, and the ANTLR parser with the Python runtime and the vba.g4 grammar from the antlr repo is extremely slow, much slower than the current vmonkey parser using pyparsing.

So unless we find a better solution with a different grammar, or a way to run a faster ANTLR runtime (C, Java) from Python, I am not sure it is worth investing much time in this parser.

decalage2 commented 7 years ago

Many people have reported performance issues with the ANTLR Python runtime. Some issues can be addressed by optimizing the grammar to avoid specific constructs that are known to slow down the parser. But this looks like significant work, with no guarantee of good results in the end. A few hints:

I will not have time to do it myself, but if someone would like to look at how to improve the performance of ANTLR parsing with the VBA grammar (or try another parsing engine), please contact me.

decalage2 commented 7 years ago

If you want to test it yourself, I just pushed the vba.g4 grammar that I slightly fixed to run with Python, and the small python script that runs the parser on a VBA text file: see commit 6f8a6c537bdf3fad37a9857eda54eab8cdf8a9d0

You need to run makevba.bat to build the parser on Windows, or the equivalent command on Linux/Mac.

inshua commented 6 years ago

https://github.com/inshua/vba-interpreter/blob/master/src/Vba.g4

I have implemented almost all VBA language features, so this grammar file may be helpful.