VirusTotal / yara-python

The Python interface for YARA
http://virustotal.github.io/yara/
Apache License 2.0

Add a 'which' keyword to match(), which limits when the python callback will be called. #61

Closed wxsBSD closed 7 years ago

wxsBSD commented 7 years ago

Here's a contrived example of the performance increase this gives when using the Python callback. It's pretty extreme in this scenario, but I'm hoping someone with lots of real rules on lots of real files can use this to get a more accurate measurement. Things like metadata and strings will add more time to the callback setup because it has to convert those into Python objects too.

Here's a simple script which shows it in use:

wxs@wxs-mbp yara-python % cat test.py
import os
import yara

def callback(data):
    global count
    count += 1
    return yara.CALLBACK_CONTINUE

data = 'A' * 200

rule = ''
for i in range(2000):
    rule += 'rule rule_%s { condition: false }' % i

rules = yara.compile(source=rule)
scanned = 0
count = 0
matches = 0
for i in range(2000):
    scanned += 1
    matches += len(rules.match(data=data, callback=callback, which=yara.CALLBACK_NON_MATCHES))
print('Scanned: %s' % scanned)
print('Counter: %s' % count)
print('Matches: %s' % matches)

This scans the same 200 bytes, 2000 times, each time with 2000 rules. In this case it will call the callback for non-matches (so 2000 * 2000 times). On my laptop this takes anywhere from 4.7 to 5.7 seconds. This particular run is on the higher end:

wxs@wxs-mbp yara-python % PYTHONPATH=build/lib.macosx-10.12-intel-2.7 /usr/bin/time python -S ./test.py
Scanned: 2000
Counter: 4000000
Matches: 0
        5.74 real         5.43 user         0.16 sys
wxs@mbp yara-python %

If I change the which argument to be yara.CALLBACK_MATCHES we call the callback 0 times and things go much faster:

wxs@wxs-mbp yara-python % PYTHONPATH=build/lib.macosx-10.12-intel-2.7 /usr/bin/time python -S ./test.py
Scanned: 2000
Counter: 0
Matches: 0
        0.16 real         0.14 user         0.01 sys
wxs@wxs-mbp yara-python %

As I said above, this is a highly contrived example, but I think it illustrates the point.
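The callback counts in the two runs above can be reproduced with a small pure-Python model. This is only a sketch, not the yara-python implementation: the flag values, the simulate_scans helper, and the dict handed to the callback are all assumptions made for illustration. It builds the per-rule Python object only when the callback is actually selected, which is where the saving comes from.

```python
import time

# Placeholder flag values, assumed for illustration only.
CALLBACK_MATCHES = 2
CALLBACK_NON_MATCHES = 4


def simulate_scans(num_scans, num_rules, which):
    """Count callback invocations when no rule ever matches (hypothetical helper)."""
    calls = 0

    def callback(data):
        nonlocal calls
        calls += 1

    for _ in range(num_scans):
        for i in range(num_rules):
            matched = False  # mirrors `condition: false` in the test rules
            flag = CALLBACK_MATCHES if matched else CALLBACK_NON_MATCHES
            if which & flag:
                # The per-rule object is only built when the callback is wanted.
                callback({"rule": "rule_%d" % i, "matches": matched})
    return calls


for name, which in (("NON_MATCHES", CALLBACK_NON_MATCHES),
                    ("MATCHES", CALLBACK_MATCHES)):
    start = time.perf_counter()
    calls = simulate_scans(200, 200, which)
    elapsed = time.perf_counter() - start
    print("%s: %d callbacks in %.4f s" % (name, calls, elapsed))
```

With 200 scans of 200 rules the model makes 200 * 200 callbacks for non-matches and none for matches, the same shape as the 2000 * 2000 case in the script. The timings it prints depend on the machine and say nothing about real YARA scan cost.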

wxsBSD commented 7 years ago

If you think this is going to be merged, please let me know and I'll update the documentation!

wxsBSD commented 7 years ago

I think adding a yara.MODULES_CALLBACK option is great. It unifies things nicely and now that things are a bitmask we can still do things like which_callbacks=yara.MODULES_CALLBACK | yara.CALLBACK_MATCHES - though I think if we are going to unify things we should make the names better. :) Let me know if you want me to implement that!

Also, sorry for the delay on the response. I forgot about this PR. :(

wxsBSD commented 7 years ago

Is there something wrong with this? I see it got reverted. I’m happy to fix whatever is going on.

plusvic commented 7 years ago

It breaks the test cases. The problem is that the match method should return a list of matches independently of the value of which_callbacks (this value should only affect whether or not callback is called), but the list wasn't being populated because yara_callback was exiting early. I didn't have time to get into the details, so your fix is welcome.
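The difference between the two behaviours can be sketched in pure Python. This is a model of the logic described here, not the actual C code in yara_callback: match_buggy, match_fixed, the flag values, and the (name, matched) tuples are hypothetical names chosen for illustration.

```python
# Placeholder flag values, assumed for illustration only.
CALLBACK_MATCHES = 2
CALLBACK_NON_MATCHES = 4


def match_buggy(results, callback, which_callbacks):
    """Early exit skips the list update as well as the callback."""
    matches = []
    for name, matched in results:
        flag = CALLBACK_MATCHES if matched else CALLBACK_NON_MATCHES
        if not (which_callbacks & flag):
            continue  # bug: a matching rule is never recorded
        if matched:
            matches.append(name)
        callback(name, matched)
    return matches


def match_fixed(results, callback, which_callbacks):
    """The list is always populated; the mask only gates the callback."""
    matches = []
    for name, matched in results:
        if matched:
            matches.append(name)
        flag = CALLBACK_MATCHES if matched else CALLBACK_NON_MATCHES
        if which_callbacks & flag:
            callback(name, matched)
    return matches


results = [("rule_a", True), ("rule_b", False)]
seen = []


def record(name, matched):
    seen.append((name, matched))


print(match_buggy(results, record, CALLBACK_NON_MATCHES))  # []
print(match_fixed(results, record, CALLBACK_NON_MATCHES))  # ['rule_a']
print(seen)  # [('rule_b', False), ('rule_b', False)]
```

In both versions the callback only sees the non-matching rule, but only the second version still reports rule_a in the returned list.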

Another thing I noticed is that CALLBACK_ALL is 1, while CALLBACK_MATCHES and CALLBACK_NON_MATCHES are 2 and 4 respectively. CALLBACK_ALL should be a bitwise OR of all the existing options.
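A short sketch of why the value matters, using the numbers quoted in this comment. The constants are defined locally rather than imported from yara, and the selects helper and the BROKEN_CALLBACK_ALL name are hypothetical.

```python
# Values as reported in this comment; defined locally, not imported from yara.
CALLBACK_MATCHES = 2
CALLBACK_NON_MATCHES = 4

BROKEN_CALLBACK_ALL = 1  # shares no bits with either option
CALLBACK_ALL = CALLBACK_MATCHES | CALLBACK_NON_MATCHES  # 2 | 4 == 6


def selects(mask, flag):
    """Return True if `flag` is enabled in the bitmask `mask` (hypothetical helper)."""
    return bool(mask & flag)


print(selects(BROKEN_CALLBACK_ALL, CALLBACK_MATCHES))      # False
print(selects(BROKEN_CALLBACK_ALL, CALLBACK_NON_MATCHES))  # False
print(selects(CALLBACK_ALL, CALLBACK_MATCHES))             # True
print(selects(CALLBACK_ALL, CALLBACK_NON_MATCHES))         # True
```

With the mask tested bit by bit, a CALLBACK_ALL of 1 would select neither kind of result, while the bitwise OR selects both and keeps working if more options are added to the OR later.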

wxsBSD commented 7 years ago

Thanks for the feedback. I’ll get it fixed up.