Closed wxsBSD closed 7 years ago
Here's a contrived example of the performance increase when using the python callback. It's pretty extreme in this scenario, but I'm hoping someone with lots of real rules on lots of real files can use this to get a more accurate measurement. Things like metadata and strings will add more time to the callback setup because it has to convert those into python objects too.
Here's a simple script which shows it in use:
wxs@wxs-mbp yara-python % cat test.py
import os
import yara
def callback(data):
    global count
    count += 1
    return yara.CALLBACK_CONTINUE
data = 'A' * 200
rule = ''
for i in range(2000):
    rule += 'rule rule_%s { condition: false }' % i
rules = yara.compile(source=rule)
scanned = 0
count = 0
matches = 0
for i in range(2000):
    scanned += 1
    matches += len(rules.match(data=data, callback=callback, which=yara.CALLBACK_NON_MATCHES))
print('Scanned: %s' % scanned)
print('Counter: %s' % count)
print('Matches: %s' % matches)
This scans the same 200 bytes, 2000 times, each time with 2000 rules. In this case it will call the callback for non-matches (so 2000 * 2000 times). On my laptop this takes anywhere from 4.7 to 5.7 seconds. This particular run is on the higher end:
wxs@wxs-mbp yara-python % PYTHONPATH=build/lib.macosx-10.12-intel-2.7 /usr/bin/time python -S ./test.py
Scanned: 2000
Counter: 4000000
Matches: 0
5.74 real 5.43 user 0.16 sys
wxs@mbp yara-python %
If I change the which argument to be yara.CALLBACK_MATCHES we call the callback 0 times and things go much faster:
wxs@wxs-mbp yara-python % PYTHONPATH=build/lib.macosx-10.12-intel-2.7 /usr/bin/time python -S ./test.py
Scanned: 2000
Counter: 0
Matches: 0
0.16 real 0.14 user 0.01 sys
wxs@wxs-mbp yara-python %
As I said above, this is a highly contrived example, but I think it illustrates the point.
If you think this is going to be merged, please let me know and I'll update the documentation!
I think adding a yara.MODULES_CALLBACK option is great. It unifies things nicely, and now that things are a bitmask we can still do things like which_callbacks=yara.MODULES_CALLBACK | yara.CALLBACK_MATCHES, though I think if we are going to unify things we should make the names better. :) Let me know if you want me to implement that!
Also, sorry for the delay on the response. I forgot about this PR. :(
Is there something wrong with this? I see it got reverted. I’m happy to fix whatever is going on.
It breaks the test cases. The problem is that the match method should return a list of matches independently of the value of which_callbacks (this value should only affect whether or not callback is called), but the list wasn't being populated because yara_callback was exiting early. I didn't have time to get into the details, so your fix is welcome.
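To make the expected behavior concrete, here is a toy model in plain Python. It is only a sketch and not yara-python's C code: the dispatch function, the flag values, and the (rule_name, matched) pairs are all hypothetical stand-ins, and the callback's return value is ignored for simplicity. The point it illustrates is that the match list gets filled in before the mask is consulted, so the mask only controls whether the callback runs.

```python
# Hypothetical flag values, chosen for illustration only.
CALLBACK_MATCHES = 0x01
CALLBACK_NON_MATCHES = 0x02


def dispatch(results, callback=None,
             which_callbacks=CALLBACK_MATCHES | CALLBACK_NON_MATCHES):
    """Toy scan loop over (rule_name, matched) pairs.

    Returns the names of all matching rules, regardless of which_callbacks.
    """
    matches = []
    for name, matched in results:
        # Record the match first, so the returned list never depends on the mask.
        if matched:
            matches.append(name)

        # Only now decide whether the callback is wanted for this rule.
        wanted = CALLBACK_MATCHES if matched else CALLBACK_NON_MATCHES
        if callback is not None and which_callbacks & wanted:
            callback({"rule": name, "matches": matched})
    return matches


results = [("rule_a", True), ("rule_b", False), ("rule_c", True)]
seen = []  # collects every dict handed to the callback
only_non_matches = dispatch(results, seen.append, CALLBACK_NON_MATCHES)
```

With the mask set to non-matches only, the callback sees just rule_b, while the returned list still contains rule_a and rule_c.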
Another thing that I noticed is that CALLBACK_ALL is 1, while CALLBACK_MATCHES and CALLBACK_NON_MATCHES are 2 and 4 respectively. CALLBACK_ALL should be a bitwise OR of all the existing options.
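A short sketch of why this matters, using stand-in constants rather than the real module. The values 1, 2 and 4 are the ones quoted above; defining CALLBACK_ALL as the OR of the other two is the suggestion being made here, not necessarily what ended up in the library, and the selects helper is hypothetical.

```python
# Values as reported in this thread (stand-ins, not imported from yara).
CALLBACK_MATCHES = 2
CALLBACK_NON_MATCHES = 4

# Reported state: a CALLBACK_ALL of 1 shares no bits with either flag.
REPORTED_CALLBACK_ALL = 1

# Suggested state: CALLBACK_ALL is the bitwise OR of every option.
CALLBACK_ALL = CALLBACK_MATCHES | CALLBACK_NON_MATCHES


def selects(mask, flag):
    """Return True if the given flag is enabled in mask."""
    return bool(mask & flag)
```

A bitmask test against the reported value never selects either kind of callback, whereas the OR-ed value selects both, so code that checks flags with & keeps working as new options are added.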
Thanks for the feedback. I’ll get it fixed up.
…ck will be called.