Closed ruppde closed 3 years ago
slowdown reason is the following block in yara_callback. yara-python still matches correctly after /**/ing it, but 2-4 times faster. whatever functionality needs this data is better gets if for each hit and not for each file scanned.
yr_rule_tags_foreach(rule, tag)
{
object = PY_STRING(tag);
PyList_Append(tag_list, object);
Py_DECREF(object);
}
yr_rule_metas_foreach(rule, meta)
{
if (meta->type == META_TYPE_INTEGER)
object = Py_BuildValue("i", meta->integer);
else if (meta->type == META_TYPE_BOOLEAN)
object = PyBool_FromLong((long) meta->integer);
else
object = PY_STRING(meta->string);
PyDict_SetItemString(meta_list, meta->identifier, object);
Py_DECREF(object);
}
yr_rule_strings_foreach(rule, string)
{
yr_string_matches_foreach(context, string, m)
{
object = PyBytes_FromStringAndSize((char*) m->data, m->data_length);
tuple = Py_BuildValue(
"(L,s,O)",
m->base + m->offset,
string->identifier,
object);
PyList_Append(string_list, tuple);
Py_DECREF(object);
Py_DECREF(tuple);
}
}
Here's a fix which passes all tests but isn't beautiful (I'm not a c guy ;)
Move the code block from the message above under
if (message == CALLBACK_MSG_RULE_MATCHING)
{
... and put one more copy below (make it a function in real life):
if (callback != NULL &&
((message == CALLBACK_MSG_RULE_MATCHING && (which & CALLBACK_MATCHES)) ||
(message == CALLBACK_MSG_RULE_NOT_MATCHING && (which & CALLBACK_NON_MATCHES))))
{
Py_INCREF(callback);
This looks a bit like the problem solved in https://github.com/hillu/go-yara/commit/9913b4f8317c9c546a48b5e902395f3a2a463495.
@hillu Agreed, and my patch should address it though I'm having trouble getting reliable measurements out of it.
Here's totally trivial way to reproduce it: Scan 1000 small files with this rule, twice to rule out caching:
rule test {
meta:
hash = "3bd096014dda1e8a430bb29333834926be39ff95540ed5fc651666c7c3b99107"
strings:
$whatever = "sdklj34lksdlk34lkasdklqwklewlkklwrfklk4lk534"
condition:
$whatever
}
Then use your favorite text editor to copy the hash line 9999 times and scan again, twice to rule out caching.
I can reproduce this now and have confirmed my fix is correct (and simpler IMO). Here's the exact steps I took:
for i in $(jot 1000 1 1000); do dd if=/dev/random of=$i bs=1m count=1; done
to get 1000 1MB files in a directory (they are in ~/Desktop/random_samples
for me).test
rule with a single metadata value.Output should look like this:
wxs@wxs-mbp yara-python % PYTHONPATH=build/lib.macosx-10.9-x86_64-3.7 python3 ~/test.py
=======================================================
Scanning with callback: <function callback at 0x7ff45628b050> which: 3
Time: 0:00:02.016841
=======================================================
Scanning with callback: <function callback at 0x7ff45628b050> which: 1
Time: 0:00:02.000356
=======================================================
Scanning with callback: <function callback at 0x7ff45628b050> which: 2
Time: 0:00:02.008413
wxs@wxs-mbp yara-python %
Output should look like this:
wxs@wxs-mbp yara-python % PYTHONPATH=build/lib.macosx-10.9-x86_64-3.7 python3 ~/test.py
=======================================================
Scanning with callback: <function callback at 0x7fbed13ab050> which: 3
Time: 0:00:16.998390
=======================================================
Scanning with callback: <function callback at 0x7fbed13ab050> which: 1
Time: 0:00:17.057019
=======================================================
Scanning with callback: <function callback at 0x7fbed13ab050> which: 2
Time: 0:00:16.883535
wxs@wxs-mbp yara-python %
This demonstrates that we are spending a lot of time in converting the metadata to python objects.
Output should look like this:
wxs@wxs-mbp yara-python % PYTHONPATH=build/lib.macosx-10.9-x86_64-3.7 python3 ~/test.py
=======================================================
Scanning with callback: <function callback at 0x7fc2b8386050> which: 3
Time: 0:00:15.993657
=======================================================
Scanning with callback: <function callback at 0x7fc2b8386050> which: 1
Time: 0:00:01.962936
=======================================================
Scanning with callback: <function callback at 0x7fc2b8386050> which: 2
Time: 0:00:16.578360
wxs@wxs-mbp yara-python %
This shows that in the case where we have no matches and the user only wants to see matches (which: 1
) we are skipping creating all the metadata because 1) it is not a match so it isn't going to be stored in the matches list and 2) the user doesn't want to see it in the callback.
Here is my test script:
import os
import yara
from datetime import datetime
dir_ = "/Users/wxs/Desktop/random_samples"
rulesfile = "/Users/wxs/Desktop/test.yar"
def callback(data):
return yara.CALLBACK_CONTINUE
def scan(rules, user_callback, which):
print("=======================================================")
print(f"Scanning with callback: {user_callback} which: {which}")
start = datetime.now()
for file_ in os.listdir(dir_):
rules.match(os.path.join(dir_, file_), callback=user_callback, which_callbacks=which)
end = datetime.now()
print("Time: ", end - start)
rules = yara.compile(rulesfile)
scan(rules, callback, yara.CALLBACK_ALL)
scan(rules, callback, yara.CALLBACK_MATCHES)
scan(rules, callback, yara.CALLBACK_NON_MATCHES)
Turns out this is very much the same optimization that @hillu mentioned above in his go bindings.
This also begs the question of if we want to change the default behavior of yara-python to use CALLBACK_MATCHES
by default instead of CALLBACK_ALL
. This is likely to be a significant performance gain for the common case (lots of rules that don't match) but would come at the expense of breaking scripts which rely upon it.
yara-python slows down by 40-70% if lots of rules with meta data are used:
btw, without metadata yara-python approaches the speed of yara.c with -p 1
tested with the 2481 rules (10 meta: lines each) from https://valhalla.nextron-systems.com/ (click "demo" and "get rules") but you'll see the effect with any .yar with 1000+ rules and multiple lines of metadata.
code used for testing: