VirusTotal / yara-python

The Python interface for YARA
http://virustotal.github.io/yara/
Apache License 2.0
661 stars 181 forks source link

Metadata slowing down scans by 40-70% on 1000+ rules #170

Closed ruppde closed 3 years ago

ruppde commented 3 years ago

yara-python slows down by 40-70% if lots of rules with meta data are used:

$ python3 yara_simple_scanner.py -r valhalla-rules.yar ../samples
...
scan took: 15.46 - matches: 114 with "rules stripped_meta" 
scan took: 37.95 - matches: 114 with "rules original" 

One more round to rule out caching effects:
scan took: 13.05 - matches: 114 with "rules stripped_meta" 
scan took: 38.16 - matches: 114 with "rules original" 

btw, without metadata yara-python approaches the speed of yara.c with -p 1

time yara -r -p 1 valhalla-rules.yar ../samples|wc -l
114

real    0m10,542s
user    0m7,268s
sys 0m3,498s

tested with the 2481 rules (10 meta: lines each) from https://valhalla.nextron-systems.com/ (click "demo" and "get rules") but you'll see the effect with any .yar with 1000+ rules and multiple lines of metadata.

code used for testing:

#!/usr/bin/env python3

import yara      # pip install yara-python
import argparse
import plyara
import plyara.utils
from os import walk, path
from time import time

def yara_match(data):
    #print("yara_match: ", data)
    global count
    count += 1
    return yara.CALLBACK_CONTINUE

def do_scan(dirpath, rules, name):

    start = time()
    #print("scaning: ", path)
    global count
    count = 0
    for root, directories, files in walk(dirpath, followlinks=False):

        for filename in files:
            #print(root, filename)
            filePath = path.join(root, filename)
            #print("scanning " , filePath)
            try:
                matches = rules.match(filePath, 
                        fast=True,
                        # callback makes no difference
                        #which_callbacks=yara.CALLBACK_MATCHES,
                        #callback=yara_match,
                        )
                if matches:
                    for match in matches:
                        #print(match.rule, filePath)
                        count += 1
            except Exception as e:
                print("ERROR", "FileScan", "Cannot YARA scan file: %s" % filePath)

                if not args.recursive:
                    break

    end = time()
    print("scan took: %0.2f - matches: %d with \"%s\" " % ( ( end - start ), count, name ))

############################### main() ###########################################

def main():

    # Argument parsing
    parser = argparse.ArgumentParser(description='yara_simple_scanner.py')
    parser.add_argument('RULES_FILE', help='Path to rules file')
    parser.add_argument('DIR', help='Path to scan')
    parser.add_argument('-r','--recursive',  help='recursively search directories',  action="store_true")
    parser.add_argument('-p',  help='Ignored, just here for lazyness purposes to have compatible params', action="store_true")

    args = parser.parse_args()

    rulesfile = args.RULES_FILE

    print("Loading original rules from %s" % (rulesfile))
    rules = yara.compile(filepaths={
      'rules':rulesfile
    })

    print("Creating copy of rules with stripped metadata")
    parser = plyara.Plyara()
    with open(rulesfile, 'r') as fh:
        yararules = parser.parse_string(fh.read())
    stripped_rules_plain = []
    for rule in yararules:
        rule['metadata'] = ""
        stripped_rules_plain.append(plyara.utils.rebuild_yara_rule(rule) )

    rules_stripped = yara.compile(source=''.join(stripped_rules_plain))
    print("Finished loading rules")

    dirpath = str(args.DIR)
    print("Scaning starts\n")
    do_scan(dirpath, rules_stripped, "rules stripped_meta")
    do_scan(dirpath, rules, "rules original")
    print("\nOne more round to rule out caching effects:")
    do_scan(dirpath, rules_stripped, "rules stripped_meta")
    do_scan(dirpath, rules, "rules original")

if __name__ == '__main__':
    main()
ruppde commented 3 years ago

slowdown reason is the following block in yara_callback. yara-python still matches correctly after /**/ing it, but 2-4 times faster. whatever functionality needs this data is better gets if for each hit and not for each file scanned.

  yr_rule_tags_foreach(rule, tag)
  {
    object = PY_STRING(tag);
    PyList_Append(tag_list, object);
    Py_DECREF(object);
  }

  yr_rule_metas_foreach(rule, meta)
  {
    if (meta->type == META_TYPE_INTEGER)
      object = Py_BuildValue("i", meta->integer);
    else if (meta->type == META_TYPE_BOOLEAN)
      object = PyBool_FromLong((long) meta->integer);
    else
      object = PY_STRING(meta->string);

    PyDict_SetItemString(meta_list, meta->identifier, object);
    Py_DECREF(object);
  }

  yr_rule_strings_foreach(rule, string)
  {
    yr_string_matches_foreach(context, string, m)
    {
      object = PyBytes_FromStringAndSize((char*) m->data, m->data_length);

      tuple = Py_BuildValue(
          "(L,s,O)",
          m->base + m->offset,
          string->identifier,
          object);

      PyList_Append(string_list, tuple);

      Py_DECREF(object);
      Py_DECREF(tuple);
    }
  }
ruppde commented 3 years ago

Here's a fix which passes all tests but isn't beautiful (I'm not a c guy ;)

Move the code block from the message above under

  if (message == CALLBACK_MSG_RULE_MATCHING)
  {

... and put one more copy below (make it a function in real life):

  if (callback != NULL &&
      ((message == CALLBACK_MSG_RULE_MATCHING && (which & CALLBACK_MATCHES)) ||
       (message == CALLBACK_MSG_RULE_NOT_MATCHING && (which & CALLBACK_NON_MATCHES))))
  {
    Py_INCREF(callback);
hillu commented 3 years ago

This looks a bit like the problem solved in https://github.com/hillu/go-yara/commit/9913b4f8317c9c546a48b5e902395f3a2a463495.

wxsBSD commented 3 years ago

@hillu Agreed, and my patch should address it though I'm having trouble getting reliable measurements out of it.

ruppde commented 3 years ago

Here's totally trivial way to reproduce it: Scan 1000 small files with this rule, twice to rule out caching:

rule test {
    meta:
        hash = "3bd096014dda1e8a430bb29333834926be39ff95540ed5fc651666c7c3b99107"        
    strings:
        $whatever = "sdklj34lksdlk34lkasdklqwklewlkklwrfklk4lk534"
    condition:
    $whatever
}

Then use your favorite text editor to copy the hash line 9999 times and scan again, twice to rule out caching.

wxsBSD commented 3 years ago

I can reproduce this now and have confirmed my fix is correct (and simpler IMO). Here's the exact steps I took:

Output should look like this:

wxs@wxs-mbp yara-python % PYTHONPATH=build/lib.macosx-10.9-x86_64-3.7 python3 ~/test.py
=======================================================
Scanning with callback: <function callback at 0x7ff45628b050> which: 3
Time:  0:00:02.016841
=======================================================
Scanning with callback: <function callback at 0x7ff45628b050> which: 1
Time:  0:00:02.000356
=======================================================
Scanning with callback: <function callback at 0x7ff45628b050> which: 2
Time:  0:00:02.008413
wxs@wxs-mbp yara-python %

Output should look like this:

wxs@wxs-mbp yara-python % PYTHONPATH=build/lib.macosx-10.9-x86_64-3.7 python3 ~/test.py
=======================================================
Scanning with callback: <function callback at 0x7fbed13ab050> which: 3
Time:  0:00:16.998390
=======================================================
Scanning with callback: <function callback at 0x7fbed13ab050> which: 1
Time:  0:00:17.057019
=======================================================
Scanning with callback: <function callback at 0x7fbed13ab050> which: 2
Time:  0:00:16.883535
wxs@wxs-mbp yara-python %

This demonstrates that we are spending a lot of time in converting the metadata to python objects.

Output should look like this:

wxs@wxs-mbp yara-python % PYTHONPATH=build/lib.macosx-10.9-x86_64-3.7 python3 ~/test.py
=======================================================
Scanning with callback: <function callback at 0x7fc2b8386050> which: 3
Time:  0:00:15.993657
=======================================================
Scanning with callback: <function callback at 0x7fc2b8386050> which: 1
Time:  0:00:01.962936
=======================================================
Scanning with callback: <function callback at 0x7fc2b8386050> which: 2
Time:  0:00:16.578360
wxs@wxs-mbp yara-python %

This shows that in the case where we have no matches and the user only wants to see matches (which: 1) we are skipping creating all the metadata because 1) it is not a match so it isn't going to be stored in the matches list and 2) the user doesn't want to see it in the callback.

Here is my test script:

import os
import yara
from datetime import datetime

dir_ = "/Users/wxs/Desktop/random_samples"
rulesfile = "/Users/wxs/Desktop/test.yar"

def callback(data):
    return yara.CALLBACK_CONTINUE

def scan(rules, user_callback, which):
    print("=======================================================")
    print(f"Scanning with callback: {user_callback} which: {which}")
    start = datetime.now()
    for file_ in os.listdir(dir_):
        rules.match(os.path.join(dir_, file_), callback=user_callback, which_callbacks=which)
    end = datetime.now()
    print("Time: ", end - start)

rules = yara.compile(rulesfile)
scan(rules, callback, yara.CALLBACK_ALL)
scan(rules, callback, yara.CALLBACK_MATCHES)
scan(rules, callback, yara.CALLBACK_NON_MATCHES)

Turns out this is very much the same optimization that @hillu mentioned above in his go bindings.

This also begs the question of if we want to change the default behavior of yara-python to use CALLBACK_MATCHES by default instead of CALLBACK_ALL. This is likely to be a significant performance gain for the common case (lots of rules that don't match) but would come at the expense of breaking scripts which rely upon it.