VirusTotal / yara-python

The Python interface for YARA
http://virustotal.github.io/yara/
Apache License 2.0

yara.Rules object cannot be pickled #84

Closed (dimbtp closed this issue 6 years ago)

dimbtp commented 6 years ago

I tried to use multiprocessing for automated YARA analysis, but I ran into the following error:

Traceback (most recent call last):
  File "file.py", line 139, in scan_path
    pool.apply(func=worker, args=(path, filePath, rule_sets, num_first_bytes,))
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 244, in apply
    return self.apply_async(func, args, kwds).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
TypeError: can't pickle yara.Rules objects

rule_sets is a list of yara.Rules objects (the results of yara.compile('rule_path')).

Another problem: when I call pool.apply_async(func=worker, args=(path, filePath, rule_sets, num_first_bytes,)) instead of pool.apply, the worker function doesn't seem to execute at all (I put a print statement in worker, but nothing was printed).

I also tested the dill module, and it can't handle yara.Rules objects either.

The only solution I can think of is rewriting this in C/C++. Any advice on how to solve this problem? Thanks.

plusvic commented 6 years ago

The yara.Rules object is not a pure-Python object; it's implemented in a C extension and doesn't support pickling. Instead of passing the compiled rules to your workers, you could launch the workers first, pass them the rules in text form, and compile the rules in the worker function. In other words, instead of compiling the rules in the main process, each worker is responsible for compiling the rules itself.
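A minimal sketch of that approach, assuming Python 3 and a pool initializer that compiles the rules once per worker process. The names RULE_SOURCE, init_worker, scan_file, and scan_paths are hypothetical and not part of yara-python; only yara.compile(source=...) and Rules.match(filepath=...) are library calls. The worker returns plain rule names rather than yara.Match objects, so the results can be pickled on the way back.

```python
import multiprocessing

# Hypothetical rule text for illustration; in practice this would be read
# from your own rule files and passed around as a plain string.
RULE_SOURCE = 'rule contains_hello { strings: $a = "hello" condition: $a }'

# Per-process compiled rules; set by init_worker in each worker.
_rules = None


def init_worker(rule_source):
    """Run once in each worker process: compile the rules locally."""
    # Imported here so that this module can be loaded even where
    # yara-python is not installed.
    import yara

    global _rules
    _rules = yara.compile(source=rule_source)


def scan_file(file_path):
    """Scan one file and return only picklable data (path and rule names)."""
    matches = _rules.match(filepath=file_path)
    return file_path, [m.rule for m in matches]


def scan_paths(file_paths, rule_source, processes=None):
    """Scan many files in parallel; only strings cross process boundaries."""
    with multiprocessing.Pool(
        processes, initializer=init_worker, initargs=(rule_source,)
    ) as pool:
        return pool.map(scan_file, file_paths)
```

Calling scan_paths(list_of_paths, RULE_SOURCE) from code under an `if __name__ == "__main__":` guard would then return a list of (path, rule names) pairs. As a side note on apply_async: an exception raised for a task is only re-raised when .get() is called on the returned result object, which may be why the worker appeared to do nothing when the result was never fetched.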

wesinator commented 4 years ago

Similar issue with yara.Match:

  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '['REDACTED']'. Reason: 'TypeError("can't pickle yara.Match objects")'

(The yara.Match objects are stored in a field of the object that is returned from the pool.)

plusvic commented 4 years ago

Correct, yara.Match is defined by a Python C extension and therefore it doesn't support pickling.

wesinator commented 4 years ago

Would it be possible to change it to have a more normal Python object structure?

It also doesn't support __dict__, json.dumps(), and other useful object-representation functions.

If not, I'm going to end up writing a wrapper to convert it to a dict anyway.

Basically, I need a way to get the object's fields into a usable Python data structure.

plusvic commented 4 years ago

Implementing the pickle interface for objects defined in C is possible, but a bit tricky, so it's not in the roadmap. You can store the information in a pure Python object as you said.
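A rough sketch of such a conversion, assuming Python 3 and the documented yara.Match attributes rule, namespace, tags, meta, and strings. The helper names match_to_dict and string_match_to_dicts are hypothetical. The layout of match.strings depends on the yara-python version: older releases yield (offset, identifier, data) tuples, and the branch for newer releases (objects with identifier and instances) is an assumption that should be checked against the installed version.

```python
def string_match_to_dicts(string_match):
    """Normalise one entry of match.strings into a list of plain dicts."""
    if isinstance(string_match, tuple):
        # Older yara-python releases: (offset, identifier, data) tuples.
        offset, identifier, data = string_match
        return [
            {"offset": offset, "identifier": identifier, "data": data.hex()}
        ]
    # Assumption: newer releases expose objects with .identifier and
    # .instances, where each instance has .offset and .matched_data.
    return [
        {
            "offset": instance.offset,
            "identifier": string_match.identifier,
            "data": instance.matched_data.hex(),
        }
        for instance in string_match.instances
    ]


def match_to_dict(match):
    """Copy the fields of a yara.Match-like object into a plain dict."""
    strings = []
    for string_match in match.strings:
        strings.extend(string_match_to_dicts(string_match))
    return {
        "rule": match.rule,
        "namespace": match.namespace,
        "tags": list(match.tags),
        "meta": dict(match.meta),
        # Matched bytes are hex-encoded so the result is JSON-serialisable.
        "strings": strings,
    }
```

The resulting dict contains only built-in types, so it can be returned from a pool worker, pickled, or passed to json.dumps().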

wesinator commented 4 years ago

Is there a quick way (like a built-in function) to convert the C object to pure Python?

I searched for an answer but couldn't find one, so I wrote a snippet to manually build dicts from the fields: https://gist.github.com/wesinator/eda62d75e8bd437267477a887406d0c8

Thanks,