CybOXProject / python-cybox

A Python library for parsing, manipulating, and generating CybOX content.
http://cybox.readthedocs.org/
BSD 3-Clause "New" or "Revised" License
77 stars 42 forks source link

Do not cache every created Object by default #290

Closed coolacid closed 6 years ago

coolacid commented 7 years ago

When creating multiple Artifacts/Observables in a loop memory usage seems to jump by the object size. When using hpy from guppy to watch the python heap, it shows each iteration keeps a copy of our Observable object even though we don't do anything with it.

Possible Related to #196

Simple example: dd count=10 bs=1M if=/dev/zero of=zeros

#!/usr/bin/env python

from cybox.core import Observable
from cybox.objects.artifact_object import Artifact
from guppy import hpy

with open("zeros", "rb") as f:
    data = f.read()

hp = hpy()
hp.setrelheap()
for i in range (0,100):
    artifact = Artifact(data, Artifact.TYPE_NETWORK)
    Observable(artifact)
    h = hp.heap()
    print h.byrcs

Running the above script with a zeros file created using the dd command above shows dict of cybox.objects.artifact_object.Artifact increasing in size each iteration.

This only happens every time you create a new Artifact and Observable. If the artifact creation line is outside of the for loop, the memory doesn't run away.

Why this is a problem:

If we're looping through a bunch of files, and creating new cybox packages and sending them out, this bug causes memory to run away until either an OOM error happens, or the script finishes. There doesn't seem to be a way to clear the heap space between creating new packages.

averagesecurityguy commented 7 years ago

Current theory:

An Artifact is a subclass of ObjectProperties. The ObjectProperties class has a parent method that creates a new Object and saves it in self._parent. The parent method is called as part of the init method of the Observables object. So when you create one Artifact and multiple Observables using that artifact, they all have the same parent so no memory increase. When you create multiple artifacts but no Observables, the parent method is never called so a new Object is never created. If you create multiple Artifacts and multiple Observables you will end up with twice as many Artifact objects.

This should happen for any Observable that is created from an object that is a subclass of ObjectProperties. In theory the following code should clean up everything:

artifact = Artifact(data, Artifact.TYPE_NETWORK)
obs = Observable(artifact)
// Do Something
del obs
del artifact
coolacid commented 7 years ago

Theory is better than I would have gotten - alas, the delete doesn't actually delete the objects.

After a loop of 20:

Partition of a set of 298 objects. Total size = 55448 bytes.
 Index  Count   %     Size   % Cumulative  % Referrers by Kind (class / dict of class)
     0     40  13     7040  13      7040  13 dict of cybox.objects.artifact_object.Artifact
     1     20   7     5600  10     12640  23 0x20bd930
     2     20   7     5600  10     18240  33 cybox.core.object.Object
     3     20   7     5600  10     23840  43 cybox.objects.artifact_object.Artifact
     4     20   7     5600  10     29440  53 dict of 0x20bd930
     5     20   7     5600  10     35040  63 dict of cybox.core.object.Object
     6     65  22     4864   9     39904  72 dict (no owner)
     7     12   4     3360   6     43264  78 dict of guppy.etc.Glue.Owner
     8      2   1     2096   4     45360  82 guppy.heapy.Classifiers.ByRetClaSet
     9     20   7     1280   2     46640  84 dict (no owner), dict of
                                             cybox.objects.artifact_object.Artifact
averagesecurityguy commented 7 years ago

If you are willing to try out a code change in the library, you can do this and see if it cleans up the objects:

In the file cybox/common/object_properties.py add the following at line 67:

    def __del__(self):
        del self._parent
coolacid commented 7 years ago

Added - same result :(

averagesecurityguy commented 7 years ago

Did you do this in combination with the call to del obs and del artifact in the loop?

gtback commented 7 years ago

Hey, @coolacid and @averagesecurityguy

I'm taking a look at this now. I think the parent thing may be a red herring. I tried converting them to weakref.proxy objects, with no success. I was able to reproduce the leak with just o = Object() in the loop, so my current thinking is that it's related to the related_objects somehow.

gtback commented 7 years ago

Ok, found it. There is logic that caches every Object when it's created. Let me verify the workaround then I'll let you know.

gtback commented 7 years ago

The error is this line (specifically, the postset_hook).

The workaround is to put this at the top of your script somewhere:

Object.id_.postset_hook = None

I'm going to leave this issue open, since that should not be the default behavior, and users shouldn't have to do that.

coolacid commented 7 years ago

Confirmed against my test script, Also needed to import Object from cybox.core as well.

I'll implement this in my actual scripts and report back tomorrow.

Thanks!

gtback commented 7 years ago

Also needed to import Object from cybox.core as well.

Yes, sorry, I added that import to the top of my copy the script, since I was testing creating different combinations of Observables, Objects, and Artifacts.

coolacid commented 7 years ago

Confirmed - this appears to have solved the issues. Thanks!

emmanvg commented 7 years ago

I don't know if we want to remove that functionality. Tracking down the changes, it appears to be the default behavior since 2015. It is hard to track and say who makes use of the caching feature, but removing it might introduce problems for other users workflow. Note that, there is a method to clear the global cache under cybox.utils.caches.cache_clear() that you can use to clear the DictCache at any moment.

I can add some documentation on the Object class to reference and encourage the use of the aforementioned method to prevent the memory issue you are experiencing.

gtback commented 6 years ago

If I remember correctly, it has to do with parsing CybOX content that contains related objects. I'm OK leaving this in, given the caveat I saw you added to the documentation in #308.

gtback commented 6 years ago

The workaround described by @emmanvg in #308 is the best we're going to be able to do right now. I'm worried that the current behavior is required by code that parses objects (particularly related objects). I'm going to leave this issue open, but remove it from the current milestone.

gtback commented 6 years ago

I created #317 to address better documentation of this issue, but I don't think we should actually change the caching behavior, for reasons discussed above.