Current theory:
An Artifact is a subclass of ObjectProperties. The ObjectProperties class has a parent method that creates a new Object and saves it in self._parent. The parent method is called as part of the init method of the Observables object. So when you create one Artifact and multiple Observables using that artifact, they all have the same parent so no memory increase. When you create multiple artifacts but no Observables, the parent method is never called so a new Object is never created. If you create multiple Artifacts and multiple Observables you will end up with twice as many Artifact objects.
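A simplified sketch of the lazy-parent pattern being described (illustrative stand-ins, not the library's actual code):

class Object(object):
    def __init__(self, properties=None):
        self.properties = properties

class ObjectProperties(object):
    def __init__(self):
        self._parent = None

    @property
    def parent(self):
        # The wrapping Object is only created on first access, which happens
        # when an Observable is constructed from these properties.
        if self._parent is None:
            self._parent = Object(self)
        return self._parent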
This should happen for any Observable that is created from an object that is a subclass of ObjectProperties. In theory the following code should clean up everything:
artifact = Artifact(data, Artifact.TYPE_NETWORK)
obs = Observable(artifact)
# Do something with obs
del obs
del artifact
Theory is better than I would have gotten - alas, the delete doesn't actually delete the objects.
After a loop of 20:
Partition of a set of 298 objects. Total size = 55448 bytes.
 Index  Count   %     Size    % Cumulative  %  Referrers by Kind (class / dict of class)
     0     40  13     7040   13       7040  13  dict of cybox.objects.artifact_object.Artifact
     1     20   7     5600   10      12640  23  0x20bd930
     2     20   7     5600   10      18240  33  cybox.core.object.Object
     3     20   7     5600   10      23840  43  cybox.objects.artifact_object.Artifact
     4     20   7     5600   10      29440  53  dict of 0x20bd930
     5     20   7     5600   10      35040  63  dict of cybox.core.object.Object
     6     65  22     4864    9      39904  72  dict (no owner)
     7     12   4     3360    6      43264  78  dict of guppy.etc.Glue.Owner
     8      2   1     2096    4      45360  82  guppy.heapy.Classifiers.ByRetClaSet
     9     20   7     1280    2      46640  84  dict (no owner), dict of cybox.objects.artifact_object.Artifact
If you are willing to try out a code change in the library, you can do this and see if it cleans up the objects:
In the file cybox/common/object_properties.py add the following at line 67:
def __del__(self):
    del self._parent
Added - same result :(
Did you do this in combination with the call to del obs and del artifact in the loop?
Hey, @coolacid and @averagesecurityguy
I'm taking a look at this now. I think the parent thing may be a red herring. I tried converting them to weakref.proxy objects, with no success. I was able to reproduce the leak with just o = Object() in the loop, so my current thinking is that it's related to the related_objects somehow.
Ok, found it. There is logic that caches every Object when it's created. Let me verify the workaround, then I'll let you know.
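For context, the pattern being described is roughly the following (a simplified sketch, not the library's actual code): a field descriptor runs a hook after each assignment, and that hook files the new object into a module-level cache, which then holds a strong reference indefinitely.

_CACHE = {}   # module-level cache: strong references, never released

def _cache_hook(instance, value):
    # Runs after the id field is set on a new object.
    _CACHE[value] = instance

class TypedField(object):
    # Simplified stand-in for the descriptor cybox uses for fields like id_.
    def __init__(self, name, postset_hook=None):
        self.name = name
        self.postset_hook = postset_hook

    def __set__(self, instance, value):
        instance.__dict__["_" + self.name] = value
        if self.postset_hook is not None:
            self.postset_hook(instance, value)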
The error is this line (specifically, the postset_hook).
The workaround is to put this at the top of your script somewhere:
Object.id_.postset_hook = None
I'm going to leave this issue open, since that should not be the default behavior, and users shouldn't have to do that.
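For reference, the full workaround looks like this at the top of a script (the import is the one noted in the next comment):

from cybox.core import Object

# Disable the hook that caches every Object when its id is set. Caveat: this
# also disables anything that relies on the cache, such as resolving related
# objects when parsing, so treat it as a workaround rather than a fix.
Object.id_.postset_hook = None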
Confirmed against my test script. Also needed to import Object from cybox.core as well.
I'll implement this in my actual scripts and report back tomorrow.
Thanks!
Also needed to import Object from cybox.core as well.
Yes, sorry, I added that import to the top of my copy of the script, since I was testing creating different combinations of Observables, Objects, and Artifacts.
Confirmed - this appears to have solved the issues. Thanks!
I don't know if we want to remove that functionality. Tracking down the changes, it appears to have been the default behavior since 2015. It is hard to say who relies on the caching feature, and removing it might break other users' workflows. Note that there is a method to clear the global cache, cybox.utils.caches.cache_clear(), which you can use to clear the DictCache at any moment.
I can add some documentation on the Object class to reference and encourage the use of the aforementioned method to prevent the memory issue you are experiencing.
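A sketch of how that could be used in a long-running loop (cache_clear() is the method mentioned above; the surrounding loop is illustrative):

from cybox.utils.caches import cache_clear

for path in paths:    # hypothetical iterable of files to process
    # ... build an Artifact/Observable/Package for this file and send it ...
    cache_clear()     # drop the cached Objects between iterations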
If I remember correctly, it has to do with parsing CybOX content that contains related objects. I'm OK leaving this in, given the caveat I saw you added to the documentation in #308.
The workaround described by @emmanvg in #308 is the best we're going to be able to do right now. I'm worried that the current behavior is required by code that parses objects (particularly related objects). I'm going to leave this issue open, but remove it from the current milestone.
I created #317 to address better documentation of this issue, but I don't think we should actually change the caching behavior, for reasons discussed above.
When creating multiple Artifacts/Observables in a loop, memory usage seems to grow by the object size on each iteration. Watching the Python heap with hpy from guppy shows that each iteration keeps a copy of our Observable object even though we don't do anything with it.
Possibly related to #196.
Simple example:
dd count=10 bs=1M if=/dev/zero of=zeros
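The test script itself isn't reproduced here; a minimal reconstruction of the kind of loop being described, using guppy's hpy as mentioned above, might look like:

from guppy import hpy
from cybox.core import Observable
from cybox.objects.artifact_object import Artifact

heap = hpy()

with open("zeros", "rb") as f:
    data = f.read()

for _ in range(20):
    artifact = Artifact(data, Artifact.TYPE_NETWORK)
    obs = Observable(artifact)
    # obs is never used again, yet the heap keeps growing.

print(heap.heap())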
Running the above script with a zeros file created using the dd command shows dict of cybox.objects.artifact_object.Artifact increasing in size on each iteration. This only happens when a new Artifact and Observable are created each time through the loop; if the artifact creation line is moved outside the for loop, the memory doesn't run away.
Why this is a problem:
If we're looping through a bunch of files, creating new cybox packages, and sending them out, this bug causes memory to grow until either an OOM error occurs or the script finishes. There doesn't seem to be a way to clear the heap space between creating new packages.