STIXProject / python-stix

A Python library for parsing, manipulating, and generating STIX content.
http://stix.readthedocs.org/
BSD 3-Clause "New" or "Revised" License
239 stars 88 forks source link

Improve performance of to_xml() #163

Closed gtback closed 10 years ago

gtback commented 10 years ago

The big bottleneck of to_xml() is the recursive inspection for namespaces:

import cProfile
from stix.core.stix_package import STIXPackage
sp = STIXPackage.from_xml("/home/gback/share/stix1.1.1-samples-all/APT1/Appendix_G_IOCs_Full.xml")
cProfile.run("sp.to_xml()", sort=('cumtime'))
         11099453 function calls (11009579 primitive calls) in 5.866 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    5.866    5.866 <string>:1(<module>)
        1    0.012    0.012    5.866    5.866 base.py:36(to_xml)
        1    0.000    0.000    2.170    2.170 base.py:25(_get_namespaces)
        1    0.000    0.000    2.170    2.170 nsparser.py:44(get_namespaces)
   9965/1    0.043    0.000    2.170    2.170 nsparser.py:21(_get_namespace_set)
        1    0.000    0.000    1.894    1.894 stix_package.py:284(to_obj)
6827/3303    0.037    0.000    1.882    0.001 observable.py:121(to_obj)
        1    0.003    0.003    1.860    1.860 observable.py:239(to_obj)
    26126    0.312    0.000    1.823    0.000 base.py:56(_get_children)
     3256    0.015    0.000    1.811    0.001 object.py:80(to_obj)
3277/3256    0.006    0.000    1.781    0.001 object_properties.py:92(to_obj)
6113/3258    0.170    0.000    1.776    0.001 __init__.py:90(to_obj)
  3411562    0.750    0.000    1.447    0.000 {isinstance}
    12359    0.391    0.000    1.328    0.000 inspect.py:247(getmembers)
     6794    0.011    0.000    1.180    0.000 nsparser.py:14(_get_observable_namespaces)
...
bworrell commented 10 years ago

I took a pass at adding parameters to to_xml() to allow developers to bypass the automatic collection of namespaces in hopes that it would improve performance a bit.

The https://github.com/STIXProject/python-stix/commit/820931838676b0fb027b6fc4a5860db3e68807e2 commit introduces an auto_namespace parameter with a default value of True. When set to False, the namespaces and schemalocations are all pulled from the ns_dict and schemaloc_dict parameters.

I ran the following script to get a feel for the performance difference:

#!/usr/bin/env python
from stix.core import STIXPackage

FN = "Appendix_G_IOCs_Full.xml"
print "[-] Parsing", FN

PACKAGE = STIXPackage.from_xml(FN)
NS_DICT = {
       'http://cybox.mitre.org/objects#WinDriverObject-3': 'WinDriverObj',
       'http://stix.mitre.org/common-1': 'stixCommon',
       'http://data-marking.mitre.org/extensions/MarkingStructure#Terms_Of_Use-1': 'TOUMarking',
       'http://www.w3.org/2001/XMLSchema-instance': 'xsi',
       'http://cybox.mitre.org/objects#WinProcessObject-2': 'WinProcessObj',
       'http://cybox.mitre.org/common-2': 'cyboxCommon',
       'http://cybox.mitre.org/objects#WinServiceObject-2': 'WinServiceObj',
       'http://cybox.mitre.org/objects#FileObject-2': 'FileObj',
       'http://cybox.mitre.org/objects#ProcessObject-2': 'ProcessObj',
       'http://stix.mitre.org/stix-1': 'stix',
       'http://cybox.mitre.org/cybox-2': 'cybox',
       'http://stix.mitre.org/Incident-1': 'incident',
       'http://data-marking.mitre.org/Marking-1': 'marking',
       'http://cybox.mitre.org/default_vocabularies-2': 'cyboxVocabs',
       'http://stix.mitre.org/extensions/TestMechanism#OpenIOC2010-1': 'stix-openioc',
       'http://stix.mitre.org/default_vocabularies-1': 'stixVocabs',
       'http://cybox.mitre.org/objects#WinRegistryKeyObject-2': 'WinRegistryKeyObj',
       'http://cybox.mitre.org/objects#WinExecutableFileObject-2': 'WinExecutableFileObj',
       'http://example.com': 'example',
       'http://cybox.mitre.org/objects#WinFileObject-2': 'WinFileObj',
       'http://cybox.mitre.org/objects#WinHandleObject-2': 'WinHandleObj',
       'http://www.mandiant.com': 'mandiant',
       'http://stix.mitre.org/TTP-1': 'ttp',
       'http://stix.mitre.org/Indicator-2': 'indicator'
}

SCHEMALOC_DICT = {
         'http://cybox.mitre.org/common-2': 'http://cybox.mitre.org/XMLSchema/common/2.1/cybox_common.xsd',
         'http://cybox.mitre.org/cybox-2': 'http://cybox.mitre.org/XMLSchema/core/2.1/cybox_core.xsd',
         'http://cybox.mitre.org/default_vocabularies-2': 'http://cybox.mitre.org/XMLSchema/default_vocabularies/2.1/cybox_default_vocabularies.xsd',
         'http://cybox.mitre.org/objects#FileObject-2': 'http://cybox.mitre.org/XMLSchema/objects/File/2.1/File_Object.xsd',
         'http://cybox.mitre.org/objects#ProcessObject-2': 'http://cybox.mitre.org/XMLSchema/objects/Process/2.1/Process_Object.xsd',
         'http://cybox.mitre.org/objects#WinDriverObject-2': '../../cybox/objects/Win_Driver_Object.xsd',
         'http://cybox.mitre.org/objects#WinDriverObject-3': 'http://cybox.mitre.org/XMLSchema/objects/Win_Driver/3.0/Win_Driver_Object.xsd',
         'http://cybox.mitre.org/objects#WinExecutableFileObject-2': 'http://cybox.mitre.org/XMLSchema/objects/Win_Executable_File/2.1/Win_Executable_File_Object.xsd',
         'http://cybox.mitre.org/objects#WinFileObject-2': 'http://cybox.mitre.org/XMLSchema/objects/Win_File/2.1/Win_File_Object.xsd',
         'http://cybox.mitre.org/objects#WinHandleObject-2': 'http://cybox.mitre.org/XMLSchema/objects/Win_Handle/2.1/Win_Handle_Object.xsd',
         'http://cybox.mitre.org/objects#WinProcessObject-2': 'http://cybox.mitre.org/XMLSchema/objects/Win_Process/2.1/Win_Process_Object.xsd',
         'http://cybox.mitre.org/objects#WinRegistryKeyObject-2': 'http://cybox.mitre.org/XMLSchema/objects/Win_Registry_Key/2.1/Win_Registry_Key_Object.xsd',
         'http://cybox.mitre.org/objects#WinServiceObject-2': 'http://cybox.mitre.org/XMLSchema/objects/Win_Service/2.1/Win_Service_Object.xsd',
         'http://data-marking.mitre.org/Marking-1': 'http://stix.mitre.org/XMLSchema/data_marking/1.1.1/data_marking.xsd',
         'http://data-marking.mitre.org/extensions/MarkingStructure#Terms_Of_Use-1': 'http://stix.mitre.org/XMLSchema/extensions/marking/terms_of_use/1.0/terms_of_use_marking.xsd',
         'http://stix.mitre.org/Incident-1': 'http://stix.mitre.org/XMLSchema/incident/1.1.1/incident.xsd',
         'http://stix.mitre.org/Indicator-2': 'http://stix.mitre.org/XMLSchema/indicator/2.1.1/indicator.xsd',
         'http://stix.mitre.org/TTP-1': 'http://stix.mitre.org/XMLSchema/ttp/1.1.1/ttp.xsd',
         'http://stix.mitre.org/common-1': 'http://stix.mitre.org/XMLSchema/common/1.1.1/stix_common.xsd',
         'http://stix.mitre.org/default_vocabularies-1': 'http://stix.mitre.org/XMLSchema/default_vocabularies/1.1.1/stix_default_vocabularies.xsd',
         'http://stix.mitre.org/extensions/TestMechanism#OpenIOC2010-1': 'http://stix.mitre.org/XMLSchema/extensions/test_mechanism/open_ioc_2010/1.1.1/open_ioc_2010_test_mechanism.xsd',
         'http://stix.mitre.org/extensions/Test_Mechanism#OpenIOC2010-1': '../../extensions/test_mechanism/open_ioc_2010_test_mechanism.xsd',
         'http://stix.mitre.org/stix-1': 'http://stix.mitre.org/XMLSchema/core/1.1.1/stix_core.xsd'
}

def test_auto_namespace():
    foo = PACKAGE.to_xml()

def test_explicit_namespace():
    foo = PACKAGE.to_xml(auto_namespace=False, ns_dict=NS_DICT,
                         schemaloc_dict=SCHEMALOC_DICT)

if __name__ == "__main__":
    from timeit import Timer
    t_auto = Timer("test_auto_namespace()", "from __main__ import test_auto_namespace")
    print "test_auto_namespace: ", t_auto.repeat(5, 1)

    t_explicit = Timer("test_explicit_namespace()", "from __main__ import test_explicit_namespace")
    print "test_explicit_namespace: ", t_explicit.repeat(5, 1)

Which resulted with:


$ python -W ignore ./time_turnaround.py
[-] Parsing Appendix_G_IOCs_Full.xml
test_auto_namespace:  [3.804380178451538, 3.7833778858184814, 3.786379098892212, 3.7823779582977295, 3.7763779163360596]
test_explicit_namespace:  [1.4481451511383057, 1.4481439590454102, 1.4461450576782227, 1.4471449851989746, 1.4501450061798096]

The test_auto_namespace() method used all default to_xml() params and usually took about 3.8 seconds to run. The test_explicit_namespace() method used pre-crafted namespace and schemalocation dictionaries and passed them into to_xml(), also setting auto_namespace=False which averaged around 1.45 seconds to run.

I used the Mandiant APT1 Full sample document (also used in the cProfile run) for these tests. It seems like a big improvement if you know exactly what namespaces you're going to be exporting.

Note: I've never used timeit before..so I could have totally misused it :)

gtback commented 10 years ago

I see a speedup of around 50% in cProfile using the code you've added.

cProfile.run("sp.to_xml(auto_namespace=False, ns_dict=NS_DICT, schemaloc_dict=SCHEMALOC_DICT)", sort=('cumtime'))

I vote for merging the code and closing this!

iglocska commented 10 years ago

I'm getting very different results from you guys when running my STIX export for MISP. An event with ~800 attributes exports in about 2 seconds compared to the previous 35ish. Amazing job, thanks a lot!

gtback commented 10 years ago

Glad this helped, @iglocska ! I do most of my development in a VM, so I'm not surprised that your performance improved even more than mine.

usrlocalben commented 10 years ago

Why closed? This is not a solution to the problem.

I spent a few hours looking at it, and my takeaway is that the child-list generator is terribly expensive, since it's done through introspection and makes an incomprehensibly large number of isinstance/hasattr/dir calls. The bindings layer is able to do this without all that cost when rendering -- why can't the xml rendering also accumulate the NS-map and return it back to the root to_xml call to inject into the final output stream? After all, it doesn't really stream, since the intial call returns a complete buffer. Even if all of the work was done -twice- : once to accumulate the namespaces, and another to render it with the namespaces included -- it may still be orders of magnitude faster.

gtback commented 10 years ago

We try to avoid "post-processing" the output of the bindings, since the way it generates output using string buffers has proven to be rather fragile.

You are certainly correct that it may be faster; I have never claimed that there is no way to further optimize this code. @bworrell's changes were a quick way to get >50% (and in @iglocska's case, >90%) performance improvement, so that was enough for me to close this issue. If you have further improvements that create demonstrably better performance, we'd be happy to accept them.

gtback commented 10 years ago

abfbdedd5ad7b077cff03ec3579eb9834dd50996 also got us an additional 10% improvement on the sample file above (not bad for a 1-character change)... thanks @bworrell

usrlocalben commented 10 years ago

On 8/28/2014 3:11 PM, Greg Back wrote:

abfbded https://github.com/STIXProject/python-stix/commit/abfbdedd5ad7b077cff03ec3579eb9834dd50996 also got us an additional 10% improvement on the sample file above (not bad for a 1-character change)... thanks @bworrell https://github.com/bworrell

— Reply to this email directly or view it on GitHub https://github.com/STIXProject/python-stix/issues/163#issuecomment-53780375.Web Bug from https://github.com/notifications/beacon/1012320__eyJzY29wZSI6Ik5ld3NpZXM6QmVhY29uIiwiZXhwaXJlcyI6MTcyNDg3MjMxNiwiZGF0YSI6eyJpZCI6Mzg2MzAwODl9fQ==--fc0cc319a7f4d63f781fa36dd0c81fd043653a4a.gif

I'd revert it -- cStringIO doesn't do unicode. :(

gtback commented 10 years ago

We convert everything to UTF-8 before passing it to the bindings (or if not, we should be :wink:)

usrlocalben commented 10 years ago

mixed unicode & utf-8, as mentioned in https://github.com/STIXProject/python-stix/pull/179.

bworrell commented 10 years ago

I've been playing around in two branches (https://github.com/STIXProject/python-stix/tree/speedup and https://github.com/CybOXProject/python-cybox/tree/speedup) and have been getting some decent performance improvements:

Before (https://github.com/STIXProject/python-stix/commit/820931838676b0fb027b6fc4a5860db3e68807e2):

[-] Parsing Appendix_G_IOCs_Full.xml
test_auto_namespace(): AVG 3.821000 : [3.8319997787475586, 3.796999931335449, 3.7940001487731934, 3.809999942779541, 3.871999979019165]
test_auto_no_schemaloc(): AVG 3.035800 : [3.0390000343322754, 3.0429999828338623, 3.0420000553131104, 3.0279998779296875, 3.0269999504089355]
test_explicit_namespace(): AVG 1.461200 : [1.4670000076293945, 1.4590001106262207, 1.4560000896453857, 1.4599997997283936, 1.4639999866485596]
test_explicit_namespace_no_pretty(): AVG 1.241400 : [1.24399995803833, 1.2360000610351562, 1.241999864578247, 1.2430000305175781, 1.2420001029968262]
test_explicit_namespace_no_pretty_no_schemaloc(): AVG 1.238200 : [1.2409999370574951, 1.2380001544952393, 1.2349998950958252, 1.2390000820159912, 1.2379999160766602]

Right now (e8d4def07d56eaf36c35cd97ba8cc379e53ead44):

[-] Parsing Appendix_G_IOCs_Full.xml
test_auto_namespace(): AVG 1.383805 : [1.3847990036010742, 1.395442008972168, 1.3858788013458252, 1.378965139389038, 1.3739409446716309]
test_auto_no_schemaloc(): AVG 1.377758 : [1.377091884613037, 1.377608060836792, 1.3745770454406738, 1.3743751049041748, 1.3851358890533447]
test_auto_no_pretty(): AVG 1.325101 : [1.343013048171997, 1.319584846496582, 1.3266880512237549, 1.3188409805297852, 1.3173799514770508]
test_explicit_namespace(): AVG 1.021751 : [1.0183520317077637, 1.0169179439544678, 1.0358390808105469, 1.0190579891204834, 1.0185868740081787]
test_explicit_namespace_no_pretty(): AVG 0.961635 : [0.9630279541015625, 0.9609689712524414, 0.9626798629760742, 0.960273027420044, 0.9612250328063965]
test_explicit_namespace_no_pretty_no_schemaloc(): AVG 0.958724 : [0.9744021892547607, 0.9552838802337646, 0.9550559520721436, 0.9544360637664795, 0.9544429779052734]

Granted, I probably broke a lot of things and don't realize it yet :) The code is all pretty nasty and there are poorly architected redundancies everywhere since I was just hacking around, but going from 3.82 to 1.38 is a nice improvement on auto-namespace output. That's about the same improvement we saw after introducing the auto_namespace parameter.

bworrell commented 10 years ago

Reworked both python-stix and python-cybox to_obj() and to_xml() (https://github.com/STIXProject/python-stix/commit/3bda593bc4327a1982d0b29912b59cc4959c5dcb, https://github.com/CybOXProject/python-cybox/commit/f68a60fd781902cde1a71bac9ac40b54d4dc59ff) calls to pass around a NamespaceInfo object that collects namespaces during to_obj(), thereby removing the Python object walk for namespace collection.

As of https://github.com/STIXProject/python-stix/commit/3bda593bc4327a1982d0b29912b59cc4959c5dcb

[-] Parsing Appendix_G_IOCs_Full.xml
test_auto_namespace(): AVG 1.178540 : [1.1777551174163818, 1.17979097366333, 1.1722688674926758, 1.180130958557129, 1.182753086090088]
test_auto_no_schemaloc(): AVG 1.166549 : [1.1674880981445312, 1.1656649112701416, 1.1567730903625488, 1.174548864364624, 1.1682698726654053]
test_auto_no_pretty(): AVG 1.105770 : [1.11037015914917, 1.1059870719909668, 1.1029610633850098, 1.1092350482940674, 1.100295066833496]
test_explicit_namespace(): AVG 1.070005 : [1.044726848602295, 1.0948448181152344, 1.0599980354309082, 1.0786218643188477, 1.0718309879302979]
test_explicit_namespace_no_pretty(): AVG 0.996998 : [1.0012049674987793, 0.9895370006561279, 0.995305061340332, 1.0026469230651855, 0.9962971210479736]
test_explicit_namespace_no_pretty_no_schemaloc(): AVG 0.999876 : [1.013319969177246, 0.993354082107544, 0.9977469444274902, 1.0052127838134766, 0.9897480010986328]

Narrowed the gap between enabling and disabling auto_namespace and overall improved the serialization times.

Calling obj = to_obj() and then obj.export(...) produces these times:

test_obj_create_serial(): AVG 1.046428 : [1.047217845916748, 1.0400071144104004, 1.0564820766448975, 1.042720079421997, 1.0457139015197754]

I did a bit of code cleanup in a846f032fe51ad1eeeacc8a48998bfe6d648de30 but I'm sure there is more that could be done. I'm a bit worried about the brittleness of some of my speed hacks causing problems down the road (or now and I just haven't noticed any issues yet) and the weirdness of the ns_info param on to_obj() in python-cybox and python-stix.

usrlocalben commented 10 years ago

You win the internet.

iglocska commented 10 years ago

Woah, awesome work! Can't wait to play around with it tomorrow!

bworrell commented 10 years ago

Thanks @benjamin9999 and @iglocska! If you guys mess around with it please let us know if something blows up :)