TAXIIProject / libtaxii

A Python library for handling TAXII Messages and invoking TAXII Services.
http://libtaxii.readthedocs.org/
BSD 3-Clause "New" or "Revised" License

Enable parsing of huge XML files #18

Open MarkDavidson opened 11 years ago

MarkDavidson commented 11 years ago

libtaxii does not currently support parsing huge XML files.

There are two aspects to this issue:

  1. Decide if there is any downside to XMLParser(huge_tree=True) always being on
  2. Enable XMLParser(huge_tree=True)
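
For context, here is a minimal sketch of what the flag changes in lxml itself, independent of libtaxii. It assumes libxml2's default nesting limit of 256 elements (which, as far as I know, is what current builds use); `try_parse` is a hypothetical helper written for this illustration.

```python
from lxml import etree


def try_parse(xml_bytes, huge_tree):
    """Parse xml_bytes and return the root element, or None if lxml rejects it."""
    parser = etree.XMLParser(huge_tree=huge_tree)
    try:
        return etree.fromstring(xml_bytes, parser)
    except etree.XMLSyntaxError:
        # Assumption: the default parser refuses documents nested deeper
        # than libxml2's built-in limit (256 in the builds I know of).
        return None


if __name__ == '__main__':
    depth = 300
    doc = (b'<x>' * depth) + (b'</x>' * depth)
    print('default parser accepted it:', try_parse(doc, huge_tree=False) is not None)
    print('huge_tree parser accepted it:', try_parse(doc, huge_tree=True) is not None)
```

With the default settings the 300-level document should be rejected, and with huge_tree=True it should parse. That protective limit is what would be given up by turning the flag on unconditionally.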

Refs:
http://stackoverflow.com/questions/11850345/using-python-lxml-etree-for-huge-xml-files
http://lxml.de/api/lxml.etree.XMLParser-class.html

traut commented 7 years ago

This issue is reproducible again with the latest libtaxii.

daybarr commented 5 years ago

> Decide if there is any downside to XMLParser(huge_tree=True) always being on

I think there is.

I tried it, and with some deeply nested content I could make Python 2.7.15 and Python 3.7.2 on Windows crash silently with a non-zero exit code. No Python traceback, no crash message. It just quits.

So enabling huge_tree wouldn't be ideal if libtaxii were used in a web server that receives and attempts to parse untrusted messages: it would be a simple vector for DoS.

Here is the script to reproduce:

from __future__ import print_function

from lxml import etree
import libtaxii.messages_11 as tm11
import libtaxii.common

if __name__ == '__main__':
    num_levels = 100 * 1000
    inbox_msg_bytes = tm11.InboxMessage(
        message_id=tm11.generate_message_id(),
        content_blocks=[
            tm11.ContentBlock(
                content_binding='urn:example.com:huge_tree_issue:18',
                content=etree.Element('deep-nest'),
            )
        ]
    ).to_xml().replace(
        b'<deep-nest/>',
        (b'<x>' * num_levels) + (b'</x>' * num_levels)
    )

    print('XML is %d bytes long' % (len(inbox_msg_bytes),))
    print(inbox_msg_bytes[:400])
    print('...')
    print(inbox_msg_bytes[-400:])

    libtaxii.common.set_xml_parser(
        etree.XMLParser(
            attribute_defaults=False,
            dtd_validation=False,
            load_dtd=False,
            no_network=True,
            ns_clean=True,
            recover=False,
            remove_blank_text=False,
            remove_comments=False,
            remove_pis=False,
            strip_cdata=True,
            compact=True,
            # collect_ids=True,
            resolve_entities=False,
            huge_tree=True,  # <-- this is the setting that is different
        )
    )
    msg = tm11.get_message_from_xml(inbox_msg_bytes)
    print('Loaded %r' % (msg,))

Example output:

XML is 700442 bytes long
b'<taxii_11:Inbox_Message xmlns:taxii="http://taxii.mitre.org/messages/taxii_xml_binding-1" xmlns:taxii_11="http://taxii.mitre.org/messages/taxii_xml_binding-1.1" xmlns:tdq="http://taxii.mitre.org/query/taxii_default_query-1" message_id="153575343088057277"><taxii_11:Content_Block><taxii_11:Content_Binding binding_id="urn:example.com:huge_tree_issue:18"/><taxii_11:Content><x><x><x><x><x><x><x><x><x>'
...
b'/x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></taxii_11:Content></taxii_11:Content_Block></taxii_11:Inbox_Message>'

shell returned 127

Note that it never prints the "Loaded" message and returns an exit code of 127, indicating some kind of crash.
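
Since the failure only shows up as an exit code, one way to observe it (or to keep it from taking down a server process) is to run the parse in a child interpreter and inspect the return code. This is a rough sketch assuming Python 3.5+ for `subprocess.run`; `run_in_child` is a hypothetical helper, not something libtaxii provides.

```python
import subprocess
import sys


def run_in_child(source, stdin_bytes=b'', timeout=60):
    """Run Python source in a separate interpreter.

    Returns (exit_code, stdout_bytes). A hard crash in the child shows up
    here as a non-zero exit code instead of killing the calling process.
    """
    proc = subprocess.run(
        [sys.executable, '-c', source],
        input=stdin_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


if __name__ == '__main__':
    # Stand-in for a child that dies without a traceback.
    code, _ = run_in_child('import os; os._exit(127)')
    print('child exit code:', code)
```

In practice the child source would be the repro script above, with the XML passed on stdin. The cost is one process per message, so this is more of a diagnostic aid than a fix.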

The script runs fine if you reduce num_levels from 100,000 to 10,000, but a 700 KB input is not large enough to reject on size alone.
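
If size alone is not a usable filter, nesting depth might be. Below is a rough sketch of a streaming pre-check that uses the standard library's expat bindings to reject overly deep documents before they reach lxml. The names `check_max_depth` and `TooDeepError` and the default limit of 1000 are assumptions made up for this illustration, and I have not measured how expat itself copes with extreme inputs.

```python
import xml.parsers.expat


class TooDeepError(ValueError):
    """Raised when element nesting exceeds the allowed depth."""


def check_max_depth(xml_bytes, max_depth=1000):
    """Stream through xml_bytes without building a tree.

    Returns the deepest nesting level seen, or raises TooDeepError as soon
    as nesting goes past max_depth.
    """
    state = {'depth': 0, 'deepest': 0}

    def on_start(name, attrs):
        state['depth'] += 1
        if state['depth'] > max_depth:
            raise TooDeepError('nesting deeper than %d levels' % max_depth)
        state['deepest'] = max(state['deepest'], state['depth'])

    def on_end(name):
        state['depth'] -= 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(xml_bytes, True)  # True marks this as the final chunk
    return state['deepest']


if __name__ == '__main__':
    print(check_max_depth(b'<a><b><c/></b></a>'))  # prints 3
```

A caller would run this first and only hand the bytes to a huge_tree parser if it passes. It does mean every message gets parsed twice.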

My full list of dependencies, from a dev install of libtaxii in a Windows venv with Python 3.7.2:

$ pip freeze -l
alabaster==0.7.12
Babel==2.7.0
bumpversion==0.5.3
certifi==2019.3.9
chardet==3.0.4
colorama==0.4.1
docutils==0.14
idna==2.8
imagesize==1.1.0
importlib-metadata==0.16
Jinja2==2.10.1
-e git+https://github.com/TAXIIProject/libtaxii.git@7753399103b97e12af0d8e9f011ccf5be09a7842#egg=libtaxii
lxml==4.3.3
MarkupSafe==1.1.1
pluggy==0.12.0
py==1.8.0
Pygments==2.4.2
pytest==3.0.7
python-dateutil==2.8.0
pytz==2019.1
requests==2.22.0
six==1.12.0
snowballstemmer==1.2.1
Sphinx==1.6.1
sphinx-rtd-theme==0.2.4
sphinxcontrib-websupport==1.1.2
tox==2.7.0
typing==3.6.6
urllib3==1.25.3
virtualenv==16.6.0
zipp==0.5.1