MarkDavidson opened this issue 11 years ago
This issue is again reproducible with the latest libtaxii.
Decide if there is any downside to XMLParser(huge_tree=True) always being on
I think there is.
I tried it, and with some deeply nested content I could make Python 2.7.15 and Python 3.7.2 on Windows silently crash and return a non-zero exit code. There is no Python traceback and no crash message; the interpreter just quits.
So enabling huge_tree wouldn't be ideal if libtaxii were used in a web server that received and attempted to parse untrusted messages - this would be a simple vector for DoS.
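One way to keep huge_tree on while blunting that vector is to pre-scan untrusted input with a streaming parser and reject anything nested too deeply, before it ever reaches the tree-building parser. Below is a minimal sketch of that idea using the stdlib's expat bindings; it is not part of libtaxii, and the `DepthGuard` name and 1000-level default are illustrative choices:

```python
import xml.parsers.expat


class DepthGuard(object):
    """Stream through a document with expat and fail fast if element
    nesting exceeds `limit`, before a huge_tree parser ever sees it."""

    def __init__(self, limit=1000):
        self.limit = limit
        self.depth = 0
        self.deepest = 0

    def _start(self, name, attrs):
        # Called for every opening tag; track current and maximum depth.
        self.depth += 1
        if self.depth > self.deepest:
            self.deepest = self.depth
        if self.depth > self.limit:
            raise ValueError('element nesting exceeds %d levels' % self.limit)

    def _end(self, name):
        self.depth -= 1

    def check(self, xml_bytes):
        # expat parses iteratively, so deep nesting cannot blow the C stack
        # here the way it can during tree construction.
        parser = xml.parsers.expat.ParserCreate()
        parser.StartElementHandler = self._start
        parser.EndElementHandler = self._end
        parser.Parse(xml_bytes, True)
        return self.deepest
```

A guarded server would call `DepthGuard(limit=...).check(body)` and return an error on `ValueError` instead of parsing; the cost is one extra streaming pass over the input.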
Here is the script to reproduce:
from __future__ import print_function

from lxml import etree

import libtaxii.messages_11 as tm11
import libtaxii.common

if __name__ == '__main__':
    num_levels = 100 * 1000
    inbox_msg_bytes = tm11.InboxMessage(
        message_id=tm11.generate_message_id(),
        content_blocks=[
            tm11.ContentBlock(
                content_binding='urn:example.com:huge_tree_issue:18',
                content=etree.Element('deep-nest'),
            )
        ]
    ).to_xml().replace(
        b'<deep-nest/>',
        (b'<x>' * num_levels) + (b'</x>' * num_levels)
    )

    print('XML is %d bytes long' % (len(inbox_msg_bytes),))
    print(inbox_msg_bytes[:400])
    print('...')
    print(inbox_msg_bytes[-400:])

    libtaxii.common.set_xml_parser(
        etree.XMLParser(
            attribute_defaults=False,
            dtd_validation=False,
            load_dtd=False,
            no_network=True,
            ns_clean=True,
            recover=False,
            remove_blank_text=False,
            remove_comments=False,
            remove_pis=False,
            strip_cdata=True,
            compact=True,
            # collect_ids=True,
            resolve_entities=False,
            huge_tree=True,  # <-- This is the only non-default difference
        )
    )

    msg = tm11.get_message_from_xml(inbox_msg_bytes)
    print('Loaded %r' % (msg,))
Example output:
XML is 700442 bytes long
b'<taxii_11:Inbox_Message xmlns:taxii="http://taxii.mitre.org/messages/taxii_xml_binding-1" xmlns:taxii_11="http://taxii.mitre.org/messages/taxii_xml_binding-1.1" xmlns:tdq="http://taxii.mitre.org/query/taxii_default_query-1" message_id="153575343088057277"><taxii_11:Content_Block><taxii_11:Content_Binding binding_id="urn:example.com:huge_tree_issue:18"/><taxii_11:Content><x><x><x><x><x><x><x><x><x>'
...
b'/x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></x></taxii_11:Content></taxii_11:Content_Block></taxii_11:Inbox_Message>'
shell returned 127
Note how it never prints the "Loaded" message, and the shell reports exit code 127, indicating some kind of crash.
The script runs fine if you reduce num_levels from 100,000 to 10,000, but a 700 KB input file is not "too big" to exclude on size alone.
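The size arithmetic bears that out: the nested span alone accounts for 700,000 of the reported 700,442 bytes, so it is the nesting depth, not the byte count, that triggers the crash.

```python
num_levels = 100 * 1000

# Each opening tag <x> is 3 bytes, each closing tag </x> is 4 bytes.
nested_span = (b'<x>' * num_levels) + (b'</x>' * num_levels)

print(len(nested_span))  # 700000; the TAXII wrapper accounts for the remaining 442 bytes
```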
My full list of deps, from a dev install of libtaxii in a Windows venv with Python 3.7.2:
$ pip freeze -l
alabaster==0.7.12
Babel==2.7.0
bumpversion==0.5.3
certifi==2019.3.9
chardet==3.0.4
colorama==0.4.1
docutils==0.14
idna==2.8
imagesize==1.1.0
importlib-metadata==0.16
Jinja2==2.10.1
-e git+https://github.com/TAXIIProject/libtaxii.git@7753399103b97e12af0d8e9f011ccf5be09a7842#egg=libtaxii
lxml==4.3.3
MarkupSafe==1.1.1
pluggy==0.12.0
py==1.8.0
Pygments==2.4.2
pytest==3.0.7
python-dateutil==2.8.0
pytz==2019.1
requests==2.22.0
six==1.12.0
snowballstemmer==1.2.1
Sphinx==1.6.1
sphinx-rtd-theme==0.2.4
sphinxcontrib-websupport==1.1.2
tox==2.7.0
typing==3.6.6
urllib3==1.25.3
virtualenv==16.6.0
zipp==0.5.1
Libtaxii does not currently support parsing of huge XML files. There are two aspects to this issue:
Ref:
http://stackoverflow.com/questions/11850345/using-python-lxml-etree-for-huge-xml-files
http://lxml.de/api/lxml.etree.XMLParser-class.html
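For reference, the streaming approach discussed in the Stack Overflow link above can be sketched with the stdlib's iterparse (the lxml version referenced there has the same shape). This is an illustration of the technique, not a libtaxii API:

```python
import io
import xml.etree.ElementTree as ET


def count_elements(stream):
    """Stream-parse a document, clearing each element after its 'end'
    event so memory stays bounded regardless of document size."""
    count = 0
    for event, elem in ET.iterparse(stream, events=('end',)):
        count += 1
        elem.clear()  # drop the subtree we have already processed
    return count


print(count_elements(io.BytesIO(b'<root><a/><b/></root>')))  # 3
```

This trades random access to the tree for bounded memory, which is why it suits "huge file" inputs better than building a full tree with huge_tree=True.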