Rogdham / bigxml

Parse big xml files and streams with ease
https://bigxml.rogdham.net/
MIT License
27 stars 3 forks source link

Undefined entity exception when using custom entities defined in external doctype #3

Closed GitMew closed 1 year ago

GitMew commented 1 year ago

I'm essentially having the exact problem as posed in this thread. The XML file I'm trying to parse starts out as

<?xml version="1.0"?>
<!DOCTYPE lex SYSTEM "tstlex.dtd">
<lex id="tstlex">
  <entry type="lem" lid="264354">
    <lem>&Acirc;ld</lem>

and the accompanying .dtd file has a line that says <!ENTITY Acirc "&#194;">. I'm running something similar to the tutorial:

@xml_handle_element("lex", "entry")
@dataclass
class Entry:
    lemma: str = 'N/A'

    @xml_handle_element("lem")
    def handle_lem(self, node):
        self.lemma = node.text

if __name__ == "__main__":
    with open(filepath, "rb") as f:
        for item in Parser(f).iter_from(Entry):
            print(item)

and getting the exception bigxml.exceptions.BigXmlException: Undefined entity &Acirc;: line 5, column 9.

How do I pass the information in the .dtd file to bigxml's Parser?

GitMew commented 1 year ago

One fix seems to be to embed the .dtd entities inside the XML, like

<?xml version="1.0"?>
<!DOCTYPE lex [
<!ENTITY Acirc "&#194;">
]>
<lex id="tstlex">
  <entry type="lem" lid="264354">
    <lem>&Acirc;ld</lem>

... and to then avoid the BigXmlException: Entity definition is forbidden exception, modify the __init__ method in bigxml/parser.py by including forbid_entities=False.

Rogdham commented 1 year ago

Hello and thank you for your report and investigations.

I think that you are right, the entity must be embedded in the file and cannot be loaded from an external DTD file.

And then the entity definitions are blocked due to security concerns (BigXML aims to be secure from usual attacks against XML parsers, and this would allow attacks such as billion laughs attack).


Now with all that said and done, it does not solve your issue in a constructive way. Please tell me what you think about the below!

1. Providing a new parameter to Parser

This parameter would allow for entity definitions, so that you don't need to modify bigxml/parser.py directly.

2. Programatically insert the external DTD file

Some code as the following would allow you to manually replace the <!DOCTYPE lex SYSTEM "tstlex.dtd"> part with the content of the DTD file (here we assume that the doctype is alone on the second line like in your example, but you get the idea):

def get_data(filepath):
    with open(filepath, "rb") as f_src:
        # first line
        yield next(f_src)

        # second line
        # consume the line
        assert next(f_src).startswith(b"<!DOCTYPE lex")
        # include the content of the DTD file instead
        yield b"<!DOCTYPE lex ["
        with open("tstlex.dtd", "rb") as f_dtd:
            yield f_dtd
        yield b"]>"

        # rest of file
        yield f_src

# use it
for item in Parser(get_data("file.xml")).iter_from(Entry):
    print(item)

This is not ideal, but could avoid modifying the XML file by hand.

GitMew commented 1 year ago

I hadn't thought of such a solution, but indeed, even though there is no explicit way to pass a .dtd file's content, you can solve that by pointing the reading head to a different file and pretend as if it is inside the same file.

One detail that your code snippet should address, however, is that the .dtd file can have any name and be declared with any tag name. lex is just the tag name in my particular case. It wouldn't work in general. The include statement seems to be something like

<!DOCTYPE {toplevel-tagname} SYSTEM "{dtd-filename}">

where the second field can perhaps even be extracted automatically and hence the user wouldn't have to specify the .dtd file themselves.

Rogdham commented 1 year ago

You are right, some fine-tuning of my code snippet would be needed. I just wanted to illustrate that you could write something leveraging the flexibility in the kinds streams passed to Parser to make it work. More info about the streams kinds in the documentation.

Starting from version 0.10.0 of BigXML (just released), a parameter insecurely_allow_entities can be added to the Parser call to allow entities defined in doctypes to be used! This way, no need to modify Parser.__init__ by hand anymore.


>>> @xml_handle_element("root")
... def handle_root(node):
...     yield node.text

>>> XML = b'<!DOCTYPE root [<!ENTITY pi "&#960;">]><root>&pi;</root>'

>>> Parser(XML, insecurely_allow_entities=True).return_from(handle_root)
'π'

I believe you now have the right building blocks in your hands to handle your usecase :rocket: