Closed GitMew closed 1 year ago
One fix seems to be to embed the .dtd entities inside the XML, like
<?xml version="1.0"?>
<!DOCTYPE lex [
<!ENTITY Acirc "Â">
]>
<lex id="tstlex">
<entry type="lem" lid="264354">
<lem>Âld</lem>
... and to then avoid the BigXmlException: Entity definition is forbidden
exception, modify the __init__
method in bigxml/parser.py
by including forbid_entities=False
.
Hello and thank you for your report and investigations.
I think that you are right, the entity must be embedded in the file and cannot be loaded from an external DTD file.
And then the entity definitions are blocked due to security concerns (BigXML aims to be secure from usual attacks against XML parsers, and this would allow attacks such as billion laughs attack).
Now with all that said and done, it does not solve your issue in a constructive way. Please tell me what you think about the below!
1. Providing a new parameter to Parser
This parameter would allow for entity definitions, so that you don't need to modify bigxml/parser.py
directly.
2. Programatically insert the external DTD file
Some code as the following would allow you to manually replace the <!DOCTYPE lex SYSTEM "tstlex.dtd">
part with the content of the DTD file (here we assume that the doctype is alone on the second line like in your example, but you get the idea):
def get_data(filepath):
with open(filepath, "rb") as f_src:
# first line
yield next(f_src)
# second line
# consume the line
assert next(f_src).startswith(b"<!DOCTYPE lex")
# include the content of the DTD file instead
yield b"<!DOCTYPE lex ["
with open("tstlex.dtd", "rb") as f_dtd:
yield f_dtd
yield b"]>"
# rest of file
yield f_src
# use it
for item in Parser(get_data("file.xml")).iter_from(Entry):
print(item)
This is not ideal, but could avoid modifying the XML file by hand.
I hadn't thought of such a solution, but indeed, even though there is no explicit way to pass a .dtd file's content, you can solve that by pointing the reading head to a different file and pretend as if it is inside the same file.
One detail that your code snippet should address, however, is that the .dtd file can have any name and be declared with any tag name. lex
is just the tag name in my particular case. It wouldn't work in general. The include statement seems to be something like
<!DOCTYPE {toplevel-tagname} SYSTEM "{dtd-filename}">
where the second field can perhaps even be extracted automatically and hence the user wouldn't have to specify the .dtd file themselves.
You are right, some fine-tuning of my code snippet would be needed. I just wanted to illustrate that you could write something leveraging the flexibility in the kinds streams passed to Parser
to make it work. More info about the streams kinds in the documentation.
Starting from version 0.10.0 of BigXML (just released), a parameter insecurely_allow_entities
can be added to the Parser
call to allow entities defined in doctypes to be used! This way, no need to modify Parser.__init__
by hand anymore.
>>> @xml_handle_element("root")
... def handle_root(node):
... yield node.text
>>> XML = b'<!DOCTYPE root [<!ENTITY pi "π">]><root>π</root>'
>>> Parser(XML, insecurely_allow_entities=True).return_from(handle_root)
'π'
I believe you now have the right building blocks in your hands to handle your usecase :rocket:
I'm essentially having the exact problem as posed in this thread. The XML file I'm trying to parse starts out as
and the accompanying
.dtd
file has a line that says<!ENTITY Acirc "Â">
. I'm running something similar to the tutorial:and getting the exception
bigxml.exceptions.BigXmlException: Undefined entity Â: line 5, column 9
.How do I pass the information in the
.dtd
file tobigxml
'sParser
?