SWI-Prolog / packages-sgml

The SWI-Prolog SGML/XML/HTML parser

Add ignore_doctype/1 option to parser #95

Closed: thetrime closed this issue 1 year ago

thetrime commented 1 year ago

XML is a bit of a poison pill: if you implement it according to the spec, you end up with all kinds of security headaches. Many of these can be mitigated, but one way to cut out a large chunk of the potential risk is simply to ignore embedded DOCTYPE declarations and force the DTD to be specified by the caller (if required).

The documentation for load_structure says that (emphasis mine):

    The Options list controls the conversion process. Currently defined options are below. Other options are passed to sgml_parse/2.
    ...
    dtd(?DTD)
        Reference to a DTD object. If specified, the <!DOCTYPE ...> declaration is ignored and the document is parsed and validated against the provided DTD. If provided as a variable, the created DTD is returned. See section 3.5.

However, loading the following document with this query:

    new_dtd(foo, DTD), load_structure('<path to file>', S, [dtd(DTD)]).

gives me this binding:

    S = [element(foo, [], [lollollollollollollollollollol])]

which shows that the DOCTYPE isn't being ignored: the entity &lol1; declared in the document's internal subset is still expanded into ten copies of "lol".

<?xml version="1.0"?>
<!DOCTYPE foo [
<!ELEMENT foo (#PCDATA)>
<!ENTITY lol "lol">
<!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">]>
<foo>&lol1;</foo>
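
For anyone who wants to reproduce this without creating a file, here is a minimal sketch that feeds the same document to load_structure/3 from an in-memory stream. The predicate names lol_lines/1, load_from_text/3 and repro/1 are made up for this example, and it assumes a SWI-Prolog 7 or later system where open_string/2 is available. I have left out the expected binding on purpose, since it depends on whether the parser version in use honours the embedded DOCTYPE.

```prolog
:- use_module(library(sgml)).

% The document from this report, one line per list element.
% lol_lines/1 is a hypothetical helper, not part of library(sgml).
lol_lines([ '<?xml version="1.0"?>',
            '<!DOCTYPE foo [',
            '<!ELEMENT foo (#PCDATA)>',
            '<!ENTITY lol "lol">',
            '<!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">]>',
            '<foo>&lol1;</foo>'
          ]).

% Parse Text against the caller-supplied DTD, closing the stream afterwards.
load_from_text(Text, DTD, DOM) :-
    setup_call_cleanup(
        open_string(Text, In),
        load_structure(stream(In), DOM, [dtd(DTD)]),
        close(In)).

% Create an empty DTD for doctype foo, parse, and always free the DTD.
repro(DOM) :-
    lol_lines(Lines),
    atomic_list_concat(Lines, '\n', Text),
    setup_call_cleanup(
        new_dtd(foo, DTD),
        load_from_text(Text, DTD, DOM),
        free_dtd(DTD)).
```

Running `?- repro(DOM).` and inspecting the content of the foo element shows whether the internal subset was processed: if the text is the expanded run of "lol", the DOCTYPE was not ignored.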

I'll provide some code to definitively turn off DOCTYPE processing, defaulting to the existing behaviour for backward compatibility.