Open dwcramer opened 2 years ago
Thanks for the suggestion, @dwcramer - this should be perfectly possible, and I'll build it into the implementation of #2.
An extension of this would be to report out the internal subset itself as XML, in the manner of https://github.com/AndrewSales/dtd2xml.
For example, if I run that tool on this document:
<!DOCTYPE html [
<!ENTITY % foo SYSTEM "c:/temp/bar.ent">
%foo;
]>
<html/>
it produces (assuming bar.ent contains just <!ELEMENT foo EMPTY>), for example:
<dtd>
<element id="foo" name="foo">
<model>EMPTY</model>
<parents></parents>
</element>
<externalEntity name="foo" systemId="c:/temp/bar.ent"/>
</dtd>
which may be useful information to have as XML per se, or a downstream process could format these to suit and supply the input envisaged in #5.
That would be wonderful! I think you'd want an xml:base attribute on that dtd element so you know where each report came from when doing downstream processing. I'm imagining a use case where you generate a report from a tree of files, modify the report via XSLT, then use doctype-tool to reapply the doctypes to the files.
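To make the xml:base idea concrete, here is a minimal Python sketch of stamping a report with the URI of the file it came from. The function name `add_xml_base` and the sample URI are assumptions for illustration only; this is not how doctype-tool or dtd2xml actually works, just one way a wrapper script might do it.

```python
import xml.etree.ElementTree as ET

# The reserved namespace that the "xml:" prefix is always bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def add_xml_base(report_xml: str, base_uri: str) -> str:
    """Return the report with xml:base set on its root element.

    `report_xml` is a <dtd> report like the one shown above;
    `base_uri` identifies the document the report was generated from.
    """
    root = ET.fromstring(report_xml)
    # ElementTree uses {namespace}local notation for qualified names
    # and serializes this particular namespace with the "xml" prefix.
    root.set(f"{{{XML_NS}}}base", base_uri)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    report = (
        "<dtd>"
        '<externalEntity name="foo" systemId="c:/temp/bar.ent"/>'
        "</dtd>"
    )
    # Hypothetical source location, for illustration only.
    print(add_xml_base(report, "file:///docs/chapter1.xml"))
```

A downstream XSLT could then group or filter reports by xml:base before the doctypes are reapplied.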
Note that entities can contain other unexpanded entities as well as XML. I suspect there will be some subtle issues to consider with nested entities, namespaces, and whitespace.
<!ENTITY foo "<p>this is a
para with whitespace that we can't mess with. And a random other &bar; entity
for good measure.
</p>">
<!ENTITY bar "<mml:math xmlns:mml='http://www.w3.org/1998/Math/MathML'/>">
See also https://en.wikipedia.org/wiki/Billion_laughs_attack
And of course once I have <externalEntity name="foo" systemId="c:/temp/bar.ent"/> I'll also want to be able to turn the contents of c:/temp/bar.ent into a report, but that means you would have to take catalog files into account...
Good points all, @dwcramer - thanks. I'll set to work on this and see where it leads.
Enhancement: One challenge with operating on internal subsets via sed or grep is that spacing can be all over the place across a set of XML files:
is the same as:
It would be useful if doctype-tool could format all these in a canonical way (e.g. 'one line per declaration with non-meaningful whitespace normalized') so that you can then write simple grep or sed commands to search or modify them across a tree of files. Both of the above examples when normalized would become:
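As a rough illustration of the kind of normalization meant here (a sketch only, not doctype-tool's behaviour), the Python below collapses whitespace in a single declaration while leaving quoted literals untouched, since whitespace inside an entity value can be significant. The function name `normalize_declaration` is made up for this example, and it assumes it is handed one complete declaration; comments and processing instructions that contain quote characters are not handled.

```python
import re

# A quoted literal: either "..." or '...'. Whitespace inside is preserved.
LITERAL = re.compile(r'("[^"]*"|\'[^\']*\')')


def normalize_declaration(decl: str) -> str:
    """Collapse runs of whitespace outside quoted literals to one space."""
    # Splitting on a capturing group keeps the literals in the result:
    # even indices are markup, odd indices are the quoted literals.
    parts = LITERAL.split(decl)
    normalized = [
        part if index % 2 else re.sub(r"\s+", " ", part)
        for index, part in enumerate(parts)
    ]
    return "".join(normalized).strip()


if __name__ == "__main__":
    messy = '<!ENTITY\n% foo\n   SYSTEM\n"c:/temp/bar.ent">'
    print(normalize_declaration(messy))
```

With every declaration on one line in this form, a plain grep for `SYSTEM "c:/temp/` would match regardless of how the original file was laid out.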