AndrewSales / doctype-tool

Tool for reporting and manipulating XML 1.0 document type declarations
Apache License 2.0

Provide option to canonicalize the internal subset #6

Open dwcramer opened 2 years ago

dwcramer commented 2 years ago

Enhancement: One challenge with operating on internal subsets via sed or grep is that spacing can be all over the place across a set of XML files:

<!DOCTYPE html "about:legacy-compat" [
       <!ENTITY % foo 
           SYSTEM "path/to/bar.ent"> %foo;
]>

is the same as:

<!DOCTYPE html "about:legacy-compat" [
  <!ENTITY 
   % foo 
  SYSTEM 
  "path/to/bar.ent">
 %foo;
]>

It would be useful if doctype-tool could format all these in a canonical way (e.g. 'one line per declaration with non-meaningful whitespace normalized') so that you can then write simple grep or sed commands to search or modify them across a tree of files. Both of the above examples when normalized would become:

<!DOCTYPE html "about:legacy-compat" [
       <!ENTITY % foo SYSTEM "path/to/bar.ent">
       %foo;
]>
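
To make the requested normalization concrete, here is a minimal Python sketch of "one line per declaration, whitespace outside quoted literals collapsed". It is not doctype-tool's implementation; the function name `canonicalize_internal_subset` and its `indent` parameter are hypothetical, and the sketch assumes the subset contains only markup declarations and parameter-entity references (no comments, processing instructions, or conditional sections).

```python
import re

# One token per markup declaration or parameter-entity reference.
# Quoted literals are matched as a unit so a '>' inside them does not
# end the declaration early.
TOKEN = re.compile(r"""
    <!(?:[^<>"']|"[^"]*"|'[^']*')*>   # markup declaration, e.g. <!ENTITY ...>
  | %[^\s;]+;                         # parameter-entity reference, e.g. %foo;
""", re.VERBOSE)

# Quoted literal; the capture group makes re.split keep the literals.
LITERAL = re.compile(r"""("[^"]*"|'[^']*')""")


def _collapse(decl):
    """Collapse whitespace runs outside quoted literals to one space."""
    parts = LITERAL.split(decl)
    # Odd indices hold the captured literals, which are left untouched.
    out = "".join(
        part if i % 2 else re.sub(r"\s+", " ", part)
        for i, part in enumerate(parts)
    )
    # Drop a stray space before the closing '>'.
    return out[:-2] + ">" if out.endswith(" >") else out


def canonicalize_internal_subset(subset, indent="  "):
    """Return the subset with one declaration or PE reference per line."""
    lines = []
    for match in TOKEN.finditer(subset):
        token = match.group(0)
        lines.append(indent + (_collapse(token) if token.startswith("<") else token))
    return "\n".join(lines)


if __name__ == "__main__":
    messy = '\n  <!ENTITY \n   % foo \n  SYSTEM \n  "path/to/bar.ent">\n %foo;\n'
    print(canonicalize_internal_subset(messy))
```

Fed the text between `[` and `]` from either of the two differently spaced examples above, this yields the same two lines, which is what makes a later grep or sed pass predictable.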

AndrewSales commented 2 years ago

Thanks for the suggestion, @dwcramer - this should be perfectly possible, and I'll build it into the implementation of #2.

An extension of this would be to report out the internal subset itself as XML, in the manner of https://github.com/AndrewSales/dtd2xml.

For example, if I run that tool on this document:

<!DOCTYPE html [
  <!ENTITY 
  % foo 
  SYSTEM
  "c:/temp/bar.ent">
 %foo;
]>
<html/>

it produces the following (assuming bar.ent contains just <!ELEMENT foo EMPTY>):

<dtd>
<element id="foo" name="foo">
<model>EMPTY</model>
<parents></parents>
</element>
<externalEntity name="foo" systemId="c:/temp/bar.ent"/>
</dtd>

which may be useful information to have as XML per se, or a downstream process could format these to suit and supply the input envisaged in #5.
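
As an illustration of such a downstream process, the following Python sketch reads a report shaped like the one above and pulls out the element models and external entities. It relies only on the element and attribute names visible in that sample output (`element`, `model`, `externalEntity`, `name`, `systemId`); the function name `summarize_report` is hypothetical, and the real report may carry more structure than is shown here.

```python
import xml.etree.ElementTree as ET

# Sample report copied from the output shown above.
REPORT = """<dtd>
<element id="foo" name="foo">
<model>EMPTY</model>
<parents></parents>
</element>
<externalEntity name="foo" systemId="c:/temp/bar.ent"/>
</dtd>"""


def summarize_report(report_xml):
    """Return (element name -> content model, entity name -> system ID)."""
    root = ET.fromstring(report_xml)
    elements = {e.get("name"): e.findtext("model") for e in root.iter("element")}
    entities = {e.get("name"): e.get("systemId") for e in root.iter("externalEntity")}
    return elements, entities


if __name__ == "__main__":
    elements, entities = summarize_report(REPORT)
    for name, model in elements.items():
        print(f"element {name}: {model}")
    for name, system_id in entities.items():
        print(f"external entity {name}: {system_id}")
```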

dwcramer commented 2 years ago

That would be wonderful! I think you'd want an xml:base attribute on that dtd element so you can know where each report came from when doing downstream processing. I'm imagining a use case where you generate a report from a tree of files, modify the report via XSLT, then use doctype-tool to reapply the doctypes to the files.
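
A rough Python sketch of the xml:base idea, assuming the report root is the `dtd` element shown earlier in the thread. The function name `stamp_base` and the example URI are hypothetical; this only illustrates recording where a report came from, not how doctype-tool would emit it.

```python
import xml.etree.ElementTree as ET

# The reserved XML namespace; ElementTree serializes it with the "xml" prefix.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def stamp_base(report_xml, source_uri):
    """Return the report with xml:base on its root set to source_uri."""
    root = ET.fromstring(report_xml)
    root.set(f"{{{XML_NS}}}base", source_uri)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    report = '<dtd><externalEntity name="foo" systemId="c:/temp/bar.ent"/></dtd>'
    # Hypothetical location of the document the report was generated from.
    print(stamp_base(report, "file:///docs/a.xml"))
```

With the source recorded on each report, an XSLT step can modify many reports at once and a later step can still tell which file each doctype belongs to.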

Note that entities can contain other unexpanded entities as well as XML. I feel like there will be some subtle issues to consider with nested entities, namespaces, and whitespace.

    <!ENTITY foo "<p>this is a
    para with whitespace that we can't mess with. And a random other &bar; entity
    for good measure.
    </p>">
    <!ENTITY bar "<mml:math xmlns:mml='http://www.w3.org/1998/Math/MathML'/>">

See also https://en.wikipedia.org/wiki/Billion_laughs_attack
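
One way to reason about nested entities without actually expanding them is to compute how large each entity would become. The Python sketch below does that for a plain mapping of internal entity names to replacement text; the function name `expanded_length` and the dictionary input are assumptions for illustration, and references to names not in the mapping (such as `&amp;`) are simply counted as written.

```python
import re

# General entity reference such as &bar; (character references like
# &#38; do not match because '#' cannot start a name here).
ENTITY_REF = re.compile(r"&([A-Za-z_:][\w.:-]*);")


def expanded_length(entities, name):
    """Length of entity `name` once all nested references are expanded.

    Raises ValueError if an entity refers to itself, directly or not.
    """
    memo = {}

    def visit(current, stack):
        if current in stack:
            raise ValueError(f"recursive entity reference: {current}")
        if current in memo:
            return memo[current]
        text = entities[current]
        total = len(text)
        for match in ENTITY_REF.finditer(text):
            ref = match.group(1)
            if ref in entities:
                # Swap the reference's own length for its expansion's.
                total += visit(ref, stack | {current}) - len(match.group(0))
        memo[current] = total
        return total

    return visit(name, frozenset())


if __name__ == "__main__":
    # A small "laughs"-style chain: each level repeats the previous ten times.
    entities = {"lol": "lol", "lol2": "&lol;" * 10, "lol3": "&lol2;" * 10}
    print(expanded_length(entities, "lol3"))
```

Because results are memoized, the cost stays proportional to the size of the declarations even when the expanded size grows exponentially, so a tool could refuse to expand anything past a chosen limit.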

dwcramer commented 2 years ago

And of course once I have <externalEntity name="foo" systemId="c:/temp/bar.ent"/> I'll also want to be able to turn the contents of c:/temp/bar.ent into a report, but that means you would have to take catalog files into account...
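
For the catalog part, a minimal Python sketch of looking up a system identifier in an OASIS XML catalog might look like this. It handles only exact-match `system` entries (not `rewriteSystem`, `nextCatalog`, or public identifiers), returns the `uri` attribute as written without resolving it against the catalog's location, and the function name `resolve_system_id` and the sample catalog contents are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of OASIS XML Catalogs.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

# Hypothetical catalog mapping the system ID from the example above.
CATALOG = """<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <system systemId="c:/temp/bar.ent" uri="entities/bar.ent"/>
</catalog>"""


def resolve_system_id(catalog_xml, system_id):
    """Return the catalog's uri for system_id, or system_id if unmapped."""
    root = ET.fromstring(catalog_xml)
    for entry in root.iter(f"{{{CATALOG_NS}}}system"):
        if entry.get("systemId") == system_id:
            return entry.get("uri")
    return system_id


if __name__ == "__main__":
    print(resolve_system_id(CATALOG, "c:/temp/bar.ent"))
```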

AndrewSales commented 2 years ago

Good points all, @dwcramer - thanks. I'll set to work on this and see where it leads.