Rename the CommonMark DTD `html` element ***pleeeeease***!

I have done so in my clone of cmark, and let the parser now use block_html as the name for this kind of node, in analogy to the existing inline_html generic identifier.

Not that I like the new name better than the old (IMO, they are too specific anyway), but it solves an obvious problem: every HTML and XHTML document on the planet (and in orbit on the ISS ;-) has an element of this name; and therefore you can't bring "native" CommonMark elements even into close proximity to regular HTML or XHTML documents (or content, or parse trees, ASTs, whatever you call it, for that matter).

Not to mention that this use of <html> is pretty confusing for humans too, like for everyone who has ever klicked on the "show document source" button in his browser and took a look.

I assume that the CommonMark DTD is not set in stone (yet), or are there important consumers of the cmark -t xml output (other than regression tests)? While there are other things I'd like to see changed in it, this one is---at least for me---absolutely crucial.

[Using the cmark -t xml output for verification and testing purposes has it's own problems; but that's a topic for another conversation I'd be glad to have.]

Best regards tin-pot

+++ Martin Hofmann [Oct 25 15 08:30 ]:

Not that I like the new name better than the old (IMO, they are too specific anyway),

Do you mean that you'd favor something like

<raw_block format="html">

Me too, actually.

but it solves an obvious problem: every HTML and XHTML document on the planet (and in orbit on the ISS ;-) has an element of this name; and therefore you can't bring "native" CommonMark elements even into close proximity to regular HTML or XHTML documents (or content, or parse trees, ASTs, whatever you call it, for that matter).

I'm not really sure how you are imagining them being combined. In a CommonMark HTML document, there's no risk of confusion, because any HTML <html> tag inside a raw HTML block will be escaped. In an HTML document, you wouldn't find CommonMark XML. So, what is the potential problem you see?

Do you mean that you'd favor something like <raw_block format="html">

Yes, exactly! I'm currently using an invented element type name (GI) "mark-up" for such things: character content that is to be processed by a specific tool---be it the CommonMark processor, maybe my Z Notation mark-up processor, or eventually the plain old browser.

I'd be okay with the raw_block name too, alas it is only a "raw block" from the CommonMark perspective---otherwise, it contains just a bunch of text in a specific notation (HTML, or generally XML/SGML in this case).

Me too, actually.

Great. About the attributes of such a thing: in my "laboratory", I currently use

notation: for the "kind-of-syntax" inside, which is probably synonymous with your format attribute. I borrowed the term notation from SGML, where it is used in exactly that sense: a label to describe character content inside elements which is not to be parsed and processed by the SGML parser. There is special DTD syntax to introduce such things; but the result is just the same: an "arbitrarily" named attribute contains a "tag" for the syntax inside.
display: with the only two values block and inline; that one is borrowed from CSS (ok, there it is a property, not an element attribute): indicating the requirement to treat this element either as a block (like the generic example <DIV>), or inline (like the generic example <SPAN>). (A couple of days ago I expected to name that attribute mode, but I prefer a CSS-"compatible" terminology now. When it comes to "inventing" rendering parameters, I'd always turn to CSS first for obvious reasons.)

Note that this attribute would eliminate the need to have both block_html and inline_html (and vice versa for code and quote) elements in CommonMark, for example).
label: an all-purpose attribute that could also be called infostring (or infostr!), or options, or pi for "processing instructions": it is meant for the respective processor (parsing the given syntax and transfroming it into structured text, ie element sequences, ie XML fragments); and one obvious source for the content of this attribute would be the "info string" attached to a fenced code block in CommonMark.

I'm not really sure how you are imagining them being combined. In a CommonMark HTML document, there's no risk of confusion, because any HTML <html> tag inside a raw HTML block will be escaped.

That's true: the cmark -t xml mode "protects" the (supposed?) HTML content inside the (currently called) html rsp inline_html elements, by "escaping" or entity-substituting "<" etc.

But that's no longer the case when we're not talking about HTML documents any more: I actually do use CommonMark elements and others (eg from HTML) combined, and do it not just in my imagination any longer ;-) --- Which brings us to the next question:

In an HTML document, you wouldn't find CommonMark XML. So, what is the potential problem you see?

It may well be that nobody came up with an idea as perverted as mine, but for my little project this tag name clash was the first and in fact---so far---the only show stopper.

Kidding aside: I'm today in fact more convinced that my concept is sound than ever (which in this case means: "as I was a couple of days ago…" ;-)

Let me explain the situation where the clashing html name turned out to be a problem: So far I have enough "little tools" to compose a pipeline like this one out of them:

cat input.xml | xmlin | cmark_filter | xmlout > output.xml

As you can see, at the input end of the pipeline some XML document is pushed in. The chain of tools does the following:

xmlin parses the XML input into an internal format (representing the input document's structure and contents, aka the ESIS, or the "XML InfoSet"). It uses the Expat XML parser library for this, but is nevertheless pretty lightweight: right now it is just under 100 KB for a all-in- all statically linked executable, needing neither Expat nor a C Runtime Library in a DLL.
cmark_filter is a "filter" process, based on the reference implementation's parser, which extracts plain text inside the XML content (wrapped in the mentioned <mark-up> elements, and tagged as character content in CommonMark syntax), parses it, generates XML elements from it, and then replaces the digested <mark-up> element with the resulting XML fragment. Which then gets passed on (via stdout) to the next tool in the pipeline:
xmlout is the "output end" of the pipeline: it transforms the internal format back into well-formed XML and (currently) does some other things too:
- translating CommonMark's "native" XML elements into HTML,
- or into XHTML,
- or "translate" the "native" elements into an SGML, document, governed by SGML DTD which does not exist yet.

Why all this effort to generate HTML or XHTML from CommonMark syntax, which the cmark and similar programs do routinely every day?

The answer is: flexibility.

Because the "internal" representation of XML content is the output format of the sgmls family of parsers, the input end of the pipeline can also be:

nsgmls input.xml 2>sg.err | cmark_filter | ...

nsgmls input.sgml 2>sg.err | cmark_filter | ...

where the input documents can be any XML or SGML document, using the full features of XML and SGML (like minimized mark-up, external entities, whatnot), and not just the rather simple form that xmlin can process (I actually don't know exactly which features could or could not be used in it's XML input right now ...).

And obviously another goal and property of this concept is that one could stuck in other processing tools too: in the near future, I will adapt my Z notation processor in the same way as I did with cmark for cmark_filter, to compose a pipeline like this one:

cat mkd-z-doc.xml | xmlin | cmark_filter | z-markup | xmlout > ...

and so on.

Of course one isn't bound to push XML or SGML documents into the front end: another tool that already exists is the plain- text analogon to xmlin, unsurprisingly called txtin, and used like this:

cat typescript.md | txtin | cmark_filter | ...

It really does nothing exciting but simply wraps the plain text content into one big <mark-up> element. But this is enough to allow processing "conventional" plain-text files through the same pipeline. (The txtin executable to do this is a 48 KB "big" Win32 executable, again statically linked.)

An generating, say: L_AT_EX instead of XML/XHTML etc, would be effected by replacing xmlout by something like texout, and so on. [ EDIT: _It looks like one could readily use sgmlsasp to turn the internal representation into_ L_AT_EX: probably one tool less to develop here :-) ]

At the output end, or near it, would also be the place to put document-oriented tools like a "table of contents-generator", or an "index collector" tool, or what you have---independent of the front end syntax like CommonMark, but dependent on the ability to recognize "hierarchy elements" in it's input stream: which are either <H1> etc from XML/HTML or <header> from CommonMark, and so on.

I think you can see that "possibilities are endless", as I would say in a rush of marketing euphoria: at the very least, there are some new ways of processing CommonMark made possible by this concept.

So that is the scenario where the name clash between the CommonMark element and the HTML root element turned out to be problem---but I don't want to bore you with the exact details of how and why (unless you ask for it ;-)

Kind regards

tin-pot

PS: And that's the background too why I'd prefer element type names in the CommonMark DTD which are usable with the SGML reference concret syntax too, without requiring another specific, custom-made SGML declaration:

Names are 8 or less characters long,
starting with an (uppercase or lowercase) letter A..Z;
followed by letters, decimal digits, or . and -;
element names are not case-sensitive (like in HTML), but
entity names (like > in HTML) are.

(If that seems too restrictive, ~~could the names at least conform~~[ I just looked it up and it turns out that the HTML declaration already allows "_" in names, and it also places no limit on name length---so please disregard my nagging here… ] to the SGML declaration of HTML, which is at least readliy available ...?)

But that's a preference of mine; I would be glad if you'd take it into account, but I could live without it as well (in contrast to the case of the html name!).

PPS: You can find the sources for the mentioned tools in my repository; they use a new library of mine called libesis to read and write the "internal representation" of document content (ie the ESIS), which is identical to the nsgmls output format.

+++ Martin Hofmann [Oct 26 15 11:55 ]:

So that is the scenario where the name clash between the CommonMark element and the HTML root element turned out to be problem---but I don't want to bore you with the exact details of how and why (unless you ask for it ;-)

In all those words, I still didn't get an explanation of why having an element called <html> in the CommonMark xml is a problem for you. And I'm still having trouble seeing how it would be, from the generic explanation of the workflow you gave.

I'm sorry: I'll try to give you the details why this problem occured:

Obviously the processors in the pipeline must forward some representation of the "document" from stage to stage, where each stage (ie processor) does it's own kind of "little" transformation: think of replacing plain text fragments with XML fragments in the cmark_filter case.

For (I think: good) reasons I decided to not equip each one of the tools with the ability to parse XML, let alone to parse SGML.

Instead, the "document" content flows from stage to stage in an "internal" representation: in the output format of the nsgmls parser.

Using this format has---in my view, as well as in my experience so far---several advantages:

It is a text-based format, which can use UTF-8 (and in my case, does always use UTF-8) for character encoding: no tool should need to transcode text by itself;
it is a very simple format, as the one-page description of it shows;
it can be very easily generated (look up the implementation of ESIS_Start(), ESIS_Cdata(), and so on in the libesisio libraray if you want to know all the details);
it is also pretty easy to "parse" on input (see the ESIS_ParseFile() and ESIS_FilterFiles() functions in the library);
and last but not least: due to the nature of this format, there is a strict separation between structure (ie start and end of elements, their nesting, and the element's attributes) and content (ie character data in SGML/XML parlance).

The last point implies: there is no "mark-up", and there is no need to "escape" any mark-up "active" characters like "<". (The only and single exception is the end-of-line marker: being a line-oriented format, occurances of this character U+000A must be "escaped", and in fact are "escaped" to the familiar "\n" escape sequence, which in turn implies that backslash U+005C is also escaped, to "\\". But that's it, and it has nothing to do with XML's mark-up.)

As a consequence, a tool like txtin need only to "escape" the new-line character, and can simply put the plain text as it is into the internal representation.

And cmark_filter does similarly for "foreign" content, like content

in fenced or indented plain text input,
as well as inline or block HTML input text.

The first and only place where "escaping" the XML mark-up characters "<" and so on comes into play is at the end of the pipeline: where xmlout mixes the character content and the structure (into the form of XML mark-up) together again into a single text file in regular XML format.

This simplifies things a lot in my experience so far, but another consequence of all this is: the HTML content which cmark_filter passes on in it's output looks exactly the same as the root element from an (X)HTML document that happens be the first input into the pipeline: it is a regular element, with the GI "html", and when they arrive at xmlout, the html element emanating form cmark_filter and the root html element that was already in the HTML input look nearly the same:

the content of the (X)HTML document's root element was parsed already by xmlin (or sgmls), and has element structure,
whereas the content of an "html" element output by cmark_filter has only character content, but marked-up as HTML.

What xmlout does is rather a form of not doing anything: in the first case, the element structure inside the html root element is "transformed" into HTML tags, but in the second case too: xmlout transforms the start and end of the CommonMark html element into <html> and </html> start and end tags, and would transform the character content found inside it into the HTML by "escaping" it, just as it does for character content found inside a "real" "<html>" root element.

Which would mean that the character content generated by CommonMark inside the html element either

gets "entity-encoded" one time too many (if cmark_filter does not entity-encode it, but xmlout would and must do so); or
even gets "entity-encoded" two times too many (if cmark_filter would already have "entity-encoded" the character content inside html, as cmark does, and then xmlout would and must do the same again).

The only solution I can see for this dilemma, and a generic solution too, is that cmark_filter

uses a specific element for HTML mark-up found in the CommonMark input,
wraps the plain-text input fragment classified as (block or in-line) HTML verbatim inside this element, without any "escaping",
and xmlout or whatever does the un-packing of the internal format does know about the specific element, namely our <mark-up> element, and will not "entity-encode" it's character data content.

And the generic solution is that xmlout or a similar tool can see that the content inside <mark-up> needs not processing at all, because the attribute type (or syntax or whatever name is used for it) says so explicitly: "this piece of text is in HTML format".

The same would hold for cmark_filter itself, but kind-of in reverse: this filter will only process the plain-text character content of <mark-up> elements where it is announced that "this piece of text is in CommonMark syntax" in the same attribute of <mark-up>.

I'm very sorry about my long-winded explanations, but the what happens when using the chosen format for passing document content from tool to tool is in fact rather different from what would have to happen if regular XML marked-up format would be used.

On the other hand---thinking about it: why does cmark -t xml entity-encode HTML fragments in it's "native" XML output, but cmark -t xhtml for example will not?

If your answer is "in order to keep the output XML conforming to the CommonMark DTD" (which is the only reason I can think of): there you have your case where "mixing" HTML and CommonMark element types would seem rather natural, in my opinion. And is exactly what happens in the "mark-up processing pipeline" implementation.

I hope that this explanation does make some sense, and I'd like to thank you for your interest in my "use case".

The XML output should simply use an XML namespace. This would avoid any ambiguities between element names.

+++ Nick Wellnhofer [Dec 20 15 06:02 ]:

The XML output should simply use an XML namespace. This would avoid any ambiguities between element names.

True, but it would also make the output more verbose and harder to read. Given the purposes this XML format serves, is it really necessary?

If you specify the default namespace, the output does not change except for the one attribute:

<document sourcepos="1:1-11:0" xmlns="http://commonmark.org/xml/">
  <header level="2" sourcepos="1:1-1:17">

In this example, the document and header elements belong to the namespace without it being explicitly defined as cmark:document (which corresponds to xmlns:cmark="...").

Well, this seems a no-brainer then. I can modify the c and js implementations to add the xmlns attribute. What needs to be done in the DTD itself?

+++ Kārlis Gaņģis [Dec 22 15 12:05 ]:

If you specify the default namespace, the output does not change except for the one attribute:

In this example, the document and header elements belong to the namespace without it being explicitly defined as cmark:document (which corresponds to xmlns:cmark="..."). — Reply to this email directly or [1]view it on GitHub. References 1. https://github.com/jgm/cmark/issues/87#issuecomment-166717323

Has this come to any conclusion that validates against the DTD? I am especially interested in the suggestion to use Xml namespaces - which does not appear to be possible, for DTDs have no notion of namespaces and thus CommonMark.dtd will not accept custom xmlns attributes added somewhere in the document.

Excuse me if I chime in for a moment.

If you want to abide by these constraints:

having DTD declarations for "customization" elements and
a customized parser intended to produce output that is
valid (in the W3C XML sense) wrt this DTD,

then the best you can do, as far as I can see, is to separate out the "customization" parts from the "fixed" CommonMark DTD parts, using hacks - or "techniques" - like this. (Beware, untested!):

In some customize.dtd:

<!-- Declare elements for custom block and inline content -->

<!ELEMENT my:bookmark EMPTY>
<!ATTLIST my:bookmark
          xmlns:my    CDATA #FIXED     "http://my.example.org"
          bookmarkId  ID    #REQUIRED
          title       CDATA #IMPLIED >

<!-- Specify content model alternative for custom block -->

<!ENTITY % cust.block "my:bookmark" >

<!-- Specify content model alternative for custom inline -->

<!-- None given, use default -->

(I think one could extend this scheme to also add "customized" attributes on the <custom_block> and <custom_inline> elements.)

In the CommonMark.dtd:

<!-- Draw in customization part of DTD -->

<!-- **Comment out these two lines if not needed** -->
<!ENTITY % cust SYSTEM "customize.dtd">
%cust;

... <!-- regular DTD stuff, define `inline` PE etc  --> ...

<!-- Provide defaults if no customization was given -->
<!ENTITY % cust.block  "dummy" >
<!ENTITY % cust.inline "dummy" >

<!-- Content model might be customized in %cust; ... -->
<!ELEMENT custom_block ((%inline;|%block;|item)* | %cust.block; ) >

<!-- Content model might be customized in %cust; ... -->
<!ELEMENT custom_inline ((%inline;)* | %cust.inline; ) >

If I understand this comment right, then these "customized" elements occuring in the parser output are produced by a special-purpose parser to begin with in the given scenario, so their nature and number won't change that often, right?

Alternatively, one could simply declare in CommonMark.dtd:

<!-- Draw in customization elements for use in <custom_block> -->

<!-- **Comment out these two lines if not needed** -->
<!ENTITY % cust SYSTEM "customize.dtd">
%cust;

<!-- Content can also contain elements from %cust; ... -->
<!ELEMENT custom_block ANY >

You could even let the parser place the first part into the internal subset of the output XML document:

<?xml version="1.0" charset="utf-8">
<!DOCTYPE document SYSTEM "CommonMark.dtd" [
  <!-- Draw in customization elements -->
  <!ENTITY % cust SYSTEM "customize.dtd">
  %cust;
]>
<document>
  ...
</document>

This would allow the CommonMark.dtd to be agnostic about any external entities containing "customization" element declarations, and how to name and include them.

But in any case, this ANY hack wouldn't help much if you wanted meaningful validation ;-)

Other alternatives could be using not (only) a (non-validating) XML parser but a "WebSGML" parser, or of course to validate not (only) based on DTD notation, but some other DSDL like XML Schema, RELAX-NG or Schematron ...

Note that the original purpose of the <raw_block> element (or whatever name one prefers) that I tried to describe in my walls of text above was quite different: This was a scenario where

the parser's output DTD is fixed (and not customizable),
the XML output is always valid wrt to this DTD, and
the parser itself is "off-the-shelf" and fixed.

The goal was - among others - to write "foreign notation" in the input text like this:

Here are *important* formulae: `tex|0<1` true, `eqn|1 over 0` undefined.

And have this output as <raw_block> elements:

<text>Here are </text>
<emph>
  <text>important</text>
</emph>
<text> formulae: </text>
<raw_block notation="tex" display="inline">0&lt;1</raw_block>
<text> true, </text>
<raw_block notation="eqn" display="inline">1 over 0</raw_block>
<text> undefined.</text>

Here the <raw_block> element has always only character data content (though XML requires the content model to be declared as (#PCDATA) instead of just CDATA), so the issue with customized content models, additional element types and their names (and hence XML namespace use) does not occur.

Actual parsing and processing of this character data content is then done by one or more post-processors (associated with the notation attribute) working in pipeline fashion.

@fhaag I did as @Knagis suggested and added the xmlns attribute to the document element in the XML output. Also, html became html_block. Is this not sufficient? I don't see a problem anywhere.

@tin-pot Although there is a way to add raw content using custom_block and custom_inline, I'm fairly sympathetic to the idea of having raw_block and raw_inline elements that take attributes and CDATA. Indeed, if we had these we could get rid of custom_block, custom_inline, html_block, and html_inline.

However, if there are issues remaining, someone should open a new issue that targets them specifically. And it should be on jgm/CommonMark, not here -- this is for the C implementation.

@tin-pot That's an interesting hack. Users are probably restricted in their choice of a namespace prefix, and while not very standard-like, it may be acceptable.

However, it seems that all custom elements need to be declared in DTD format, as well - is that true? I suppose there is no way to use custom undeclared elements (undeclared as in, their formal declaration is not available at the time of processing the document), or also elements declared in an Xml Schema? I am not so much concerned about actually conducting validation, as about writing Xml that is tidy in the sense "If all schemas were available, it would validate."

For my current use-case, the solution that makes CommonMark.dtd agnostic about which other DTDs are going to be included sounds somewhat viable, although it would still require the DTD-like declaration of all custom elements.

I did as @Knagis suggested and added the xmlns attribute to the document element in the XML output. Also, html became html_block. Is this not sufficient? I don't see a problem anywhere.

@jgm This works fine for as long as we are dealing with a pure CommonMark AST document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document xmlns="http://commonmark.org/xml/1.0">
  <paragraph>
    <text>abc </text>
  </paragraph>
</document>

However, as soon as you only do as little as add an additional namespace declaration¹ (which is, after all, the point of using namespaces in the first place - to use two or more at a time and thereby distinguish elements that would otherwise have the same name), validation against the DTD fails:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document xmlns="http://commonmark.org/xml/1.0" xmlns:html="http://www.w3.org/1999/xhtml">
  <paragraph>
    <text>abc </text>
  </paragraph>
</document>

That is because, from the point of view of a DTD, xmlns and xmlns:... attributes are ordinary attributes that have to be declared in the DTD, not something special that users can add at will to import foreign definitions.

¹: I chose the HTML namespace because the central issue of this thread's OP seemed to be that they want to use HTML's <html> element somehow in combination with CommonMark AST Xml documents.

However, if there are issues remaining, someone should open a new issue that targets them specifically. And it should be on jgm/CommonMark, not here -- this is for the C implementation.

Fair enough, and I would like to follow up on any comments more specific to my concrete situation in the thread I opened. The only reason why I somewhat revived this thread is because I was not quite sure what to concretely deduce from the apparently simple conclusion

Well, this seems a no-brainer then.

OK, I think I now finally understand the point about the namespace.

From the point of view of CommonMark, I think it's fine if raw HTML has to be encoded, so I don't think anything here needs reopening.

That's an interesting hack. Users are probably restricted in their choice of a namespace prefix, and while not very standard-like, it may be acceptable.

Yes, XML Namespaces came after the XML specification, and have semantics that can't be reproduced in SGML. (Remember that XML is "an application profile or restricted form of SGML" - you can find all the subtle differences expertly described here). So DTDs and XML Namespace don't go together very well.

The usual workaround for using DTDs and namespaces together if needed is to settle for unique and constant prefixes (like myns), and include those (in QName element type names like myns:myelem) and any required attributes for NSAttNames (xmlns:myns or xmlns) in the DTD. This works as long as the XML document instance always uses the exact same GIs (SGML parlance) rsp QNames (XML Namespace parlance) as spelled out in the DTD, that is: either a PrefixedName <myns:myelem> or a UnprefixedName (ie LocalName) <myelem>, but is not mixing both.

Nowadays, a lot of different DTD notations (ISO 8879:1986/Cor.2:1999 speak) aka Document Schema Definition Languages (DSDLs, ISO/IEC 19757-1) like XML Schema (W3C), RELAX-NG (ISO/IEC 19757-2), Schematron (ISO/IEC 19757-3) are en vogue and can be used as alternatives to DTDs, alone or in combinations. There is even explicit support for "namespace-based validation" through a "Namespace-based Validation Dispatching Language" (NVDL, ISO/IEC 19757-4).

However, it seems that all custom elements need to be declared in DTD format, as well - is that true?

I think so, yes. But it might depend on your specific validator implementation and setup whether you can mix a DTD with, say, declarations given in some XML Schema. At least rxp, the DTD-validating XML parser I use, does require all the elements being declared in (some included part of) the DTD.

I suppose there is no way to use custom undeclared elements (undeclared as in, their formal declaration is not available at the time of processing the document),

Of course you can always process your XML document without a DTD or any formal declarations, it just has to be well-formed.

In James Clark's article I mentioned, you can find in the second SGML declaration (the one where "Web SGML Adaptations Annex to ISO 8879" appears in the heading, which means ISO 8879:1986/Cor.2:1999) the fragment

        IMPLYDEF
             ATTLIST YES
             DOCTYPE YES
             ELEMENT YES
             ENTITY YES
             NOTATION YES

These are feature options that tell a conforming parser (like OpenSP) to "imply" certain declarations in case they are missing in the DTD but needed for the document instance. For example, a missing declaration for an element type <myelem> would be implied to be

<!ELEMENT myelem - O ANY>

and the element in the document instance then parsed accordingly. But I'm not sure if that is what you are after ...

or also elements declared in an Xml Schema? I am not so much concerned about actually conducting validation, as about writing Xml that is tidy in the sense "If all schemas were available, it would validate."

If you know what declarations "all schemas" would actually contain, wouldn't that amount to requiring the document to be "type-valid", that is: adhering to all requirements expressed in these schemas - or do I missunderstand your "if" here?

Btw: are you writing XML or are you talking about the cmark-generated XML output that's supposed to be valid?

For my current use-case, the solution that makes CommonMark.dtd agnostic about which other DTDs are going to be included sounds somewhat viable, although it would still require the DTD-like declaration of all custom elements.

Depending on how often you have to add or modify those custom elements, this may well work. I sure hope so and would be glad if my comments helped ... :-)

[...] because the central issue of this thread's OP seemed to be that they want to use HTML's element somehow in combination with CommonMark AST Xml documents.

Kind of. While Markdown text could always include "HTML tags" and "HTML blocks", my setup takes the opposite direction: I want to have Markdown text as the character data content of some "customized" elements inside, for example, a HTML or DocBook document. Technically this amounts to using Markdown as a notation for character data. For example (this is not valid XHTML!):

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC
     "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" [
 <!NOTATION mkd PUBLIC "+//IDN commonmark.org//NOTATION CommonMark//EN">
 <!ELEMENT mark-up (#PCDATA)>
 <!ATTLIST mark-up
           display  (block|inline)          "inline"
           label    CDATA          #IMPLIED
           notation NOTATION (mkd) #FIXED   "mkd"    >
]>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><title>Markdown Example</title></head>
<body><p>Some example:</p>
<mark-up display="block"><![CDATA[

# A _Markdown_ example # 

This will **hopefully** be:

-   parsed as _Markdown_,
-   transformed into an AST,
-   rendered as XHTML.

]]></mark-up></body></html>

[This uses Markdown (mkd) as the single possible, and fixed, Notation used inside <mark-up> elements; it's obvious how this could accomodate additional Notations like "comma-separated values" (CSV), or Textile, or Creole, or EBNF (ISO/IEC 14977), or Z Notation (ISO/IEC 13568) etc.]

And during processing there are stages where (the equivalent of) well-formed, but un-typed, XML documents containing both the "host" element types and the CommonMark element types are handled. XML Namespaces would provide a solution for the potential name clashes, but not the only or simplest one.

This processing will, in the end, replace the <mark-up> element and all "native" CommonMark elements by the corresponding HTML (or DocBook or what you have) element(s) - that's basically the whole point.

commonmark / cmark

Rename the CommonMark DTD `html` element pleeeeease! #87

commonmark / cmark

Rename the CommonMark DTD `html` element ***pleeeeease***! #87

Rename the CommonMark DTD `html` element pleeeeease! #87