commonmark / commonmark-spec

CommonMark spec, with reference implementations in C and JavaScript
http://commonmark.org

How are `custom_inline` and `custom_block` supposed to be used? #433

Closed fhaag closed 5 years ago

fhaag commented 8 years ago

I am using CommonMark (concretely, the CommonMark.NET implementation) in a project that, at one point, stores the parsed AST as an Xml document. I am trying to keep these Xml files valid with respect to CommonMark.dtd.

On top of the standard-conformant CommonMark elements, I use a handful of additional features for integrating the documents with the host application. Now, I'm struggling to see how to represent these additional elements in the AST Xml.

The DTD defines two Xml elements for custom content, <custom_inline> and <custom_block>.

Is there an example of how to use these? It seems, for instance, <custom_inline> expects other inlines as its content, and no additional attributes. So, this is a valid fragment of such an AST Xml document:

<custom_inline>
    <text>Hello, </text>
    <emph>
        <text>World</text>
    </emph>
    <text>!</text>
</custom_inline>

Then, how can I indicate which custom inline element is used there? The custom_inline element does not seem to allow for any custom attributes or contents that can help distinguish different custom inline elements (let alone pass any other configuration for the custom inline). I see I can (ab?)use the on_enter and on_exit attributes to store whichever information I like, but that feels hacky.

While we're at it, a cleaner way of achieving this would be to represent custom blocks and inlines with custom Xml elements from another namespace. Unfortunately, the document element does not permit the declaration of custom namespaces by means of xmlns:... attributes. Maybe that is where something could be changed?

jgm commented 8 years ago

+++ Florian Haag [Nov 02 16 03:31 ]:


Then, how can I indicate which custom inline element is used there? The custom_inline element does not seem to allow for any custom attributes or contents that can help distinguish different custom inline elements (let alone pass any other configuration for the custom inline). I see I can (ab?)use the on_enter and on_exit attributes to store whichever information I like, but that feels hacky.

The conception was that custom elements were just a way to insert some literal content for a specific output format, kind of like RawBlock and RawInline in pandoc.

But to make this more flexible, we allowed them to contain nodes, and added two places where literal content could be inserted.

There's an example for syntax highlighting in lcmark (a lua wrapper): https://github.com/jgm/lcmark/blob/master/filters/highlight.lua Here the custom block is just used to insert some literal content in the target format (on_enter is used for this).

An example where you might want the full power: Suppose you wanted to emulate pandoc's treatment of an image alone in a paragraph as a figure. You could write a filter that finds images alone in their paragraphs, and replaces them with custom nodes whose node content is the same as the image's (this would be the image description), but with special literal text before (on_enter) and after (on_exit), e.g. a <figure> element in HTML, or a figure environment in LaTeX.
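A minimal sketch of such a filter, operating on the XML form of the AST with Python's standard xml.etree.ElementTree rather than on cmark nodes. The function name images_to_figures, the helper q, and the exact HTML placed in on_enter and on_exit are assumptions made for illustration; the sketch also assumes that a custom_block may hold the image's inline children directly, which is not checked against the DTD here.

```python
import xml.etree.ElementTree as ET
from html import escape

NS = "http://commonmark.org/xml/1.0"  # namespace used by the documents in this thread


def q(tag):
    """Qualify a tag name with the CommonMark XML namespace."""
    return "{%s}%s" % (NS, tag)


def images_to_figures(root):
    """Replace every paragraph that contains only an image with a custom_block.

    Hypothetical helper: the literal HTML below is one possible choice.
    """
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag != q("paragraph"):
                continue
            kids = list(child)
            if len(kids) != 1 or kids[0].tag != q("image"):
                continue
            image = kids[0]
            src = escape(image.get("destination", ""))
            figure = ET.Element(q("custom_block"))
            # Literal output written before and after the node's children.
            figure.set("on_enter", '<figure><img src="%s" /><figcaption>' % src)
            figure.set("on_exit", "</figcaption></figure>")
            # The image description becomes the content of the custom node.
            figure.extend(list(image))
            figure.tail = child.tail
            parent[index] = figure
```

A LaTeX variant would only change the two literal strings, for example opening and closing a figure environment instead of the HTML element.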

fhaag commented 8 years ago

Suppose you wanted to emulate pandoc's treatment of an image alone in a paragraph as a figure. You could write a filter that finds images alone in their paragraphs, and replaces them with custom nodes whose node content is the same as the image's (this would be the image description), but with special literal text before (on_enter) and after (on_exit), e.g. a <figure> element in HTML, or a \figure environment in LaTeX.

I see ... I'm not targeting a text-based output format, so it seems I need to encode a structured information container and put it into either on_enter or on_exit, from where I can unwrap it while reading the AST document. For instance, let's say I support a special link

[place a bookmark](bookmark:abc "Click here to place a bookmark")

in CommonMark-conformant Markdown that, rather than opening a URL, will set a bookmark in the host system with identifier abc. I can recognize this special type of link based upon the destination, but I want to do so in the stage of parsing Markdown to AST, not afterwards. Therefore, I need to store the parameters of the special link in a structured format, such as Xml:

<?xml version="1.0" encoding="UTF-8"?>
<my:bookmark xmlns:my="http://my.example.org" bookmarkId="abc" title="Click here to place a bookmark"/>

I cannot merge the relevant fragment directly into the AST Xml by describing my special link with my own element from a custom namespace, as the DTD does not allow that:

<?xml version="1.0" encoding="UTF-8"?>
<document xmlns="http://commonmark.org/xml/1.0" xmlns:my="http://my.example.org">
    <paragraph>
        <my:bookmark bookmarkId="abc" title="Click here to place a bookmark">
            <text>place a bookmark</text>
        </my:bookmark>
    </paragraph>
</document>

Or maybe like this (with an extra <custom_data> element that accepts arbitrary content):

<?xml version="1.0" encoding="UTF-8"?>
<document xmlns="http://commonmark.org/xml/1.0">
    <paragraph>
        <custom_inline>
            <custom_data>
                <my:bookmark xmlns:my="http://my.example.org" bookmarkId="abc" title="Click here to place a bookmark"/>
            </custom_data>
            <text>place a bookmark</text>
        </custom_inline>
    </paragraph>
</document>

Instead, I have to wrap my custom Xml document into one of the attributes of the <custom_inline> element:

<?xml version="1.0" encoding="UTF-8"?>
<document xmlns="http://commonmark.org/xml/1.0">
    <paragraph>
        <custom_inline on_enter="&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;&lt;my:bookmark xmlns:my=&quot;http://my.example.org&quot; bookmarkId=&quot;abc&quot; title=&quot;Click here to place a bookmark&quot;/&gt;">
            <text>place a bookmark</text>
        </custom_inline>
    </paragraph>
</document>

This seems somewhat ... ugly, but if it is truly the recommended way, I will consider using it.
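The wrap-and-unwrap round trip described above can be sketched in Python with the standard xml.etree.ElementTree module. The function names bookmark_id_from_destination, wrap_bookmark and unwrap_bookmark are hypothetical, the bookmark: scheme and the my.example.org namespace are taken from the example, and the sketch is not how CommonMark.NET does it.

```python
import xml.etree.ElementTree as ET

CM_NS = "http://commonmark.org/xml/1.0"
MY_NS = "http://my.example.org"  # namespace from the example above
SCHEME = "bookmark:"             # link destination prefix from the example above


def bookmark_id_from_destination(destination):
    """Return the bookmark id for a 'bookmark:...' destination, else None."""
    if destination.startswith(SCHEME):
        return destination[len(SCHEME):]
    return None


def wrap_bookmark(bookmark_id, title, label):
    """Build a custom_inline whose on_enter carries a serialized my:bookmark."""
    payload = ET.Element("{%s}bookmark" % MY_NS,
                         {"bookmarkId": bookmark_id, "title": title})
    node = ET.Element("{%s}custom_inline" % CM_NS)
    # The nested document is stored as a string; ElementTree escapes it
    # when the outer AST document is serialized.
    node.set("on_enter", ET.tostring(payload, encoding="unicode"))
    text = ET.SubElement(node, "{%s}text" % CM_NS)
    text.text = label
    return node


def unwrap_bookmark(node):
    """Return the bookmark parameters, or None if on_enter holds something else."""
    raw = node.get("on_enter")
    if not raw:
        return None
    try:
        payload = ET.fromstring(raw)  # second, nested parse
    except ET.ParseError:
        return None  # plain literal output, not a wrapped payload
    if payload.tag != "{%s}bookmark" % MY_NS:
        return None
    return {"id": payload.get("bookmarkId"), "title": payload.get("title")}
```

The cost of the approach shows up in unwrap_bookmark: every custom_inline needs a second parse, and a reader cannot tell a wrapped payload from ordinary literal output without attempting it.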

jgm commented 8 years ago

Agreed, that's ugly. As I explained, custom_inline/block were added to provide a way to pass through raw content, not to provide a flexible way of adding custom node types.

I suppose that if we wanted to make them more flexible, we could consider adding some mechanism for inserting custom attributes to a custom node. This would add a lot of complexity in the C implementation, but I suppose we could add this to the DTD without adding support for it to the C implementation.

If you could change the DTD to add this, how would you do it?

I'd be curious whether others have thoughts on this.

fhaag commented 8 years ago

If you could change the DTD to add this, how would you do it?

Thank you for the explanations. I don't know much about authoring DTDs, as I usually use Xml Schema for specifying Xml formats. From looking around on the web, though, it would seem that both of my proposed "tidier" Xml documents might prove problematic with DTDs:

So, I second your curiosity whether there are any further thoughts on this, possibly also with respect to this seemingly related thread.

jgm commented 8 years ago

I was thinking custom attributes could be like data-* attributes in HTML5. But here I see:

There is no HTML5 DTD. The HTML5 RC explicitly says this when discussing XHTML serialization, and this clearly applies to HTML serialization as well.

DTDs have been regarded by the designers of HTML5 as too limited in expressive power, and HTML5 validators (basically the HTML5 mode of http://validator.nu and its copy at http://validator.w3.org) use schemas and ad hoc checks, not DTD-based validation.

Moreover, HTML5 has been designed so that writing a DTD for it is impossible. For example, there is no SGML way to capture the HTML5 rule that any attribute name that starts with “data-” and complies with certain general rules is valid. In SGML, attributes need to be listed individually, so a DTD would need to be infinite.

So I guess we can't have custom attributes. Perhaps the best we could do would be defining several generic attributes like attribute1, attribute2, etc.? That doesn't seem too nice.
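To make the generic-attribute idea concrete, here is a small Python sketch of what a consumer would have to do. Everything in it is an assumption: the convention that attribute1 names the node type and the remaining attributes hold values by position, the function names to_generic and from_generic, and the SCHEMAS table are all hypothetical and not part of CommonMark.dtd.

```python
# Hypothetical per-type schemas: the meaning of attribute2, attribute3, ...
# lives outside the document, in a table like this one.
SCHEMAS = {"bookmark": ["bookmarkId", "title"]}


def to_generic(kind, values):
    """Map a node type and positional values onto attribute1..attributeN."""
    attrs = {"attribute1": kind}
    for position, value in enumerate(values, start=2):
        attrs["attribute%d" % position] = value
    return attrs


def from_generic(attrs, schemas=SCHEMAS):
    """Recover named parameters from generic attributes via the schema table."""
    kind = attrs["attribute1"]
    names = schemas[kind]
    params = {name: attrs["attribute%d" % position]
              for position, name in enumerate(names, start=2)}
    return kind, params
```

The document itself no longer says what attribute2 means; that knowledge moves into the reader's schema table, which is the readability cost of this approach.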

fhaag commented 8 years ago

So I guess we can't have custom attributes. Perhaps the best we could do would be defining several generic attributes like attribute1, attribute2, etc.? That doesn't seem too nice.

Indeed - as a matter of fact, while this removes the need to invoke a nested Xml parser on the wrapped document, it makes the parsing code - and arguably, the embedded custom data - less readable, as the Xml names are not descriptive any more.

So, my options while purely relying on CommonMark.dtd seem pretty limited. In the end, I am currently considering two options for my concrete case: