Closed fhaag closed 5 years ago
+++ Florian Haag [Nov 02 16 03:31 ]:
attributes. So, this is a valid fragment of such an AST Xml document:
Hello, World!
Then, how can I indicate which custom inline element is used there? The custom_inline element does not seem to allow for any custom attributes or contents that can help distinguish different custom inline elements (let alone pass any other configuration for the custom inline). I see I can (ab?)use the on_enter and on_exit attributes to store whichever information I like, but that feels hacky.
The conception was that custom elements were just a way to insert some literal content for a specific output format, kind of like RawBlock and RawInline in pandoc.
But to make this more flexible, we allowed them to contain nodes, and added two places where literal content could be inserted.
There's an example for syntax highlighting in lcmark (a Lua wrapper): https://github.com/jgm/lcmark/blob/master/filters/highlight.lua
Here the custom block is just used to insert some literal content in the target format (on_enter is used for this).
An example where you might want the full power:
Suppose you wanted to emulate pandoc's treatment of an image alone in a paragraph as a figure. You could write a filter that finds images alone in their paragraphs, and replaces them with custom nodes whose node content is the same as the image's (this would be the image description), but with special literal text before (on_enter) and after (on_exit), e.g. a <figure> element in HTML, or a figure environment in LaTeX.
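The figure filter described above can be sketched against the AST Xml rather than against cmark's node API. This is a minimal Python sketch, not the lcmark filter: the helper names q and images_to_figures are hypothetical, the HTML placed in on_enter and on_exit is only one possible choice, and the image description is kept inside a paragraph so that the custom_block still contains block content.

```python
import html
import xml.etree.ElementTree as ET

CM_NS = "http://commonmark.org/xml/1.0"  # namespace used by the AST Xml in this thread


def q(name):
    """Qualify a CommonMark element name for ElementTree lookups (hypothetical helper)."""
    return f"{{{CM_NS}}}{name}"


def images_to_figures(root):
    """Wrap every image that stands alone in its paragraph in a custom_block."""
    # Collect the targets first so the tree is not modified while it is walked.
    targets = [
        (parent, index)
        for parent in root.iter()
        for index, child in enumerate(parent)
        if child.tag == q("paragraph")
        and len(child) == 1
        and child[0].tag == q("image")
        and not (child.text or "").strip()
        and not (child[0].tail or "").strip()
    ]
    for parent, index in targets:
        image = parent[index][0]
        src = html.escape(image.get("destination", ""), quote=True)
        # Assumed HTML output; a LaTeX target would use different literal text.
        figure = ET.Element(q("custom_block"), {
            "on_enter": f'<figure><img src="{src}" alt="" /><figcaption>',
            "on_exit": "</figcaption></figure>",
        })
        # The node content is the image description, moved into a new paragraph.
        caption = ET.SubElement(figure, q("paragraph"))
        caption.extend(list(image))
        figure.tail = parent[index].tail
        parent[index] = figure
    return root


doc = ET.fromstring(
    '<document xmlns="http://commonmark.org/xml/1.0">'
    '<paragraph><image destination="cat.png" title=""><text>A cat</text></image></paragraph>'
    '<paragraph><text>Plain text</text></paragraph>'
    '</document>'
)
images_to_figures(doc)
```

Only the first paragraph is replaced here; the second one contains no image and is left untouched.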
I see ... I'm not targeting a text-based output format, so it seems I need to encode a structured information container and put it into either on_enter or on_exit, from where I can unwrap it while reading the AST document. For instance, let's say I support a special link
[place a bookmark](bookmark:abc "Click here to place a bookmark")
in CommonMark-conformant Markdown that, rather than opening a URL, will set a bookmark in the host system with identifier abc. I can recognize this special type of link based upon the destination, but I want to do so in the stage of parsing Markdown to AST, not afterwards. Therefore, I need to store the parameters of the special link in a structured format, such as Xml:
<?xml version="1.0" encoding="UTF-8"?>
<my:bookmark xmlns:my="http://my.example.org" bookmarkId="abc" title="Click here to place a bookmark"/>
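Recognizing the special link by its destination and producing such a payload could look roughly like the following Python sketch. The function name bookmark_payload is hypothetical, and the way the destination and title are obtained from the parser (CommonMark.NET in this case) is left out entirely.

```python
import xml.etree.ElementTree as ET

MY_NS = "http://my.example.org"  # namespace URI taken from the example above


def bookmark_payload(destination, title):
    """Return the serialized <my:bookmark> element for a 'bookmark:' link, else None."""
    scheme, sep, bookmark_id = destination.partition(":")
    if sep != ":" or scheme != "bookmark" or not bookmark_id:
        return None  # an ordinary link; leave it to the normal link handling
    ET.register_namespace("my", MY_NS)  # serialize with the 'my' prefix
    element = ET.Element(
        f"{{{MY_NS}}}bookmark",
        {"bookmarkId": bookmark_id, "title": title},
    )
    return ET.tostring(element, encoding="unicode")


payload = bookmark_payload("bookmark:abc", "Click here to place a bookmark")
ordinary = bookmark_payload("http://example.org", "An ordinary link")
```

Here payload holds the serialized my:bookmark element (without an Xml declaration), while ordinary is None because the destination does not use the bookmark: scheme.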
I cannot merge the relevant fragment directly into the AST Xml (as the DTD does not allow that), like so (by describing my special link with my own element from a custom namespace):
<?xml version="1.0" encoding="UTF-8"?>
<document xmlns="http://commonmark.org/xml/1.0" xmlns:my="http://my.example.org">
<paragraph>
<my:bookmark bookmarkId="abc" title="Click here to place a bookmark">
<text>place a bookmark</text>
</my:bookmark>
</paragraph>
</document>
Or maybe like this (with an extra <custom_data> element that accepts arbitrary content):
<?xml version="1.0" encoding="UTF-8"?>
<document xmlns="http://commonmark.org/xml/1.0">
<paragraph>
<custom_inline>
<custom_data>
<my:bookmark xmlns:my="http://my.example.org" bookmarkId="abc" title="Click here to place a bookmark"/>
</custom_data>
<text>place a bookmark</text>
</custom_inline>
</paragraph>
</document>
Instead, I have to wrap my custom Xml document into one of the attributes of the <custom_inline> element:
<?xml version="1.0" encoding="UTF-8"?>
<document xmlns="http://commonmark.org/xml/1.0">
<paragraph>
<custom_inline on_enter="&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;&lt;my:bookmark xmlns:my=&quot;http://my.example.org&quot; bookmarkId=&quot;abc&quot; title=&quot;Click here to place a bookmark&quot;/&gt;">
<text>place a bookmark</text>
</custom_inline>
</paragraph>
</document>
This seems somewhat ... ugly, but if it is truly the recommended way, I will consider using it.
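For reference, the wrapping and unwrapping of a nested Xml payload in on_enter can be sketched in Python as follows. The function names wrap and unwrap are hypothetical, the payload is stored without an Xml declaration to keep the sketch short, and nothing here is validated against CommonMark.dtd. The serializer takes care of escaping the payload inside the attribute value.

```python
import xml.etree.ElementTree as ET

CM_NS = "http://commonmark.org/xml/1.0"
ET.register_namespace("", CM_NS)  # serialize CommonMark elements without a prefix

PAYLOAD = (
    '<my:bookmark xmlns:my="http://my.example.org" '
    'bookmarkId="abc" title="Click here to place a bookmark"/>'
)


def wrap(payload, label):
    """Build a small AST document whose custom_inline carries payload in on_enter."""
    doc = ET.Element(f"{{{CM_NS}}}document")
    paragraph = ET.SubElement(doc, f"{{{CM_NS}}}paragraph")
    inline = ET.SubElement(paragraph, f"{{{CM_NS}}}custom_inline", {"on_enter": payload})
    text = ET.SubElement(inline, f"{{{CM_NS}}}text")
    text.text = label
    # tostring escapes <, >, & and " inside the attribute value.
    return ET.tostring(doc, encoding="unicode")


def unwrap(ast_xml):
    """Run a nested parse on every on_enter value; return (tag, attributes) pairs."""
    root = ET.fromstring(ast_xml)
    found = []
    for inline in root.iter(f"{{{CM_NS}}}custom_inline"):
        nested = ET.fromstring(inline.get("on_enter"))
        found.append((nested.tag, dict(nested.attrib)))
    return found


serialized = wrap(PAYLOAD, "place a bookmark")
bookmarks = unwrap(serialized)
```

The nested parse in unwrap is exactly the extra step that makes this approach feel heavy: every consumer of the AST document has to know that on_enter may contain a second Xml document.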
Agreed, that's ugly. As I explained, custom_inline/block were added to provide a way to pass through raw content, not to provide a flexible way of adding custom node types.
I suppose that if we wanted to make them more flexible, we could consider adding some mechanism for inserting custom attributes to a custom node. This would add a lot of complexity in the C implementation, but I suppose we could add this to the DTD without adding support for it to the C implementation.
If you could change the DTD to add this, how would you do it?
I'd be curious whether others have thoughts on this.
Thank you for the explanations. I don't know much about authoring DTDs, as I usually use Xml Schema for specifying Xml formats. From looking around on the web, though, it would seem that both of my proposed "tidier" Xml documents might prove problematic with DTDs:
For one, Xml namespaces are not really supported by DTDs, as "XML DTD syntax predates XML namespaces". DTDs can declare something that looks like a qualified Xml name, but it does not behave like one:
When given namespace "prefixes" in a DTD, on the other hand, the prefix part is simply considered part of the element name (since DTD has no namespace concept). Therefore, the "prefix" canNOT be altered and there is no notion of namespace URIs (nor a default namespace).
The CommonMark DTD and another, custom DTD could be used together (i.e. referred to from the same Xml document), but the custom DTD must be designed not to clash with the CommonMark DTD:
actually combining two DTDs can be a problem if they happen to use the same element or entity names for different purposes. DTD syntax was designed to allow you to combine sets of declarations designed for that purpose, but if they weren't so designed ... trouble.
At the same time, there does not seem to be a way to allow arbitrary undeclared Xml elements (in particular, elements from a custom namespace) as children of elements from the DTD (such as the <custom_data> element I suggested above), as explained in a StackOverflow answer:
You came as close as a DTD can come, by using the ANY keyword. But ANY matches a mixture of #PCDATA and every element declared in the DTD. It doesn't accept undeclared elements
If the custom elements are declared in a DTD of their own, it can be imported as described above, as long as there are no clashes with CommonMark.dtd. W3 deems it possible to combine a DTD and an Xml Schema, but as far as I can see there, they only use the entity definitions from the DTD, no elements - because I do not think there would be a way to have the DTD allow for custom namespace prefix declarations (in the xmlns:prefix="namespace" form) to import Xml Schemas right into DTD-declared elements.
So, I second your curiosity whether there are any further thoughts on this, possibly also with respect to this seemingly related thread.
I was thinking custom attributes could be like data-* attributes in HTML5.
But here I see:
There is no HTML5 DTD. The HTML5 RC explicitly says this when discussing XHTML serialization, and this clearly applies to HTML serialization as well.
DTDs have been regarded by the designers of HTML5 as too limited in expressive power, and HTML5 validators (basically the HTML5 mode of http://validator.nu and its copy at http://validator.w3.org) use schemas and ad hoc checks, not DTD-based validation.
Moreover, HTML5 has been designed so that writing a DTD for it is impossible. For example, there is no SGML way to capture the HTML5 rule that any attribute name that starts with “data-” and complies with certain general rules is valid. In SGML, attributes need to be listed individually, so a DTD would need to be infinite.
So I guess we can't have custom attributes. Perhaps the best we could do would be defining several generic attributes like attribute1, attribute2, etc.? That doesn't seem too nice.
Indeed - as a matter of fact, while this removes the need to invoke a nested Xml parser on the wrapped document, it makes the parsing code - and arguably, the embedded custom data - less readable, as Xml names are not descriptive any more.
So, my options while purely relying on CommonMark.dtd seem pretty limited. In the end, I am currently considering two options for my concrete case:
CommonMark.dtd as suggested by @tin-pot in the other thread, some custom content could be included, if that custom content is also declared in DTD format, and without really making use of the namespaces concept that is otherwise inherent to Xml.
CommonMark.dtd, with the additional rule that they may contain certain application-specific elements. Obviously, this would break the ability to automatically validate the documents (or I might embed an Xml Schema that replicates what is defined in CommonMark.dtd with the additional rule as a resource in my application ...), but at least the data format would be well-defined :-/
I am using CommonMark (concretely, the CommonMark.NET implementation) in a project that, at one point, stores the parsed AST as an Xml document. I am trying to keep these Xml files valid with respect to CommonMark.dtd.
On top of the standard-conformant CommonMark elements, I use a handful of additional features for integrating the documents with the host application. Now, I'm struggling to see how to represent these additional elements in the AST Xml.
The DTD defines two Xml elements for custom content:
custom_block
custom_inline
Is there an example of how to use these? It seems, for instance, <custom_inline> expects other inlines as its content, and no additional attributes. So, this is a valid fragment of such an AST Xml document:
Then, how can I indicate which custom inline element is used there? The custom_inline element does not seem to allow for any custom attributes or contents that can help distinguish different custom inline elements (let alone pass any other configuration for the custom inline). I see I can (ab?)use the on_enter and on_exit attributes to store whichever information I like, but that feels hacky.
While we're at it, a cleaner way of achieving this would be to represent custom blocks and inlines with custom Xml elements from another namespace. Unfortunately, the document element does not permit the declaration of custom namespaces by means of xmlns:... attributes. Maybe that is where something could be changed?