Separate package for Markdown AST

I flouted this idea a long time ago, but I would now like to open this up for more specific discussion and feedback.

In short, I think it would be useful to have some sort of an interface package for representing the Markdown AST (basically, the Node type, the AbstractContainer abstract types, and the specific container types; + APIs). At the same time decoupling it from the parser and writer details, to keep it stable and lightweight. I also imagine it would provide clean, documented APIs to then work with that AST (e.g. returning all children, accessing the node data; as far as I can tell, CommonMark currently doesn't really provide that many official APIs for this kind of stuff).

It would allow parsers and consumers (e.g. writers, Documenter) to pass Markdown AST back and forth via a standard interface. In principle, it could all also live here and achieve the same thing, but the AST would then be versioned together with changes to e.g. parsers and writers. I feel that having those things versioned separately would be good (the AST package would be very conservative, but CommonMark could still make rapid changes).

The design of the AST here (a single Node type that handles the structure of the AST), when compared to the Markdown standard library AST, feels much better (provides type stability, but also easily allows a lot of general operations on the whole tree without having to special case for each node type). As such, I think lifting the AST from here to a separate package and then sprinkling some APIs and documentation around it could largely be all that is needed. CommonMark, I imagine, would then depend on the AST package and use the AST types from there.

If this seems like a reasonable idea, I am volunteering to do the legwork. This would be a part of some refactoring and cleaning up of Documenter I would like to do anyway.

A few other thoughts:

I do think that the AST package should be minimal. Just the AST, no parsers or writers, which should stay here. The only writer that should probably be moved to the AST package is the terminal one, as pretty a representation of the AST in the terminal is probably useful for the users of the interface package.
I think it should also explicitly support implementing new nodes defined by the user (technically, just a new AbstractContainer, which can already be done). In particular, I would like to switch the internal representation of Documenter documents over to such a standardized AST. But then in the early parsing steps Documenter would replace e.g. the at-code-blocks with its own nodes.
Related to the both previous points: I would suggest that the set of node types in the external package would be the union of standard CommonMark nodes + Julia Markdown nodes. Any additional extensions would stay here.
The CommonMark Node type seems to contain some parser-specific information, which could be out of place in the interface package. So I am wondering if that would create some friction for using the interface package here directly. Generally, I can imagine three different ways to implement the interface package:
1. As said above, simply lift all the relevant types from here to a separate package, and maybe create some API for attaching custom metadata to a node that CommonMark could use?
2. The interface package could have its own Node implementation, but there would be some conversion functionality between the ASTs.
3. Finally, and I am not sure if this is even doable, but perhaps the interface package could just provide and abstract node type and define APIs, and users could then implement their own node types. Consumers would then just rely on the API.
Just as a quick side project, I might revive CMark.jl, which links against the libcmark and libcmark-gfm C parsers, and have it also produce this standardized AST. I don't think users should use the C parsers if they're parsing Markdown (CommonMark.jl is preferred), but this could be useful for testing.

100% behind making the representation of the AST more standalone, as well as the route you're proposing.

should probably be moved to the AST package is the terminal one

That one also needs to be re-written without use of Crayons, this is one of the main blocking issues if being able to upstream CM into the standard libs.

Just as a quick side project, I might revive CMark.jl, which links against the libcmark and libcmark-gfm C parsers, and have it also produce this standardized AST. I don't think users should use the C parsers if they're parsing Markdown (CommonMark.jl is preferred), but this could be useful for testing.

That would be quite nice to be able to cross-check against other implementations like that, would likely unearth some nice corner cases.

I now have put together a WIP version of a separate AST package. It needs heaps of corner-case fixing, general cleanup and more documentation, but I think the only major thing that is missing is the terminal printing.

However, it does deviate from the implementation here in some ways (I found it easier to implement these changes right away, rather than starting by verbatim lifting things from here). I would love to have your feedback on those changes, especially if you have objections or suggestions for alternatives.

The main differences from the CommonMark.jl implementation should be:

The Node type has a fewer fields, dropping a bunch of the code metadata fields, and is instead a parametric type. The type parameter can be used to add additional fields.

My assumption is that in CommonMark would have its own metadata container e.g.
```
mutable struct CommonMarkMeta
   sourcepos::SourcePos
   last_line_blank::Bool
   last_line_checked::Bool
   is_open::Bool
   literal::String
   meta::Dict{String,Any}
end

const CommonMarkNode = Node{CommonMarkMeta}
```
and would return Node{CommonMarkMeta} trees.
The "null" node in the pointers is represented by a nothing (as opposed to NULL_NODE), and the fields are Union{Node{M}, Nothing}.

This pattern feels more idiomatic to me in Julia, but I am not sure if there is e.g. a performance reason for using the NULL_NODE pattern instead. But I believe the small unions are performant and type-stable?
Accessing the various fields of a Node are now done via accessor method (e.g. next(), children(), node[] for the AbstractContainer instance, or meta() for the user-defined metadata; more in docs of Node). However, I don't think it would be unreasonable to stick to fields/properties for some of these APIs.
AbstractContainer -> AbstractElement as the top level abstract type. I felt that using the word "container" here was not quite accurate, since some of them are not containers, but leafs.

"Element" however, feels like a good general word referring to the different types of nodes. I borrowed it from HTML, but the CommonMark standard actually also refers to "structural elements" (e.g. "We can think of a document as a sequence of blocks—structural elements like paragraphs, ...").
Instead of append_child and prepend_child, I overload push! and pushfirst! for this. I felt that "append"/"prepend" could be confusing, as in the standard library append! and prepend! concatenate two collections, rather than adding an element. However, at the same time, I am not really sure it makes sense to think of a node as a "collection of its children", which this choice implies.
Some of CommonMark elements (AbstractContainers) use the .literal field for their data (e.g. Text). I think that all information about the element should be in the AbstractElement (AbstractContainer) object (to avoid having to go back to the Node). So e.g. Text() elements now have a .text field that contains the text content, and I think a few other elements have the same change.

Sorry again for the TL;DR :sweat_smile: There are plenty WIP items in the current implementation and I have opened a bunch of issues on that repo too, mainly keep track of stuff for myself.

Thanks for the updates.

The type parameter can be used to add additional fields.

Any idea yet whether that's going to require significant changes in CM.jl to switch, or will changes be minimal? Currently the struct mirrors the what is used in the JS and Python libs on which this is based, ideally I don't want to stray too far from there so that folding changes from there into this package is easier.

But I believe the small unions are performant and type-stable?

If it's just as fast then that's fine, when originally written it was required to avoid dynamic dispatch in many places.

node[] for the AbstractContainer instance

Perhaps container() rather than overloading getindex, which adds inconsistencies in how you access particular parts of the nodes.

e.g. next()

Probably too generic, unless we're expecting to not export?

AbstractContainer -> AbstractElement as the top level abstract type. I felt that using the word "container" here was not quite accurate, since some of them are not containers, but leafs.

That's fine.

Instead of append_child and prepend_child, I overload push! and pushfirst! for this. I felt that "append"/"prepend" could be confusing, as in the standard library append! and prepend! concatenate two collections, rather than adding an element. However, at the same time, I am not really sure it makes sense to think of a node as a "collection of its children", which this choice implies.

Those were intentionally not added to the push! and pushfirst! methods since I didn't feel they could really reasonably be classed as "array-like" enough for it not to be punning.

So e.g. Text() elements now have a .text field that contains the text content, and I think a few other elements have the same change.

Same comment about matching the upstream packages that this one is based on. This might be more borderline than changing the fields of the node type, so might be more acceptable.

I'll have more comments when I can get around to looking over the package itself.

Ah, I didn't consider keeping the code and API similar to the other implementations as a goal. The downside of sticking to those conventions is that it would restrict us in providing the cleanest, most Julian API.

However, I think we could get the best of both worlds with some getproperty/setproperty! magic. We can have whatever structure / API for Node we want, and then here just overload the method specific to CommonMark, e.g.

function Base.getproperty(node::Node{CommonMarkMeta}, name::Symbol)
    if name === :literal
        meta(node).literal
    elseif name === :nxt
        next(node)
        # etc.
    else
        # Fallback to the properties of general Node type. This is useful if we e.g. decide
        # to provide a .children property, without having to re-implement all the general
        # properties of Node for the specialized Node{CommonMarkMeta}.
        invoke(getproperty, Tuple{Node,Symbol}, node, name)
    end
end

to provide the necessary "fields" that can be used internally in CommonMark. This way we should be able to get away with really minimal changes to CommonMark's code.

I think Julia's constant propagation keeps the whole thing performant too. For example: ```julia struct Bar x :: Any end function Base.getproperty(value :: Bar, name :: Symbol) if name === :typeofx println(typeof(value.x)) elseif name === :y println("yyy") else getfield(value, name) end end function bbtest(b::Bar) b.typeofx x = b.x b.y return x end @code_llvm bbtest(Bar(123)) ``` The LLVM IR clearly shows that the `println` etc. are nicely inlined, rather than doing `if`s on every `getproperty` call.

So e.g. Text() elements now have a .text field that contains the text content, and I think a few other elements have the same change.

Same comment about matching the upstream packages that this one is based on. This might be more borderline than changing the fields of the node type, so might be more acceptable.

An option here could be to do some more complex overloading of the .literal field, which for certain element types returns/sets the value from the element, rather than from the .meta.literal field (or, perhaps even keeps both consistent).

But I believe the small unions are performant and type-stable?

If it's just as fast then that's fine, when originally written it was required to avoid dynamic dispatch in many places.

I am not super-informed about this, just something I've read in passing (e.g. e.g. this blog post). However, I constructed this test case:

@noinline _next(node::Node) = node.nxt
@noinline print_node(::Nothing) = println("- nothing")
@noinline print_node(::Node) = println("- node")
function print_node_dispatch(node)
    nxt = _next(node)
    print_node(nxt)
    print_node(nxt)
    if isnothing(nxt)
        print_node(nxt)
    end
end

@code_warntype print_node_dispatch(Node(Document()))
@code_llvm print_node_dispatch(Node(Document()))

In this case, nxt is clearly type-unstable (::Union{Nothing,Node}), but the LLVM indicates that Julia is smart enough to branch this such that you have either 3 sequential calls to print_node(::Nothing) or 2 sequential calls to print_node(::Node). No dynamic dispatch in this case as far as I can tell. I wonder though if there are any other cases where it might still be an issue?

For the other things, I opened two bikeshedding issues: JuliaDocs/MarkdownAST.jl#10, JuliaDocs/MarkdownAST.jl#11.

MichaelHatherly / CommonMark.jl

Separate package for Markdown AST #41