MichaelHatherly / CommonMark.jl

A CommonMark-compliant Markdown parser for Julia.
Other
83 stars 11 forks source link

Separate package for Markdown AST #41

Open mortenpi opened 2 years ago

mortenpi commented 2 years ago

I flouted this idea a long time ago, but I would now like to open this up for more specific discussion and feedback.

In short, I think it would be useful to have some sort of an interface package for representing the Markdown AST (basically, the Node type, the AbstractContainer abstract types, and the specific container types; + APIs). At the same time decoupling it from the parser and writer details, to keep it stable and lightweight. I also imagine it would provide clean, documented APIs to then work with that AST (e.g. returning all children, accessing the node data; as far as I can tell, CommonMark currently doesn't really provide that many official APIs for this kind of stuff).

It would allow parsers and consumers (e.g. writers, Documenter) to pass Markdown AST back and forth via a standard interface. In principle, it could all also live here and achieve the same thing, but the AST would then be versioned together with changes to e.g. parsers and writers. I feel that having those things versioned separately would be good (the AST package would be very conservative, but CommonMark could still make rapid changes).

The design of the AST here (a single Node type that handles the structure of the AST), when compared to the Markdown standard library AST, feels much better (provides type stability, but also easily allows a lot of general operations on the whole tree without having to special case for each node type). As such, I think lifting the AST from here to a separate package and then sprinkling some APIs and documentation around it could largely be all that is needed. CommonMark, I imagine, would then depend on the AST package and use the AST types from there.

If this seems like a reasonable idea, I am volunteering to do the legwork. This would be a part of some refactoring and cleaning up of Documenter I would like to do anyway.


A few other thoughts:

MichaelHatherly commented 2 years ago

100% behind making the representation of the AST more standalone, as well as the route you're proposing.

should probably be moved to the AST package is the terminal one

That one also needs to be re-written without use of Crayons, this is one of the main blocking issues if being able to upstream CM into the standard libs.

Just as a quick side project, I might revive CMark.jl, which links against the libcmark and libcmark-gfm C parsers, and have it also produce this standardized AST. I don't think users should use the C parsers if they're parsing Markdown (CommonMark.jl is preferred), but this could be useful for testing.

That would be quite nice to be able to cross-check against other implementations like that, would likely unearth some nice corner cases.

mortenpi commented 2 years ago

I now have put together a WIP version of a separate AST package. It needs heaps of corner-case fixing, general cleanup and more documentation, but I think the only major thing that is missing is the terminal printing.

However, it does deviate from the implementation here in some ways (I found it easier to implement these changes right away, rather than starting by verbatim lifting things from here). I would love to have your feedback on those changes, especially if you have objections or suggestions for alternatives.

The main differences from the CommonMark.jl implementation should be:

  1. The Node type has a fewer fields, dropping a bunch of the code metadata fields, and is instead a parametric type. The type parameter can be used to add additional fields.

    My assumption is that in CommonMark would have its own metadata container e.g.

    mutable struct CommonMarkMeta
       sourcepos::SourcePos
       last_line_blank::Bool
       last_line_checked::Bool
       is_open::Bool
       literal::String
       meta::Dict{String,Any}
    end
    
    const CommonMarkNode = Node{CommonMarkMeta}

    and would return Node{CommonMarkMeta} trees.

  2. The "null" node in the pointers is represented by a nothing (as opposed to NULL_NODE), and the fields are Union{Node{M}, Nothing}.

    This pattern feels more idiomatic to me in Julia, but I am not sure if there is e.g. a performance reason for using the NULL_NODE pattern instead. But I believe the small unions are performant and type-stable?

  3. Accessing the various fields of a Node are now done via accessor method (e.g. next(), children(), node[] for the AbstractContainer instance, or meta() for the user-defined metadata; more in docs of Node). However, I don't think it would be unreasonable to stick to fields/properties for some of these APIs.

  4. AbstractContainer -> AbstractElement as the top level abstract type. I felt that using the word "container" here was not quite accurate, since some of them are not containers, but leafs.

    "Element" however, feels like a good general word referring to the different types of nodes. I borrowed it from HTML, but the CommonMark standard actually also refers to "structural elements" (e.g. "We can think of a document as a sequence of blocks—structural elements like paragraphs, ...").

  5. Instead of append_child and prepend_child, I overload push! and pushfirst! for this. I felt that "append"/"prepend" could be confusing, as in the standard library append! and prepend! concatenate two collections, rather than adding an element. However, at the same time, I am not really sure it makes sense to think of a node as a "collection of its children", which this choice implies.

  6. Some of CommonMark elements (AbstractContainers) use the .literal field for their data (e.g. Text). I think that all information about the element should be in the AbstractElement (AbstractContainer) object (to avoid having to go back to the Node). So e.g. Text() elements now have a .text field that contains the text content, and I think a few other elements have the same change.

Sorry again for the TL;DR :sweat_smile: There are plenty WIP items in the current implementation and I have opened a bunch of issues on that repo too, mainly keep track of stuff for myself.

MichaelHatherly commented 2 years ago

Thanks for the updates.

The type parameter can be used to add additional fields.

Any idea yet whether that's going to require significant changes in CM.jl to switch, or will changes be minimal? Currently the struct mirrors the what is used in the JS and Python libs on which this is based, ideally I don't want to stray too far from there so that folding changes from there into this package is easier.

But I believe the small unions are performant and type-stable?

If it's just as fast then that's fine, when originally written it was required to avoid dynamic dispatch in many places.

node[] for the AbstractContainer instance

Perhaps container() rather than overloading getindex, which adds inconsistencies in how you access particular parts of the nodes.

e.g. next()

Probably too generic, unless we're expecting to not export?

AbstractContainer -> AbstractElement as the top level abstract type. I felt that using the word "container" here was not quite accurate, since some of them are not containers, but leafs.

That's fine.

Instead of append_child and prepend_child, I overload push! and pushfirst! for this. I felt that "append"/"prepend" could be confusing, as in the standard library append! and prepend! concatenate two collections, rather than adding an element. However, at the same time, I am not really sure it makes sense to think of a node as a "collection of its children", which this choice implies.

Those were intentionally not added to the push! and pushfirst! methods since I didn't feel they could really reasonably be classed as "array-like" enough for it not to be punning.

So e.g. Text() elements now have a .text field that contains the text content, and I think a few other elements have the same change.

Same comment about matching the upstream packages that this one is based on. This might be more borderline than changing the fields of the node type, so might be more acceptable.


I'll have more comments when I can get around to looking over the package itself.

mortenpi commented 2 years ago

Ah, I didn't consider keeping the code and API similar to the other implementations as a goal. The downside of sticking to those conventions is that it would restrict us in providing the cleanest, most Julian API.

However, I think we could get the best of both worlds with some getproperty/setproperty! magic. We can have whatever structure / API for Node we want, and then here just overload the method specific to CommonMark, e.g.

function Base.getproperty(node::Node{CommonMarkMeta}, name::Symbol)
    if name === :literal
        meta(node).literal
    elseif name === :nxt
        next(node)
        # etc.
    else
        # Fallback to the properties of general Node type. This is useful if we e.g. decide
        # to provide a .children property, without having to re-implement all the general
        # properties of Node for the specialized Node{CommonMarkMeta}.
        invoke(getproperty, Tuple{Node,Symbol}, node, name)
    end
end

to provide the necessary "fields" that can be used internally in CommonMark. This way we should be able to get away with really minimal changes to CommonMark's code.

I think Julia's constant propagation keeps the whole thing performant too. For example: ```julia struct Bar x :: Any end function Base.getproperty(value :: Bar, name :: Symbol) if name === :typeofx println(typeof(value.x)) elseif name === :y println("yyy") else getfield(value, name) end end function bbtest(b::Bar) b.typeofx x = b.x b.y return x end @code_llvm bbtest(Bar(123)) ``` The LLVM IR clearly shows that the `println` etc. are nicely inlined, rather than doing `if`s on every `getproperty` call.

So e.g. Text() elements now have a .text field that contains the text content, and I think a few other elements have the same change.

Same comment about matching the upstream packages that this one is based on. This might be more borderline than changing the fields of the node type, so might be more acceptable.

An option here could be to do some more complex overloading of the .literal field, which for certain element types returns/sets the value from the element, rather than from the .meta.literal field (or, perhaps even keeps both consistent).


But I believe the small unions are performant and type-stable?

If it's just as fast then that's fine, when originally written it was required to avoid dynamic dispatch in many places.

I am not super-informed about this, just something I've read in passing (e.g. e.g. this blog post). However, I constructed this test case:

@noinline _next(node::Node) = node.nxt
@noinline print_node(::Nothing) = println("- nothing")
@noinline print_node(::Node) = println("- node")
function print_node_dispatch(node)
    nxt = _next(node)
    print_node(nxt)
    print_node(nxt)
    if isnothing(nxt)
        print_node(nxt)
    end
end

@code_warntype print_node_dispatch(Node(Document()))
@code_llvm print_node_dispatch(Node(Document()))

In this case, nxt is clearly type-unstable (::Union{Nothing,Node}), but the LLVM indicates that Julia is smart enough to branch this such that you have either 3 sequential calls to print_node(::Nothing) or 2 sequential calls to print_node(::Node). No dynamic dispatch in this case as far as I can tell. I wonder though if there are any other cases where it might still be an issue?


For the other things, I opened two bikeshedding issues: JuliaDocs/MarkdownAST.jl#10, JuliaDocs/MarkdownAST.jl#11.