MichaelHatherly / CommonMark.jl

A CommonMark-compliant Markdown parser for Julia.
Other
84 stars 11 forks source link

WIP: Use MarkdownAST representation for the AST #49

Open mortenpi opened 1 year ago

mortenpi commented 1 year ago

This is just a proof of concept, but shows how we can use the Node type from MarkdownAST, by just overloading the properties of the fully qualified type.

julia> using CommonMark, MarkdownAST

julia> p = Parser()
Parser(Node(CommonMark.Document))

julia> md = p("Foo *bar* [baz](x \"y\").")
 Foo bar baz.

# Use the AST printing method from MarkdownAST:
julia> invoke(show, Tuple{IO, MarkdownAST.Node}, stdout, md)
@ast CommonMark.NodeMeta CommonMark.Document() do
  CommonMark.Paragraph() do
    CommonMark.Text()
    CommonMark.Emph() do
      CommonMark.Text()
    end
    CommonMark.Text()
    CommonMark.Link("x", "y") do
      CommonMark.Text()
    end
    CommonMark.Text()
  end
end

I am actually a little surprised that this required literally no code changes. The next step I can take is to replace most of the container/element types, which should also be trivial, since I don't think I changed their fields (Text is an exception).

mortenpi commented 1 year ago

The next step I can take is to replace most of the container/element types, which should also be trivial, since I don't think I changed their fields (Text is an exception).

Okay, that was a little optimistic, but I did manage to hack it until I got it to work without any changes to the rest of CommonMark (except for removing type definitions). However, that was more an academic exercise, and not the final product.

Here are the current differences between CommonMark and MarkdownAST that required workarounds. @MichaelHatherly, would you have a bit of time to make some executive decisions on how you think it would be best to proceed with these? I tried to list some potential options for each of them, but I am unsure which ones would be the most appropriate for CM.

Elements using .literal field

A bunch of elements use the .literal field of Node to store their content (Text, CodeBlock,Code, DisplayMath, InlineMath, HTMLBlock, HTMLInline). In MarkdownAST I would like have all the element-level information to be stored in the element object (e.g. .text field of the Text element). Options:

  1. Overload the .literal field, so that updating it keeps the corresponding element field updated too. This is the approach here right now, and doesn't require any changes to CM.
  2. Update CM so that it uses the element fields directly and make sure we populate them with the final .literal values where appropriate.
  3. Stick to the .literal field for some elements. But I am not a big fan of this, as I strongly feel that you shouldn't have to refer back to the container Node object to figure out e.g. what code is in a code block.

Elements containing additional fields

Some elements contain additional fields that I did not transfer to MarkdownAST:

Mostly, I am not quite sure what information these fields contain, and they look more like some internal parser information. I would like to keep the fields of the elements in MarkdownAST strictly to those that contain some semantic information about the AST.

In the current hack, I have, effectively, added a new field to Node (via the meta object) that can store another element-specific struct that contains the extra fields (it also has a dynamic type, like .t), and added some wrapping and overloads to make sure that everything keeps working. Options:

  1. Add this new element-specific field, and stick with the overloads and a wrapper type. This wouldn't require any code changes to CM.
  2. Add this field, but make accessing it in the parser functions explicit (e.g. node.meta.extra.node_block_type instead of node.t.html_block_type).
  3. Parse with CommonMark-specific intermediate elements that contain the extra information, and then at the end replace them with MarkdownAST elements in one big tree walk (similar to the JuliaExpression -> JuliaValue thing), discarding all the extra fields.
  4. Add all the fields CM needs the elements to have to MarkdownAST (it may well be that we should add some fields that I overlooked, but I don't think we need all of them).

Default constructors

The CommonMark elements generally have zero-argument constructors, and the fields of the elements are populated afterwards. In MarkdownAST, I generally want to construct a valid element right away, so I don't usually have the zero-argument constructors, and I also try to enforce some invariants in the constructors, to make sure you can't construct nonsensical elements. Options:

  1. Update all the uses of constructors to pass some reasonable default values.
  2. Have the elements in CM be just functions (e.g. CodeBlock() = MarkdownAST.CodeBlock("", "")). This would require updating all the places where the elements are used as types though (e.g. function signatures, isa checks).
  3. Add some function for constructing elements with default values to CM (e.g. element(::Type{CodeBlock}) = MarkdownAST.CodeBlock("", "") and call it as node.t = element(CodeBlock)).

In the current hack, I am just committing type piracy, and adding the necessary constructors to the MarkdownAST types.