Implementing a table of contents feature for CommonMark

drrajeshtalluri commented 4 years ago

Hi, I have been trying to implement a table of contents (toc) feature using the AST generated by CommonMark. However, my implementation is convoluted and messy. I would really appreciate feedback on better ways to create the toc, and I would like to know whether I could contribute this to this excellent package in any way.

The logic of my implementation is to create a hierarchical JSON data structure from the AST created by CommonMark. I wanted it in JSON so that the data can be used from JavaScript to create different types of tocs at multiple levels on a webpage. If only an HTML ul/li list is needed, there may be better approaches.

This is the function I wrote to create a hierarchical dict:

# input for this function is an ast for a markdown document created with a commonmark parser
function create_toc(ast)
    toc = [] # initialize toc

    ## iterate through the ast and extract the title, id and level of each heading
    for (node, enter) in ast
        if enter && node.t isa CommonMark.Heading
            hstr = string(node.t)
            hlvl = match(r"CommonMark.Heading\((.)\)", hstr)

            title_str = ""
            for (node, enter) in node
                if enter && node.t isa CommonMark.Text
                    title_str *= node.literal
                end
            end
            push!(toc,
                Dict{Any,Any}("hlvl" => hlvl[1],
                    "id" => node.meta["id"],
                    "title" => title_str,
                    "children" => nothing)) # create children field as it is needed
        end
    end

    # create another copy of the toc and modify it to create a hierarchical dict 
    ntoc = deepcopy(toc)

    ## this function finds the parent of a heading if any otherwise 0
    function findparent(tlist, nid)
        x = [i["hlvl"] for i in tlist]
        x = parse.(Int, x)
        a1 = x[nid]
        if a1 == 1 
            return 0
        else
            return findlast(x[1:nid] .== (a1 - 1))
        end
    end

    ## find parents for each heading
    pnode = [findparent(toc, i) for i in 1:length(toc)]

    for i in 1:length(toc)
        node = toc[i]
        push!(node, "idno" => i)
        push!(node, "parent" => pnode[i])
    end

    # attach children to their parents in reverse creating a hierarchical structure
    for node in reverse(toc)
        if node["parent"] != 0
            if ntoc[node["parent"]]["children"] === nothing
                ntoc[node["parent"]]["children"] = [ntoc[node["idno"]]] 
            else
                # prepend to keep document order (we iterate in reverse) and keep the list flat
                pushfirst!(ntoc[node["parent"]]["children"], ntoc[node["idno"]])
            end
        end
    end
    return ntoc[pnode.==0]
end

Example usage

using CommonMark
using JSON3
using YAML
parser = CommonMark.Parser()
CommonMark.enable!(parser, CommonMark.DollarMathRule())
CommonMark.enable!(parser, CommonMark.RawContentRule())
CommonMark.enable!(parser, CommonMark.TypographyRule())
CommonMark.enable!(parser, CommonMark.FrontMatterRule(yaml=YAML.load))
CommonMark.enable!(parser, CommonMark.AttributeRule())
CommonMark.enable!(parser, CommonMark.AutoIdentifierRule())
CommonMark.enable!(parser, CommonMark.AdmonitionRule())
CommonMark.enable!(parser, CommonMark.FootnoteRule())
CommonMark.enable!(parser, CommonMark.TableRule())
CommonMark.enable!(parser, CommonMark.MathRule())
CommonMark.enable!(parser, CommonMark.CitationRule())

ast = parser("---\r\ntitle: \"test toc\"\r\n---\r\n# Hello *world*\r\n\r\nThis is first heading.\r\n\r\n# Second heading\r\nthis is second heading\r\n\r\n## Level2 heading\r\n\r\nthis is level 2 heading\r\n\r\n### Level3 heading\r\n\r\nthis is level 3 heading\r\n\r\n## Level2 heading\r\n\r\n\r\nduplicate level 2 heading\r\n\r\n# Third heading\r\nthis is third heading\r\n\r\n\r\n\r\n## Level2 heading\r\n\r\n\r\nduplicate level 2 heading\r\n\r\n## Level3 heading\r\n\r\nduplicate level3 heading")

toc = create_toc(ast)

JSON3.write(toc)

Any feedback on implementation or other alternatives is appreciated. Thank You!

MichaelHatherly commented 4 years ago

If I were to add this kind of feature to the package, then avoiding foreign data structures (JSON) that mirror the document structure would be ideal. Here's a quick sketch of how I'd do this:

function toc(ast::CommonMark.Node)
    io = IOBuffer()
    for (node, enter) in ast
        if enter && node.t isa CommonMark.Heading
            t = node.t
            node.t = CommonMark.Paragraph()
            link = string("[", rstrip(markdown(node)), "](#", node.meta["id"], ")")
            node.t = t
            println(io, "    "^(t.level-1), "  * ", link)
        end
    end
    return Parser()(seekstart(io))
end

This just builds a nested markdown list with embedded links to the headings. It mirrors the levels of the headers themselves and keeps any text formatting found in them. It won't be the most efficient way to do it, since we're printing raw markdown to a buffer and then reparsing it, but it's definitely the simplest and most understandable approach I can come up with.
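To make the intermediate step concrete, here is a minimal, self-contained sketch of the markdown text that ends up in the buffer before it is reparsed. It does not use CommonMark.jl at all: `toc_markdown` and the `(level, title, id)` named tuples are hypothetical stand-ins for the values that `node.t.level`, `markdown(node)` and `node.meta["id"]` provide in the function above.

```julia
# Hypothetical helper, not part of CommonMark.jl: build the nested markdown
# list from plain (level, title, id) values using the same indentation rule
# as `toc` above (four spaces per extra heading level).
function toc_markdown(headings)
    io = IOBuffer()
    for h in headings
        println(io, "    "^(h.level - 1), "  * [", h.title, "](#", h.id, ")")
    end
    return String(take!(io))
end

headings = [
    (level = 1, title = "Hello *world*", id = "hello-world"),
    (level = 2, title = "Level2 heading", id = "level2-heading"),
]
print(toc_markdown(headings))
```

Because each heading level adds four spaces of indentation, the markdown parser treats deeper headings as nested list items when the text is reparsed.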

With regards to your implementation:

hstr = string(node.t)
hlvl = match(r"CommonMark.Heading\((.)\)", hstr)

Just use node.t.level for getting the Int describing the header level rather than using a regex.

title_str = ""
for (node, enter) in node
     if enter && node.t isa CommonMark.Text
         title_str *= node.literal
     end
end

This will strip the formatting, which I assume you're happy with. If not, you'd need to print the heading to a specific format and store that, or serialise the node to JSON.
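If a nested structure is still wanted for JSON output, the separate parent lookup can probably be replaced by a single pass with a stack of open headings. The following is only a sketch under stated assumptions: `nest_headings` and the `(level, id, title)` named tuples are hypothetical, with the values expected to come from `node.t.level`, `node.meta["id"]` and the collected heading text.

```julia
# Hypothetical helper, not part of CommonMark.jl: nest a flat,
# document-ordered list of headings into a tree of Dicts.
function nest_headings(headings)
    roots = Dict{String,Any}[]
    stack = Tuple{Int,Dict{String,Any}}[]  # (level, entry) of currently open ancestors
    for h in headings
        entry = Dict{String,Any}(
            "level" => h.level,
            "id" => h.id,
            "title" => h.title,
            "children" => Dict{String,Any}[],
        )
        # Close every open heading at the same or a deeper level.
        while !isempty(stack) && first(last(stack)) >= h.level
            pop!(stack)
        end
        if isempty(stack)
            push!(roots, entry)                       # no ancestor: top-level entry
        else
            push!(last(stack)[2]["children"], entry)  # attach to nearest shallower heading
        end
        push!(stack, (h.level, entry))
    end
    return roots
end
```

A heading that skips a level (for example a level 3 heading directly under a level 1 heading) is attached to the nearest shallower heading rather than being dropped. The result is made of plain `Dict`s and `Vector`s, so it should serialise with a JSON package without further conversion.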

drrajeshtalluri commented 4 years ago

Thanks so much! This is a great idea. I was trying to reconnect the heading nodes together by using your defined node type but could not figure out how to exactly do that. But the code you provided is even better as it gives the toc in ul/li form which is what is needed for this package, and is what is provided in pandoc and similar parsers. I just needed a hierarchical ast to parse, which we get from your code. Thank you very much for helping me out.

If I may ask about an unrelated feature, I saw that commonmark supports custom tag names. How would I go about changing the tag names and attributes for the nodes in the AST? For example, in the generated toc, suppose I wanted to change the tag name of the generated <ul> </ul> node to a custom <u-list> </u-list>. Is there a way to accomplish this by modifying the node information? I could not figure out whether there is a custom node type for which we can define the tag name and attributes.

MichaelHatherly commented 4 years ago

> If I may ask about an unrelated feature, I saw that commonmark supports custom tag names.

Could you point me at which implementations have custom tag support? I've not implemented that with CommonMark.jl since I didn't notice it when originally porting from commonmark.js. ul and all others are currently hard-coded into the output functions and so can't be replaced.

One thing you could do to get part-way there is to attach some class attributes to the root of the list when generating it.

{.toc}
  - first item

This attaches a CSS class toc to the outermost list in the table of contents. You can then target that class with your custom CSS and JS. That's the route I've been taking in https://github.com/MichaelHatherly/Publish.jl for customising the generic markdown elements, and it's been working pretty well so far.

(Or have I misread your question completely?)

drrajeshtalluri commented 4 years ago

I think the tags are only supported in raw html.

> 6.8 Raw HTML
> Text between < and > that looks like an HTML tag is parsed as a raw HTML tag and will be rendered in HTML without escaping. Tag and attribute names are not limited to current HTML tags, so custom tags (and even, say, DocBook tags) may be used.

Custom classes work well for targeting JS and CSS. I was just wondering whether, with support for custom nodes, we could avoid JavaScript later on by creating the custom HTML structure directly in CommonMark.

I was just thinking about how to extend the spec for new elements. Instead of predefining a CommonMark.Type for each new type, we could have a CommonMark.CustomNode type, with additional type information in the node meta field. We could have a general writer targeting HTML or LaTeX for these custom types. Just like the ability to add classes, I thought it would be cool to add tags. I do not know whether this fits in this package, since it follows the CommonMark spec. I just thought I would ask and get your thoughts.

MichaelHatherly commented 4 years ago

There is an undocumented feature of AttributeRule that could be used for something along these lines:

julia> p = enable!(Parser(), AttributeRule())
Parser(Node(CommonMark.Document))

julia> text =
       """
       {:ul-list}
         - one
         - two
         - three
       """
"{:ul-list}\n  - one\n  - two\n  - three\n"

julia> ast = p(text)
  ● one

  ● two

  ● three

julia> ast.first_child.nxt.meta
Dict{String,Any} with 1 entry:
  "element" => "ul-list"

The shorthand attribute syntax {:name} adds element=name metadata to the node. This could be hooked up to the output writers to customise the resulting node types I guess. It's relatively lightweight syntax without having to invent brand new syntaxes for each custom element type.

Not too sure though, hence why it's remained undocumented.
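As a rough illustration of what "hooking this up to the output writers" might look like, here is a self-contained sketch. Nothing in it is CommonMark.jl API: `tag_for` and `wrap_tag` are hypothetical helpers that only show how an "element" entry in a node's metadata could override a hard-coded tag name.

```julia
# Hypothetical helpers, not part of CommonMark.jl: pick the tag name from
# the node metadata when an "element" entry is present, else use the default.
tag_for(meta::AbstractDict, default::AbstractString) = get(meta, "element", default)

# Wrap already-rendered inner HTML in the chosen tag.
function wrap_tag(meta::AbstractDict, default::AbstractString, inner::AbstractString)
    tag = tag_for(meta, default)
    return string("<", tag, ">", inner, "</", tag, ">")
end

meta = Dict{String,Any}("element" => "ul-list")
println(wrap_tag(meta, "ul", "<li>one</li>"))
```

A real writer would also need to validate or escape the tag name before emitting it, which this sketch leaves out.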

drrajeshtalluri commented 4 years ago

Thanks, this is perfect. I will try to use the AttributeRule. With regard to this issue, since you have already written the toc function to generate the table of contents, you can close it.

If possible, an example of how to use the toc could be helpful for people who want this feature.

body = html(ast)
toc_html = html(toc(ast))
content = "<head></head><body><div>$toc_html</div>$body</body>"

Thanks so much for your help!

MichaelHatherly / CommonMark.jl

Implementing a table of contents feature for CommonMark #10