MichaelHatherly / CommonMark.jl

A CommonMark-compliant Markdown parser for Julia.
Other
87 stars 11 forks source link

dot/period creates another node with Typography rule #9

Closed drrajeshtalluri closed 4 years ago

drrajeshtalluri commented 4 years ago

Thank you for this excellent package. I was trying to parse some markdown and look at the parser tree. I was wondering if the Typography rule is supposed to make dot be a separate node?

using CommonMark

parser = Parser()
enable!(parser, TypographyRule())

parser("# Heading. \n This is a Heading.") |> CommonMark.ast_dump
Document
  Heading
    Text
      "Heading"
    Text
      "."
  Paragraph
    Text
      "This is a Heading"
    Text
      "."

I could not figure out the exact change needed however when checking for ellipsis I think you are checking for one or more dots. Maybe only append_child when it matches 3 dots in the function.

function parse_ellipsis(parser::InlineParser, block::Node)
    m = consume(parser, match(r"^\.+", parser))
    append_child(block, text(length(m.match) === 3 ? "\u2026" : m.match))
    return true
end

Thank You!

MichaelHatherly commented 4 years ago

Thank you for this excellent package.

You're welcome!

I was wondering if the Typography rule is supposed to make dot be a separate node?

We could run some concatenation code when appending nodes if the last added is the same as the one to be added, such as this case with consecutive Node(Text())s, might reduce AST size a bit. It shouldn't effect the actual document structure though, unless you've got a specific case in mind where it does.

What this does reveal though is a bug in handling of more than 3 periods that doesn't match what the reference parser produces:

julia> CommonMark.ast_dump(p("# heading...."))
Document
  Heading
    Text
      "heading"
    Text
      "...."

julia> CommonMark.ast_dump(p("# heading..."))
Document
  Heading
    Text
      "heading"
    Text
      "…"

That first one, with .... should greedily consume the first 3 periods. The fix should be reasonably simple:

function parse_ellipsis(parser::InlineParser, block::Node)
    m = consume(parser, match(r"^\.{3}", parser))
    return m === nothing ? false : (append_child(block, text("\u2026")); true)
end
julia> CommonMark.ast_dump(p("# heading...."))
Document
  Heading
    Text
      "heading"
    Text
      "…"
    Text
      "."

julia> CommonMark.ast_dump(p("# heading..."))
Document
  Heading
    Text
      "heading"
    Text
      "…"

I'll push the fix later this evening unless you'd like to make a PR?

drrajeshtalluri commented 4 years ago

Thank you! Please go ahead with your fix. I am trying to extract some nodes to process them with different rules and add additional nodes after them when I saw these extra nodes and opened the issue. I do not see a situation where this would affect the parsing except the size of the AST. Concatenating similar adjacent nodes sounds like a great idea to clean up the AST after modification and addition of nodes that were added later on. I will try that. Thanks for the feedback!