MichaelHatherly / CommonMark.jl

A CommonMark-compliant Markdown parser for Julia.
Other
84 stars 11 forks source link

Scoping for Franklin #1

Closed tlienart closed 3 years ago

tlienart commented 4 years ago

Hello Michael,

I'm interested in slowly moving over to CommonMark.jl as it's likely to solve a bunch of sneaky issues in Franklin. You mentioned on Discourse you'd be happy to discuss inclusion of extensions that would be relevant to franklin (thanks!). At a high level the parsing works as follows:

  1. go over source page, find tokens
  2. associate tokens into blocks from an opening token to a closing token
  3. resolve the blocks recursively

The resolution of "plain text" blocks is passed to Julia's Markdown for things like links, lists, ... other blocks that are processed first by Franklin include "command blocks", "div blocks", math blocks, code blocks etc. I don't think it would make a lot of sense to use CommonMark.jl for those yet.

The first obvious step is to use CommonMark.jl for the "level 0" instead of Julia's Markdown. I had already looked into using GithubMarkdown.jl (which wraps around pandoc) for this but there is one key thing I must be able to do: be able to ignore the conversion of indented blocks. (I want users to be able to use indentation for other stuff like making their div blocks clearer). Currently I capture the result of Markdown.parse, and hack them back into a paragraph block so that they're not converted. How could I do something like it in with CommonMark.jl?

The key small function is https://github.com/tlienart/Franklin.jl/blob/772dde56872b8834bf52ec73c05b4760a1844ddb/src/converter/markdown/utils.jl#L9-L30

which, at a high level, takes the level 0 stuff ("plain markdown") and converts it to HTML. Note that there are two tricks:

  1. (key thing) indented code blocks are forced back into paragraphs
  2. (minor thing) {{...}} are converted by the Markdown module into html entities, I don't want that to happen, is that something that would happen with CommonMark.jl ?

Assuming I could do both of these things easily with CommonMark.jl, it should be mostly straightforward to make Franklin use it (and I'd be very happy about it, finally proper lists etc...!)

Thanks and apologies for the long winded message

MichaelHatherly commented 4 years ago

be able to ignore the conversion of indented blocks

The spec does bake-in a little bit with regards to indentation as far as I know, but on trying to disable indented code it doesn't appear to break much. So might be possible to achieve that fairly easily.

julia> p = CommonMark.Parser();

julia> text =
       """
           indented
       """
"    indented\n"

julia> ast = CommonMark.parse(p, text)
Node(CommonMark.Document)

julia> CommonMark.ast_dump(ast)
Document
  CodeBlock
    "indented\n"

julia> pop!(p.block_starts)
indented_code_block (generic function with 1 method)

julia> ast = CommonMark.parse(p, text)
Node(CommonMark.Document)

julia> CommonMark.ast_dump(ast)
Document
  Paragraph
    Text
      "indented"

Whether disabling it causes odd behaviour elsewhere I couldn't say for certain just yet. Would be worth trying out though.

I would absolutely love to get rid of indented code blocks completely from the syntax, but alas that's never going to be a realistic goal, it makes parsing quite a bit more complex than it needs to be. https://johnmacfarlane.net/beyond-markdown.html was good reading while putting this package together.

(minor thing) {{...}} are converted by the Markdown module into html entities

Those aren't converted. I thought for a moment that I might have implemented that wrong, but babelmark doesn't either if I'm not mistaken.

If those are the things you'd need to get going then you might have some success, hopefully!

tlienart commented 4 years ago

Thanks for the code, awesome I look forward to trying this out over the weekend :)

tlienart commented 3 years ago

Sorry for briefly re-opening this, the trick you suggested does not work anymore but this:

CommonMark.indented_code_block(::CommonMark.Parser, ::CommonMark.Node) = 0

seems to work fine. It preserves base markdown formatting that happens in the indentation for instance:

julia> p = CommonMark.Parser();
julia> t = """
       ab **and**

           cd
           ef **gh**
               foo `bb`
           bar

       end"""
"ab **and**\n\n    cd\n    ef **gh**\n        foo `bb`\n    bar\n\nend"

julia> CommonMark.ast_dump(p(t))
Document
  Paragraph
    Text
      "ab "
    Strong
      Text
        "and"
  Paragraph
    Text
      "cd"
    SoftBreak
      ""
    Text
      "ef "
    Strong
      Text
        "gh"
    SoftBreak
      ""
    Text
      "foo "
    Code
      "bb"
    SoftBreak
      ""
    Text
      "bar"
  Paragraph
    Text
      "end"

apart from the fact that it's a dirty trick, do you see anything that could go terribly wrong from the CommonMark.jl perspective if I use this? thanks!

MichaelHatherly commented 3 years ago

Couldn't say with any certainty whether that will do anything terribly bad, worth trying it out, though parsing CM can be pretty weird at times. You can avoid the type piracy on that method (I assume you're redefining it within the REPL (and Franklin)? by doing something like the following:

struct SkipIndented end
block_rule(::SkipIndented) = Rule((p, c) -> 0, 8, "")
julia> p = enable!(disable!(Parser(), IndentedCodeBlockRule()), SkipIndented())
Parser(Node(CommonMark.Document))

This swaps out the indented block parser with your definition without the piracy. (I was surprised that it didn't actually break anything at all.)

julia> CommonMark.ast_dump(p(t))
Document
  Paragraph
    Text
      "ab "
    Strong
      Text
        "and"
  Paragraph
    Text
      "cd"
    SoftBreak
      ""
    Text
      "ef "
    Strong
      Text
        "gh"
    SoftBreak
      ""
    Text
      "foo "
    Code
      "bb"
    SoftBreak
      ""
    Text
      "bar"
  Paragraph
    Text
      "end"

This needs a tiny patch to CM, since there's a bug in disable! that this uncovered (which I'll push tomorrow):

diff --git a/src/parsers/rules.jl b/src/parsers/rules.jl
index a20a582..2626c3e 100644
--- a/src/parsers/rules.jl
+++ b/src/parsers/rules.jl
@@ -50,7 +50,7 @@ function disable!(p::AbstractParser, rules::Union{Tuple, Vector})
     empty!(p.inline_parser.inline_parsers)
     empty!(p.inline_parser.modifiers)
     filter!(f -> f ∉ rules, p.rules)
-    return enable!(p, p.rules)
+    return enable!(p, copy(p.rules))
 end
 disable!(p::AbstractParser, rule) = disable!(p, [rule])
tlienart commented 3 years ago

Cool thank you! I'll use your trick

MichaelHatherly commented 3 years ago

That fix with be tagged shortly: https://github.com/MichaelHatherly/CommonMark.jl/commit/438032d3f338166978e623f865aabb6d1127be9f.

I'll note here that I'd regard what I used to make that trick work can be regarded as public interface and won't be disappearing at random. The previous trick with push! was definitely "doing bad things".