MichaelHatherly / CommonMark.jl

A CommonMark-compliant Markdown parser for Julia.

`Raw/Noop` extension #4

Closed by domluna 4 years ago

domluna commented 4 years ago

Could there be an extension that doesn't alter the input at all? This is required to properly format docstrings for https://github.com/domluna/JuliaFormatter.jl/pull/231. The use case is using the markdown package to find Julia code in the docstrings, format it, and then print the output. Aside from what is formatted, nothing else should be altered.

Here's an example

        str = """
        \"""
        \\\\
        \"""
        x"""

Currently, using either Markdown or CommonMark with RawContentRule, the output is not the same as the input:

        using Markdown
        using CommonMark

        Markdown.plain(Markdown.parse(str)) == str # false

        parser = Parser()
        enable!(parser, RawContentRule())
        markdown(parser(str)) == str # false

mortenpi commented 4 years ago

Don't we inherently throw away some information when going from source to AST? So you'd need AST nodes that retain much more information, to be able to re-create the details of the original formatting?

MichaelHatherly commented 4 years ago

Unlike Markdown.plain, the intention for CommonMark.markdown is for it to be reasonably round-trippable, since it's used in the notebook output for markdown cells and so does need to handle escapes correctly where it can. The lack of backslash escaping is a bug here.

As to source-identical output from markdown, that's not a goal. It'll always just aim for a canonical form of things, such as always using atx headings whether the source wrote atx or setext headings. There is some sense of source position in the .sourcepos field of Node, but that's used more to handle certain parsing cases than to provide a CST-style tree.
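The canonical-form point can be checked with a short sketch. It uses only the calls already shown in this thread (Parser, calling the parser on a string, and markdown returning a String) and assumes the heading behaviour is as described in the comment above; the variable names are illustrative.

```julia
using CommonMark

parser = Parser()

# The same heading written in setext style (underlined) and in atx style (`#` prefix).
setext_source = "Title\n=====\n"
atx_source = "# Title\n"

# Write both ASTs back out as markdown text.
setext_output = markdown(parser(setext_source))
atx_output = markdown(parser(atx_source))

# If output is canonicalised as described, the original heading style is not recoverable.
println(setext_output == atx_output)
```

If the two outputs compare equal, the writer has discarded which heading style the source used, which is why source-identical output cannot be expected from this path.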

With regard to conditional formatting of code blocks: I'm pretty sure that could be done using a small extension that intercepts the right CodeBlock nodes after parsing. You won't be able to avoid parsing the rest of the syntax, though; that would likely interfere with indented code blocks in weird ways and not be round-trippable. If you do want JuliaFormatter to be more aggressive in what it formats, and also handle the markdown rather than just the code around it, then this would be an option.

domluna commented 4 years ago

> With regards to conditional formatting of code blocks: I'm pretty sure that could be done using a small extension that intercepts the right CodeBlock nodes after parsing. You won't be able to avoid parsing the rest of the syntax though that would likely interfere with indented code blocks in weird ways and not be roundtripable. If you do want JuliaFormatter to be more aggressive in what it formats and also handle the markdown rather than just the code around it then this would be an option.

This could be a solid option. I know extra trailing newlines are removed from the markdown (at least at the end of the file). Are there more invasive changes that occur?

MichaelHatherly commented 4 years ago

> Are there more invasive changes that occur?

If there are any significant formatting changes that aren't liked then we can just adjust markdown to give nicer results. I'm not yet set on a particular style of how the markdown should be formatted. Simpler is better though. I'm pretty sure there's currently a lot of trailing whitespace that gets left behind when writing to markdown, and possibly translation of HTML entities needs to be done.

> This could be a solid option.

A very simple POC that can use any package that has a String -> String formatting method:

julia> using CommonMark, DocumentFormat, JuliaFormatter

julia> struct FmtRule
           λ::Function
       end;

julia> CommonMark.block_modifier(rule::FmtRule) = CommonMark.Rule(1) do parser, block
           if block.t isa CommonMark.CodeBlock && block.t.info == "julia"
               block.literal = rule.λ(block.literal)
           end
       end;

julia> p_1 = enable!(Parser(), FmtRule(JuliaFormatter.format_text));

julia> p_2 = enable!(Parser(), FmtRule(DocumentFormat.format));

julia> text =
       """
       ```julia
       struct Foo{A, B}
        a::A
         b::B
       end
       ```
       ```
       not formatted
       ```
       """;

julia> markdown(stdout, p_1(text))
```julia
struct Foo{A,B}
    a::A
    b::B
end
```

```
not formatted
```

julia> markdown(stdout, p_2(text))
```julia
struct Foo{A,B}
    a::A
    b::B
end
```

```
not formatted
```

Those internal CommonMark methods and structs aren't settled yet, but the plan is to have a public API for 3rd-party extensions like these by the time 1.0 is released.

bramtayl commented 4 years ago

@MichaelHatherly I'm a bit confused by the code above. Can you explain a bit what the different parts do? Or are there docs somewhere?

MichaelHatherly commented 4 years ago

> Or are there docs somewhere?

Not much in the way of internal docs at the moment.

julia> struct FmtRule
           λ::Function
       end;

defines a new "rule" that can then be enabled on a particular Parser instance using enable!, as in

p_1 = enable!(Parser(), FmtRule(JuliaFormatter.format_text));

It stores a reference to a formatting function that will be used to format code blocks.

julia> CommonMark.block_modifier(rule::FmtRule) = CommonMark.Rule(1) do parser, block
           if block.t isa CommonMark.CodeBlock && block.t.info == "julia"
               block.literal = rule.λ(block.literal)
           end
       end;

defines an "action" associated with FmtRule that modifies block-level elements in a parsed AST, with priority 1. Actions are run in order of priority (low to high). This particular "action" formats the .literal content of CodeBlocks when their .info is julia.

That's about it for this particular one. There are a number of others in src/extensions/ that make a reasonably gentle introduction to how the package works. block_modifier (used above), block_rule, inline_modifier, and inline_rule are the building blocks upon which all the parsing machinery works.
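For readers who want to picture the priority ordering without the package, here is a toy model in plain Julia. Every name in it (ToyRule, ToyBlock, apply_rules!) is hypothetical and it is not how CommonMark.jl is implemented; it only mirrors the idea that each block is passed through the registered actions from lowest to highest priority, and that an action may rewrite the block's literal text.

```julia
# Toy model only: these types are hypothetical, NOT CommonMark.jl internals.
struct ToyRule
    priority::Int
    action::Function
end

mutable struct ToyBlock
    kind::Symbol     # e.g. :code or :paragraph
    info::String     # info string of a code block, "" otherwise
    literal::String  # raw text content
end

# Pass each block through every rule, lowest priority number first.
function apply_rules!(blocks, rules)
    ordered = sort(rules; by = r -> r.priority)
    for block in blocks, rule in ordered
        rule.action(block)
    end
    return blocks
end

order = Symbol[]  # records which action ran, in sequence

# Stand-in for a real String -> String formatter: trims surrounding whitespace.
format_rule = ToyRule(1, function (block)
    push!(order, :format)
    if block.kind === :code && block.info == "julia"
        block.literal = String(strip(block.literal))
    end
end)

# A second action with a higher priority number, so it runs after the formatter.
audit_rule = ToyRule(2, block -> push!(order, :audit))

blocks = [
    ToyBlock(:code, "julia", "  x = 1  \n"),
    ToyBlock(:code, "", "  untouched  \n"),
]

# Rules are deliberately registered out of order; sorting fixes the sequence.
apply_rules!(blocks, [audit_rule, format_rule])

println(blocks[1].literal)  # formatted: info is "julia"
println(order[1:2])         # the priority 1 action ran before the priority 2 action
```

The second block keeps its literal unchanged because its info string is empty, which matches the "not formatted" fence in the POC output earlier in the thread.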

Hope that's helpful, let me know if you need any other clarification.