MichaelHatherly / CommonMark.jl

A CommonMark-compliant Markdown parser for Julia.
Other
84 stars 11 forks source link

Keep reference links as references #40

Open mortenpi opened 2 years ago

mortenpi commented 2 years ago

From what I gather from

https://github.com/MichaelHatherly/CommonMark.jl/blob/287e39ae128e80a6647d2d9762bf22040adffea1/src/parsers/inlines/links.jl#L158-L169

is that link labels get expanded to full links during parsing. Would it be possible / reasonable to somehow have a parser option that would keep the link labels and link reference definitions intact in the AST?

The use case I have in mind is to write a small script to automatically verify and possibly update such reference links and their definitions (specifically in the Documenter CHANGELOG).

MichaelHatherly commented 2 years ago

Yes, that's pretty much how it works. It's definitely something that should really be fixed since the expectation is that markdown output from the AST is roundtripable, technically true by replacing the reference links with inline links as it currently is, but would be much nicer if reference links remained as is in the output for markdown. I think it was likely a carryover from the python library that this was originally based off of.

Would it be possible / reasonable to somehow have a parser option that would keep the link labels and link reference definitions intact in the AST?

I'd prefer if it was just the only behaviour rather than adding in options. I wouldn't class the change as breaking since I don't think it would be very likely for users to actually be depending on the current behaviour.

mortenpi commented 2 years ago

I am completely behind having this be the default behaviour. I think it would be more intuitive that way.

I guess for an implementation, there would need to be a new AbstractBlock for the link reference definitions. And then either Link would need to be expanded so that it could represent link labels (apparently, there are three different kinds: [foo][bar], [foo][] and [foo]) or there could be a new AbstractInline for this.

On a slight tangent: how roundtrippable do you intend it to be? For example, I noticed that lists have some default indentation and use -, so lists using * do not get written back out the same way. Would you also imagine that ideally that type of formatting meta information be stored in the AST so that it could be replicated?

MichaelHatherly commented 2 years ago

I guess for an implementation, there would need to be a new AbstractBlock for the link reference definitions. And then either Link would need to be expanded so that it could represent link labels (apparently, there are three different kinds: [foo][bar], [foo][] and [foo]) or there could be a new AbstractInline for this.

Probably a new block type and a new inline type would be a reasonable way to go so we can avoid manually branching in the show methods, which may be slightly cleaner implementation-wise. (It'll probably need some iterations to work out the nicest approach.)

On a slight tangent: how roundtrippable do you intend it to be? For example, I noticed that lists have some default indentation and use -, so lists using * do not get written back out the same way. Would you also imagine that ideally that type of formatting meta information be stored in the AST so that it could be replicated?

Roundtrippable in the sense that the markdown->markdown pipeline acts as a strict code formatter without too many (or any user options, gofmt comes to mind), so indentation gets normalised, math gets written with backticks, lists use a standard char, etc.

mortenpi commented 2 years ago

I have been looking at this a little bit, and I noticed that a reference link without a valid definition for the link label gets interpreted as a simple string. E.g.

[foo][bar]

would not become a Link node, because [bar]: ... is missing.

Now, as far as I can tell, this is completely in line with the spec (it doesn't seem to say it explicitly, but it can be inferred from the examples with problematic link labels; and the spec does say that a reference link needs to have a valid label). But it is a bit of a problem for my use case as I actually want to find such links. My hope is that I could just run the parser and then loop over the AST, finding all the (reference) links.

I can think of two options:

  1. Allow for an option to parse reference links even if they do not have a valid labels. This, of course, goes against the spec and introduces an option to the parser, both of which would be good to avoid.

  2. Allow the user to hook into the parser to capture the invalid links. I think allowing the user to somehow access the else branch of this if should be sufficient, e.g. with a callback function.

    https://github.com/MichaelHatherly/CommonMark.jl/blob/287e39ae128e80a6647d2d9762bf22040adffea1/src/parsers/inlines/links.jl#L161-L164

(2) seems better, as that doesn't touch the parsing behavior. And for my use case, it should also be sufficient: after catching all the invalid links in one parsing round, I could update the .refmap (not sure if it gets reset though) and then re-run the parser.

MichaelHatherly commented 2 years ago

Yeah, 2 looks like a better option perhaps, might be nice to have a general mechanism for hooking into different places during parsing for this kind of stuff. Any idea whether there's other MD parsers that allow these kind of "hooks"?

mortenpi commented 2 years ago

I was thinking about how to properly implement finding these invalid reference links. My original implementation was to add another field to InlineParser that would contain the fallback, which the user could set with p.inline_parser.reflink_callback = .... However, that feels a little ugly.

So instead I was thinking that maybe customizing LinkRule (and ImageRule too then I guess) would be a better option. In fact, I think it would even be possible to implement this as an external user-defined rule, by just copying and modifying the functions in links.jl. However, I am a little uncertain as to how stable the internal parsing logic is, so I am not sure I would want to copy the relatively long parsing code out of CommonMark for this, and so modifying things here might still be preferred.

With all that in mind:

  1. I am not sure if there's precedent for this, but would it be fine for the rule structs like LinkRule to have fields that customize their behaviour?
  2. If yes, then it would be necessary to access the actual rule object in the internal functions. However, I am not sure what would the best way to pass it through the Rule calls. Closures for the parser_* functions ought to work though. I also thought of finding isa LinkRule from p.rules, but the Parser object actually does not get passed (only InlineParser).
  3. I noticed that if you enable! the same rule multiple times, it just gets appended to p.rules. Should the enable! function maybe check if a particular rule is already enabled and no-op then? I am not quite sure what effect it has if you have e.g. two LinkRules enable!d, especially if they do slightly different things.

By the way, I did look through the docs of a few Markdown parsers, but I did not find any that would expose something like this.