Feature: Support for Pandoc YAML front matter

kenmuse commented 7 years ago

The application currently supports YAML front matter using the fenced block characters --- at the top of the file. Pandoc's flavor of markdown allows the block to also start or end with .... In that flavor, either set of characters can be used at the start or end of the block. As a result, documents using Pandoc markdown frequently start the YAML block with --- and end it with ... For example:

---
title: document
author: someuser
...
# Start the document here

Since the application supports Pandoc rendering with the optional add-in, have you considered supporting the Pandoc flavored YAML headers? If so, I'd be glad to submit a PR to enable this support, including unit tests and a MarkDig extension for processing the Pandoc flavored headers.

RickStrahl commented 7 years ago

Nice.

I'd suggest you submit the Markdig extension as a PR to Markdig since this is really an addition that would be useful for Markdig to support and could perhaps be added to the existing Yaml parser extension to avoid having to add another extension into the stack.

RickStrahl commented 7 years ago

Took a look and there are a couple of concerns. The code works great for the main part of the parser.

But there are a couple of concerns:

The Weblog Addin also needs to support this and I fixed that part with some extra parsing of the YAML block.

But I'm not sure I understand when the --- to ... applies. When running with the Pandoc parser in MM Pandoc it parses both the --- and ... endings.

I guess I'm not sure why you'd want to use the ... syntax when --- works perfectly fine in the majority of parsers including Pandoc. Is there a specific Pandoc mode you are using (I think MM uses GFM Markdown Pandoc setting if I recall right).

RickStrahl commented 7 years ago

So for clarification I've looked at my Pandoc parser addin.

It runs with these options:

"-f markdown_github -s \"{fileIn}\" -o \"{fileOut}\""

(can be changed in the PandocAddin configuration).

I also removed the custom StripFrontMatter code that I have added (which basically pre-parses the Markdown text and could also handle the front matter stripping).

Here's what I see:

This is with ... which should parse as you mention.

But it also parses with "---":

Notice also that Pandoc takes it upon itself to pick out the title and render an extra H1 tag for the header (quite annoying actually). I added the custom FrontMatter stripping to avoid this potential dual header rendering issue with Front Matter so it's consistent with Markdig.

Not sure what the best way for through this is, but I think the Markdig behavior seems to me the more sensible approach. But then again, that's why you get to choose the Markdown Parser :-) so it might be best for me to remove the custom FM pre-parsing code I have for Pandoc, which still leaves the parser with support for both ... and --- endings for the YAML block.

Which brings us back to why do we need to even support the ... syntax since --- seems to be supported much more commonly.

kenmuse commented 7 years ago

Personally, also not a big fan of the mixed fence in Pandoc. Unfortunately, docs targeting Pandoc or based on samples from Pandoc's site frequently follow that pattern. Pandoc supports dot or dash starts and dot or dash ends to the FM; they seemed to standardize in their own documentation on leading with dashes and ending with dots.

I think the Jekyll approach adopted in MarkDig (triple dash fencing) is more consistent. That's part of why I didn't carry the changes to the Weblog Plugin. The issue really occurs when working with existing Pandoc markdown files. Adding support for the additional variant in MM made it easy to view/edit both formats in a consistent manner. The fact that the front matter is removed while in the editor was never an issue for me. That actually made it easier to work with IMO.

Completely agree about the H1 handling in Pandoc. It doesn't give you a way to work around that in the current version of Pandoc (esp with PDF and Word files). I liked the consistency that stripping the YAML provided for previewing in the various markdown environments. I proposed this particular enhancement since it supported that consistency when working from existing Pandoc markdown documents.

I agree with your point w/r/t submitting the extension piece to MarkDig as a part of their package and will look into that.

RickStrahl commented 7 years ago

If you can see if Alexandre would go for this because that would be the least intrusive way to handle this IMHO, especially if it goes directly into the original YAML Parser extension. It's probably just a few lines in the original formatter.

In the meantime I added the formatting for the Weblog publishing parser.

If Markdig will go for the extensions I'd like to hold off, otherwise I'll merge this in as is. We'll still need the editor extensions for spellcheck.

Thanks for this...

kenmuse commented 7 years ago

This is being tracked with Alexandre as lunet-io/markdig#138. In reviewing the specs, we noticed that the Pandoc spec is more explicit: dash start, dot or dash end. I'll resubmit a corrected PR once things have progressed with Markdig.

RickStrahl commented 7 years ago

Thanks Ken for your work on this. I thought --- start and --- or ... end was what we were parsing for anyway - I missed the ... start altogether (and didn't implement that in the Weblog publishing bits). Anyway - looks like your code changes in the PR that you made here could be the replacement in the base Yaml Parser to support both --- and ... trailing delims.

kenmuse commented 7 years ago

Submitted the patch to the base YAML parser with Alexandre. The new package is already available via NuGet as part of the 0.13.3 release. It now supports both trailing delimiters and provides parity with Pandoc's behaviors. I'll try to update the PR tonight.

RickStrahl commented 7 years ago

That's awesome Ken (and @xoofx).

kenmuse commented 7 years ago

The changes are almost ready. Found some related Regex bugs that I'm cleaning up. Question about Line 133 of YamlExtractionRegExTest. The sample blog post is leading with a new line instead of the ---. That would make the parsing fail if it wasn't using the more permissive regex included in the test. Should I correct that test and tweak the code there and in MarkdownParserBase to share the readonly Regex from MarkdownUtilities?

RickStrahl commented 7 years ago

Yeah I think the test needs:

Regex YamlExtractionRegex = new Regex("^---[\n|\r\n].*?^---[\n|\r\n]", RegexOptions.Multiline | RegexOptions.Singleline);

and fixing the constant string to start with the YAML header to work. Go ahead and fix that. That test probably shouldn't be there anyway - it was just to see if the expressions would work.

In fact the WebLogPostMetaData class uses that expression so that should be fine. Probably should make that public:

static readonly Regex YamlExtractionRegex = new Regex("^---[\n,\r\n].*?^(---[\n,\r\n]|...[\n,\r\n])", RegexOptions.Singleline | RegexOptions.Multiline);

to:

public static readonly Regex YamlExtractionRegex = new Regex("^---[\n,\r\n].*?^(---[\n,\r\n]|...[\n,\r\n])", RegexOptions.Singleline | RegexOptions.Multiline);

in WebLogPostMetaData.cs. Probably should just make that public and use that in the test.

kenmuse commented 7 years ago

Should that class and MarkdownParserBase just rely on the front matter Regex defined in MarkdownUtilities to limit the number of places that each have their own definition?

RickStrahl commented 7 years ago

Yes... that sounds good to me. This code kind of evolved over various stretches of time so there is quite a bit of duplication. I need to clean that up eventually.

kenmuse commented 7 years ago

Applied the changes. Also updated the regex to cover the complete set of rules (including space handling) and to remove a bug that let it match commas as a line-end.

RickStrahl commented 7 years ago

Thanks for your help. Can you send me your contact info for a free license.

kenmuse commented 7 years ago

Emailed you the requested info

RickStrahl / MarkdownMonster

Feature: Support for Pandoc YAML front matter #223