fletcher / MultiMarkdown-6

Lightweight markup processor to produce HTML, LaTeX, and more.
https://fletcher.github.io/MultiMarkdown-6/
Other
632 stars 91 forks source link

CriticMarkup and blank lines #41

Open fletcher opened 7 years ago

fletcher commented 7 years ago

CriticMarkup is a powerful concept, but does require a bit of thought.

  1. The syntax is orthogonal to that of HTML, and therefore Markdown/MultiMarkdown. It should really be thought of as a separate "layer" of markup -- one that needs to be processed before parsing the underlying MultiMarkdown.

  2. This leads to additional complexity, as many users would prefer to use CriticMarkup as an additional feature of MultiMarkdown, that allows displaying the added/deleted/etc. text in the HTML output(using <ins>, <del>, etc.). In this instance, CriticMarkup needs to be carefully wrapped around, or inside, the underlying Markdown -- not "crossing the streams". For example:

    This is {++**good**++}.
    
    This is **{++not**++}.
  3. Similarly, CriticMarkup becomes difficult to process when used around newlines. For example, using CriticMarkup to describe the act of splitting an existing paragraph into two new paragraphs is tricky, as it requires that MMD "understands" that the opening {++ in the first paragraph connects to the closing ++} in the second paragraph, which violates the normal rules of Markdown:

    This is one para.{++
    
    ++} That is now two.
  4. An alternative could be to require that such CriticMarkup start and end with the opening/closing tags on their own separate lines. This could allow the MMD parser to recognize a CriticMarkup "chunk" and treat it as it's own block:

    This is one para.
    {++
    
    ++}
    That is now two.
  5. Another alternative is to create a separate CriticMarkup parsing step that finds all CriticMarkup and handles accepting or rejecting it, prior to the regular MMD parsing. This would be similar to what happened in MMD version 5.

For now, CriticMarkup doesn't play well "across" paragraphs in MMD. I will consider ways of addressing this in the future, and welcome input!

fletcher commented 7 years ago

A problem with the idea in item 4 is that it would make nesting CriticMarkup blocks really challenging from a complexity/performance standpoint. It might be possible to handle this recursively, but that's not ideal either. I'll keep thinking about it.

mn4367 commented 7 years ago

Splitting paragraphs indeed seems to be tricky. If it's too complex I'd vote for just leaving it out since there is a workaround:

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. 

{>>Break in this paragraph inserted by John Doe.<<}

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi
ut aliquip ex ea commodo consequat.

I see that some cases are handled much better now than in previous versions and comments are also kept in the output, very nice thanks! I'd wish that the following cases would be handled differently:

1:

{--

Lorem ipsum...

--}

2:

{--
Lorem ipsum...

Duis aute...
--}

3:

{--

Lorem ipsum...

Duis aute...

--}

In my opinion the output should be:

1:

<del>

<p>Lorem ipsum...</p>

</del>

2:

<del>
<p>Lorem ipsum...</p>

<p>Duis aute...</p>
</del>

3:

<del>

<p>Lorem ipsum...</p>

<p>Duis aute...</p>

</del>

But (with the latest develop sources from today) it is:

1:

<p>{--</p>

<p>Lorem ipsum&#8230;</p>

<p>--}</p>

2:

<p>{--
Lorem ipsum&#8230;</p>

<p>Duis aute&#8230;
--}</p>

3:

<p>{--</p>

<p>Lorem ipsum&#8230;</p>

<p>Duis aute&#8230;</p>

<p>--}</p>

Case 2 is of course debatable since it violates a Markdown rule but when using CriticMarkup it would be a nice addition. Case 1 can easily be fixed by the user by removing the empty lines but if case 3 won't be supported the user had to mark every single paragraph which is quite tedious.

Some time ago we implemented our own preprocessor for MMD in order to keep CriticMarkup comments and we implemented it with a simple rule: If any of the CriticMarkup elements is the only content in this line (ignoring leading and trailing whitespace) then we added one blank line immediately before and after it and then replaced the element with the appropriate HTML tag. This worked very well and enabled us to insert, delete and comment arbitrary sections in the document like in the 3 cases above.

mn4367 commented 7 years ago

CriticMarkup with lists are stil tricky. Unfortunately this one doesn't work, but I can understand that it's maybe too difficult, although it would be very nice:

- one
{++- inserted++}
- two

Fortunately this works:

- one
- {++inserted++}
- three

Staying with the second way of using CriticMark in lists this is the bigger problem:

- one
- {++inserted

    with some
    additional paragraph

    and another one++}
- two

The remedy is

- one
- {++inserted++}

    {++with some
    additional paragraph++}

    {++and another
    one++}
- two

but is difficult to write.

Please don't take this as a criticism but since you asked for help I thought this case was worth mentioning it.

mn4367 commented 7 years ago

I digged a little deeper and I tried this snippet

- one
- <ins>inserted

    with some
    additional paragraph

    and another
    one</ins>
- two

and hoped that the output would be

<ul>
<li><p>one</p></li>
<li><ins><p>inserted</p>

<p>with some
additional paragraph</p>

<p>and another
one</p></ins></li>
<li><p>two</p></li>
</ul>

(which is valid HTML) but instead it looked like this,

<ul>
<li><p>one</p></li>
<li><p><ins>inserted</p>

<p>with some
additional paragraph</p>

<p>and another
one</ins></p></li>
<li><p>two</p></li>
</ul>

which has wrong nesting.

It seems that when MMD processes blocks it sometimes takes everything (text from a Markdown perspective) which is part of the block and then puts it into the generated surrounding HTML tag. If HTML is part of the block the behavior seems to depend on the context but for CriticMarkup it always seems to work like this:

{++Lorem ipsum++}

becomes

<p>{++Lorem ipsum++}</p>

and finally

<p><ins>Lorem ipsum</ins></p>

. This follows the Markdown rules but in the case of CriticMarkup in my opinion it's not the best solution. If the order would be changed to

{++Lorem ipsum++}

{++<p>Lorem ipsum</p>++}

<ins><p>Lorem ipsum</p></ins>

this would solve some if not many problems (unfortunately not <ins><li>- inserted</li></ins> since this isn't valid HTML).

So the rule could be: if a block element is completely surrounded by CriticMarkup then process the block as if the CriticMarkup elements wouldn't be there and afterwards surround it again with <ins> and the like. I checked with the W3C validator and both <ins><p>Lorem ipsum</p></ins> and <p><ins>Lorem ipsum</ins></p> are valid.

If one would translate CriticMarkup into natural language I think this would also be a better, more natural solution. Lets say you have:

{++Lorem++}

If someone would describe this he/she would probably say "Insert a paragraph with the content 'Lorem'" (<ins><p>Lorem ipsum</p></ins>) and not "there is a paragraph into which the content 'Lorem'" has been inserted (<p><ins>Lorem ipsum</ins></p>). This seems to be nitpicking but it gets clearer with this example:

{++<div>Lorem</div>++}

In my opinion this says "add a div with the content 'lorem'" and therefore should lead to <ins><div>Lorem</div></ins> but it gives <p><ins><div>Test</div></ins></p>.

If you look at the W3C examples you'll find that they follow similar rules.

The hard part is probably that you'd have to keep track of the position of the CriticMarkup elements to be able to insert them before and after the block after Markdown processing of the block itself is done.

fletcher commented 7 years ago

@mn4367 -- thanks for the thoughts.

A while back, I spelled out a few thoughts of my own:

http://fletcher.github.io/MultiMarkdown-5/criticmarkup.html#myphilosophyoncriticmarkup

I'll add a few more in the hopes of sparking ideas (mine or others')...

MultiMarkdown 6 does offer some additional support for CM that occurs entirely within a Markdown block (e.g. paragraph, list item, etc.) Because the CM is more tightly integrated with Markdown, however, it doesn't "understand" CM that spans across multiple blocks (as per the examples you offer.) -- MMD 5 handled CM in a separate "first pass processing"

Because CM is not truly designed to be 100% compatible with the Markdown->HTML workflow, it will either be difficult or impossible to accurately reflect the semantic intent with valid HTML. When you thrown in the need to support other formats (LaTeX, ODF, etc.), it becomes even less likely a perfect solution will be developed.

Even something as simple as inserting a blank line inside of a table would create a slew of difficulties representing the semantic meaning.

I think CM is a fantastic idea, with a reasonably good implementation. As per my thoughts at the link above, if one sticks with the "true nature" of CM as being to allow commentary on the raw source of a document, it works perfectly. When producing the output HTML (or whatever), you "accept all changes" and get the appropriate finished output.

The difficulty only arises when you try to translate the idea of "raw source commentary" into "finished product with commentary still in place." To the best of my analysis over the years (having helped in a few small ways with the initial development of CriticMarkup, and having developed the only editor I know of with detailed built-in CM support), this is either not perfectly possible or will require a rather complicated tool to enable. I understand the allure of being able to do this, but that doesn't mean it is possible/easy.

My goal with MMD 6 is to try and find a sweet spot in the middle. I'm pleased with the "in-block" performance of CM support, and would like to find a means of supporting blank lines since that would be a relatively common use case (splitting or joining paragraphs). The difficulty here is that parsing across blocks requires a parser that looks ahead and backtracks, which can dramatically reduce performance.

For example:

{++

This could be

Multiple paragraphs

inside a CM addition

...

or not.

Until we arrived at the end of the document, we don't know whether a corresponding "++}" exists or not. If not, we would have to back all the way up to the "{++" and start parsing again understanding that we are not inside CM anymore.

There seem to be four ways to handle this:

  1. CM is a separate pre-processing step (like MMD 5). To me, this is actually the most "correct", since the CM syntax is orthogonal to MD syntax when you start considering edge cases. If you accept/reject changes, then the resulting file is valid Markdown/MultiMarkdown and can be properly parsed. The side effect is that you can't produce a "rough-draft final output with markup" Frankenstein hybrid of HTML with CM-based <ins> and <del> tags scattered all over. But you know that the final document will be valid.

  2. Use the step 1 approach twice -- once accepting the changes and once rejecting them. This gives you two valid HTML documents, but you now have to combine the two. Having done this, it's not really any easier than trying to handle it all at once, but at least you're dealing entirely with HTML (or whatever). You still have to deal with CM that might intersect HTML entities, making it difficult to create a valid combined HTML document.

  3. Ignore CM that crosses block boundaries. This is what MMD 6 currently does. I think this can be improved to expand support to certain "larger" constructs related to blank lines, but it will not cover every possible situation precisely (e.g. list items, tables, etc.)

  4. Spend the time to develop a truly "smart" algorithm that handles various edge cases. I don't think 100% coverage is possible with precise representation of the semantic meaning of all CM changes, but one could get close. I'm not sure "the juice is worth the squeeze" on this one though -- I suspect it would be a really complex tool that has to account for a large range of edge cases, and the performance is likely to be an issue when used for long, complex documents. I think such a tool would be separate from MMD itself for these reasons.

mn4367 commented 7 years ago

I've put together some ideas which I think could prevent a difficult forward/backward tracking solution but it needs a bit more handling on the block level. Considering your example:

{++

This could be

Multiple paragraphs

inside a CM addition

...

or not.

If we stick with the syntax of a CM element being the only markup in a line one could do the following: Remember the position of that element and that it is not closed but don't write anything to the output. If MMD finds a closing CM element in the consecutive parsing process with the same syntactic characteristics (it's the only markup in a line) then its clear where to write the opening and closing tag. If no such element is found then just write the closing tag at the end of the document. I know that I'm assuming here some kind of internal tree representation of the parsed Markdown elements (is there somehing like this?) and also that I'm not considering other output formats. If somebody is doing silly things like

{++

{++

...

or valid things like

{++

{>>

...

you'd need some sort of stack/list to keep the different opening positions and CM types but the principle is the same. I haven't tried it myself in code but I think it could be quite resilient against nesting problems.

With regard to splitting paragraphs I still think it could help if the strategy of treating CM elements as part of the block is changed to putting them outside such blocks but only if possible (like I proposed above, <ins><p>Lorem ipsum</p></ins> vs. <p><ins>Lorem ipsum</ins></p>).

This is possible in these five cases:

1:
{++Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua...++}
2:
{++Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua...
3:
++}Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua...
4:
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua...++}
5:
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua...{++

It's not possible in these two cases:

A:
{++Lorem ipsum dolor sit amet, consectetur adipisicing++} elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua...
B:
Ut enim ad minim veniam, quis nostrud {++exercitation ullamco laboris nisi 
ut aliquip ex ea commodo...++}

The parsing could be done following these rules:

  1. If the block is of category A or B handle it like it is done now
  2. If the block is of category 1 to 5 shift the CM element outside the block

Other CM elements could be treated with a stack/list like described above. Cases 1 to 5 can be considered as a special case/extension of CM elements which are isolated on a line, although they seem to be part of a block. Such CM elements are qualified, if they meet these conditions:

Effectively we are talking about CM elements with two different scopes: document scope and block scope.

Applying the procedure/rules to

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua...{++

++}Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi 
ut aliquip ex ea commodo...

should give the correct

<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua...<p><ins>

</ins><p>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi 
ut aliquip ex ea commodo...</p>

With regard to

- one
- {++inserted

    with some
    additional paragraph

    and another one++}
- two

the same procedure/rules could be applied but in this case it's not a document scope you'd need but instead a surrounding block scope (the list itself). So an erroneous

- one
- {++inserted

    with some
    additional paragraph

    and another one
- two

would just be auto-closed at the end of the list item. Handling and detection is the same as with CM elements outside lists.

I've checked with the W3C validator, the following things are not allowed, so they don't need to be considered:

- one
{++- inserted++}
- two

(ins and the like are not allowed inside ul/ol, only inside li)

|------|------|
| cell | cell |
{++| cell | cell |++}
| cell | cell |

(ins and the like are not allowed inside tables, only in td)

Other things:

...
Lorem

++}
...

Closing CM without a starting element is a non-fixable user error and could just be written verbatim to the output.

More:

Lorem ipsum dolor sit amet, {++consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua...

or

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et++} dolore magna aliqua...

are interesting. They could easily be auto-fixed to

Lorem ipsum dolor sit amet, {++consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua...++}
{++Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et++} dolore magna aliqua...

and therefore could be considered a nice feature but I'm not sure if this out of the scope of MMD (it is, isn't it? đŸ˜€ ).

fletcher commented 7 years ago

You'll probably want to take a look at the code before heading too deep down the rabbit hole of your suggestions.

A key design criteria for MMD-6 was performance. In looking around at parsers, it seems that the best tradeoff between simplicity and performance I could find was LALR(1) parsers. The lemon parser generator fit my needs -- it's lightweight and portable, the parser syntax is simple and designed to minimize risks of accidental errors, and the generated parser is high performance. There may be other parser generators that create faster parsers, but I suspect the incremental difference will be small.

LALR(1) parsers do not allow backtracking. This required me to do a couple of things:

  1. Nested lists are handled by recursive parsing. IMO, this makes for more intelligible code, so I was fine with that.

  2. Fenced code blocks run until the end of the document if there is no closing fence. You'll notice this is the same behavior as CommonMark. I don't know that CM's parser is LALR(1) (I believe it's hand-written, so may or may not fall cleanly into a specific classification).

I originally started by creating my own hand-written parser, but decided that using lemon would have a minimal performance hit (if any) and would make the overall code much easier to understand and maintain moving forward. parser.y uses an easy to understand structure. It's easy to add new features. It does require one to be thoughtful about syntax design, but that's a good thing.

The other thing to consider is Markdown's overall design philosophy. There are many things wrong with Gruber's Perl implementation, but I think the overall design of the Markdown syntax is pretty good, considering its purpose:

  1. Break the document into blocks (e.g. para's, headers, lists). Some blocks can have other blocks inside them, but a block is unaffected by anything outside the block.

  2. Within a block, identify and combine various tokens to create strong, emph, links, etc. Tokens are only affected by other tokens within the same block.

Fenced code blocks were the first widely adopted extension to the Markdown syntax that broke this design philosophy. A line of backticks in one place could be affected by another such line anywhere else in the document. It's sort of like Schrödinger's cat -- you don't know what a given line represents until the fence is closed or not closed. Allowing this sort of syntax causes all sorts of problems when dealing with Markdown in more complicated ways (such as MultiMarkdown Composer). It also requires more complicated (and therefore slower) parsers.

I'm not super enthusiastic about creating another quagmire by doing the same thing for CriticMarkup that was done with fenced code blocks.... ;)

fletcher commented 7 years ago

To be complete, I should mention that when using the -a or -r options (but not both simultaneously, which is the same using neither), blank lines do not cause a problem at all with CriticMarkup. That's because CM is processed "the right way", which is as a separate pass before the MMD is handled and converted to HTML (or whatever output format is chosen). A separate parser is used for this that ignores everything except for CriticMarkup.