CommonMark/Markdown renderer (round-trip parsing/rendering)

robinst commented 9 years ago

It's useful to be able to render a parsed document back to CommonMark markdown. This is a tracker issue for issues relating to that, so far:

[x] Unable to disambiguate emphasis delimiter #10
[x] Unable to disambiguate list-item type #11

pcj commented 9 years ago

More currently unanswerable questions once the parse tree has been constructed:

[ ] Headers: was it ATX or Setext type?
[ ] Links: was it an inline, reference, or auto-link?

There are other metadata items for different types of nodes that would need to be captured to re-render a completely identical copy of the input text, things like:

[ ] The opener and closer sequences for certain items like headers, inline code, fenced code blocks (as these can be variable to some degree).
[ ] Whitespace padding between block elements (would be addressed by #1).

Finally, reference link definitions are swallowed up completely by the ParagraphParser, no AST nodes exist for them. Another strategy would be to leave those in somehow and put them in the visitor such that renderers have the option to deal with them. I'm guessing that this is not an easy issue to address since link reference defintions can exist in many block contexts (paragraph, list-item, blockquote), forcing all those to have to parse them.

[ ] Reference link definitions: define an AST node type.

PaulKlint commented 8 years ago

In the Rascal project (https://github.com/usethesource/rascal) I am using your very nice framework and have also felt the need for a round-trip back to CommonMark. Sofar, I have written an AnsiRenderer that maps back to text (and optionally to Ansi characters for highlighting). This is acceptable with a somewhat relaxed definition of round-trip but it breaks down when trying to apply this to the Tables extension. There are several issues in that case:

[ ] The general problem that all *Extension classes also need to implement AnsiRendererExtension (or whatever the round-trip renderer will be named)
[ ] More specifically, the meta-data to reconstruct the table header is problematically missing and cannot be easily approximated.

robinst commented 8 years ago

@PaulKlint: Cool!

Yeah, if you have external custom renderers for extensions, you either need to depend on the artifacts or provide additional artifacts for each extension. If it's a built-in renderer, we can just implement them for the extensions directly.

What specifically is missing from the metadata for table headers? Could you create a separate issue for that?

mattsheppard commented 7 years ago

Is there in fact an implementation of a CommonMark renderer (with whatever limitations there are as mentioned above), or is the intention only to develop one once they are overcome?

Or, perhaps, are you looking for one to be contributed?

PaulKlint commented 7 years ago

FYI: We have stopped exploring this line of development and have switched to http://asciidoctor.org (in particular the Java binding http://asciidoctor.org/docs/asciidoctorj/).

robinst commented 7 years ago

@mattsheppard: We'd welcome contributions for this, yes! Let me know if you want to work on this and I can provide some guidance. I think even a basic renderer would be nice to have already, we can address the other issues later on.

mattsheppard commented 7 years ago

Ok, thanks @robinst - maybe I ought to start by taking a more serious look at the existing renderers to see how much sense they make to me - thus far, I've only poked stuff from the outside.

Any quick pointers you have time to share would be great!

robinst commented 7 years ago

Cool. So you'd want to have something similar to TextContentRenderer, see the classes in the org.commonmark.renderer.text package. They are also setup in a way that allows extensions to contribute to rendering.

But instead of rendering a minimal plain text version (see CoreTextContentNodeRenderer), you'd want to render the original source instead.

ldrozdz commented 6 years ago

We have a partial implementation of a Markdown renderer in https://github.com/symphonyoss/messageml-utils/blob/master/src/main/java/org/symphonyoss/symphony/messageml/markdown/MarkdownRenderer.java. There is some organisation-specific code and the rendering is possibly incomplete, but I could look at adapting and contributing this for the core library.

dmlloyd commented 5 years ago

I'm looking to use CommonMark-Java on a new project, but not having round-trip parsing/rendering is a show-stopper. Any progress on this issue?

mattsheppard commented 5 years ago

I never got around to doing anything... though it’d still be handy to me so I’d be glad to try it if you developed it... or I may get to it someday :)

dmlloyd commented 5 years ago

I can't imagine it being too difficult, but the last thing I need is another 4+ hour task on my plate :) I guess there aren't any other good options. I can't afford to embed all of JRuby to get asciidoc.

seii commented 3 years ago

I spent some time over the past few months trying to get round-trip parsing/rendering working, and now have a working fork which passes all of Commonmark's test cases. It was not simple, and involved far more than a custom renderer! In order to do this I ended up needing to perform minor edits to many of the base classes. I can make almost all of it backward-compatible (if a little messy), but I need some guidance on which approach to compatibility this project prefers. What is the best way to figure that out?

EDIT: I should clarify that "roundtrip" in my fork takes the OP literally; that is, a document is converted from text to AST ("parsed") and can be converted back to the exact same text. I don't attempt to convert from HTML (which is already parsed and rendered) back to text.

robinst commented 3 years ago

Hey @seii! Nice, thanks for working on it, I know it's a tricky change. I had a look at the code in your fork, some initial comments to get this started:

It was a bit hard to follow due to code getting commented out and new one added, instead of modifying the existing code. If you modify the existing code it's much easier to understand what changes because there's a diff.
I would prefer not to make it conditional in the code (if (Parsing.IS_ROUNDTRIP)). If there's no big performance impact we should just add the additional information to the nodes by default. (DocumentRoundtripParser shouldn't be necessary, it should just be changes in DocumentParser.)
prepareContainerStartBlock and containerString: I didn't understand why it's necessary. Can you explain what problem it's solving? I'd like to have a proper solution for it instead. (I don't mind if we have to break the parser SPI for it.)
Instead of TextContentRoundtripRenderer I'd name it CommonMarkRenderer (roundtrip is one use case for that, but someone could actually create AST nodes directly and then use the renderer to produce commonmark)

seii commented 3 years ago

Hello @robinst! Sorry for the code blocks which were commented, I hadn't expected anyone to review my branch directly yet.

I have removed all commented-out sections and added comments of my own in several places. Hopefully this is a much cleaner experience.
I agree that if checks are not ideal. It is a temporary measure for now, mostly because I started "at the top" in my IDE with org.commonmark.internal.BlockContent and realized I had no way to decide which code to use. Regarding DocumentRoundtripParser, it currently begins by adding a newline to everything that parse(Reader input) returns, so it's fundamentally changing how lines are read compared to DocumentParser. I'm open to trying to integrate those two classes, but I wanted to start at a compatible place and increment instead of breaking everything. 🙂 (Also, with compatibility in mind, I have prioritized (!Parsing.IS_ROUNDTRIP) blocks first in all if statements to try and minimize performance impacts... but I don't have any benchmarks so I can't tell if it matters.)
I have added better comments to the containerString field and also to the prepareContainerStartBlock method. The short version is that container blocks were ending up double-printing their delimiters because the delimiters had been included as part of the content before. (Figuring out what to do about that was... educational!)
I have renamed the relevant rendering classes as you suggest. I agree that creating AST nodes directly would be great - I began this adaptation hoping to use the AST directly for a project of my own.

Aside from these items, I'd welcome any suggestions on how I could avoid the current if statements. Unifying the standard approach with my changes would be best, but I have done all work so far using CommonMark 0.29 tests as my validation and none of this project's extensions would have such tests because they're, y'know, not standard CommonMark. 🙂

robinst commented 3 years ago

it currently begins by adding a newline to everything that parse(Reader input) returns, so it's fundamentally changing how lines are read compared to DocumentParser

Hmm why is that?

(Also, with compatibility in mind, I have prioritized (!Parsing.IS_ROUNDTRIP) blocks first in all if statements to try and minimize performance impacts... but I don't have any benchmarks so I can't tell if it matters.)

To be clear, I don't want to merge/release this as is (with the if statements).

I'll try to do a more in-depth review and play around with the code at some point in the future, but can't promise anything.

seii commented 3 years ago

Hmm why is that?

I investigated a few ways to do roundtrip parsing, including an attempt to create my own. (That ended quickly!) Choosing to re-insert the newline which readLine() previously stripped allowed me to decide when extra newlines didn't belong in the AST (less common) instead of deciding when they needed to be added in (more common). This may just be a difference in how coding made sense to me, but once I took this approach I finished in a couple weeks what I hadn't managed in months.

To be clear, I don't want to merge/release this as is (with the if statements).

I'll try to do a more in-depth review and play around with the code at some point in the future, but can't promise anything.

That's fine, I wouldn't want to keep all the if statements either.

To be clear in return, I'm happy to look into unifying my changes in a way that HTML rendering continues working normally. As I said before, my primary worry is breaking the extensions. For the official portions of this library I can test against the CommonMark suite to make sure I haven't broken compatibility. Since the extension portions of this library are unofficial, I wouldn't be able to verify them that way. If you're open to the possibility that I may break extension compatibility, let me know.

robinst commented 3 years ago

You mean extension API compatibility? Yeah I'm fine if we have to break it in order to enable this feature. But once we have the changes I can also try to see if we can do it in a compatible way.

ericbottard commented 6 months ago

Hi! MarkdownRenderer is a nice feature (whether used in a roundtrip case or not). Do you plan to have a release that incorporates that addition soon? AFAICT 0.21.0 does not contain this new addition.

Thanks!

robinst commented 6 months ago

@ericbottard Yeah, planning the next release, just some other things I want to get in before that! Will update this issue once it's released.

robinst commented 6 months ago

This has now been released in 0.22.0: https://github.com/commonmark/commonmark-java/blob/main/CHANGELOG.md#0220---2024-03-15

While it still has limitations when it comes to round-tripping (see release notes above), I think we can close this issue now and open new ones for particular problems.

commonmark / commonmark-java

CommonMark/Markdown renderer (round-trip parsing/rendering) #12