commonmark / commonmark-java

Java library for parsing and rendering CommonMark (Markdown)
BSD 2-Clause "Simplified" License
2.27k stars 288 forks source link

Allow disabling wrapping <p> tags in one liners. #150

Closed dustindclark closed 2 days ago

dustindclark commented 5 years ago

Steps to reproduce the problem (provide example input)

Parser parser = Parser.builder().build()
HtmlRenderer renderer = HtmlRenderer.builder().build()
Node document = parser.parseReader('Here I have a test [link](http://www.google.com)')
String result = renderer.render(document)

Expected behavior:

result == "Here I have a test <a href="http://www.google.com">link</a>"

Actual behavior:

result == "<p>Here I have a test <a href="http://www.google.com">link</a></p>"

I understand that this may be a breaking change, so an option on HtmlRenderer on this would be sufficient.

robinst commented 5 years ago

Hey. Can you go into a bit more detail why you want that? I'm not sure the option is the best way to solve it. E.g., for markdown like this:

* test

There's no <p>, but a <ul>. Would you expect that to get stripped too? And if there's two paragraphs, you'd want to preserve the two <p>?

dustindclark commented 5 years ago

I guess the real question is why would I want the wrapping <p> tag when my input did not have any leading/trailing whitespace? Since I have no newlines in my input, I don't think you should assume that I want the default paragraph formatting on that text (block level, margins, etc). If I had provided a newline, then sure...a <p> tag makes sense. I think the HTML equivalent on a single line would be a <span> since a span does not have default CSS, but why wrap it all all in this case?

However, in the example you've mentioned, the text should definitely be wrapped with a <ul> because the unordered list was explicitly requested via markdown.

For backwards compatability though, I think adding a multiline vs single line flag on HtmlRenderer would be a good option.

robinst commented 5 years ago

Note that you can already achieve this with the current API. What you have to do is after parsing, change the nodes from a structure like this:

to this:

I've committed a test case that shows how to do that: https://github.com/atlassian/commonmark-java/commit/8b0fd7c73afaf1756edb6412c15a75cc423a6ba9

I'm pretty convinced that adding an option to HtmlRenderer is not the right way to solve this, because it would have to do a check when rendering a document for how many block-level nodes there are and then change how to render a paragraph later.

Maybe what we could add is support for parsing a single paragraph only (no block-level elements). But even that has a couple of edge cases depending on what you want to use it for.

dustindclark commented 5 years ago

Are you saying that libraries/apps that consume CommonMark should be aware of the HTML/DOM that is being generated and manipulate it accordingly? I couldn't disagree with that assessment more. That's extremely brittle and breaks every time HtmlRenderer changes. Also, why use markdown in the first place if I have to manipulate the DOM after the fact? I could also just use HTML to begin with.

The problem here is that CommonMark makes a bad assumption that a single line is a paragraph in the first place. If this were a new library, I'd say it's a bad design decision that you change up front by default by not adding the <p> tag for a single line. Since it is not a new library, I'm proposing the option on HtmlRenderer solely for backwards compatability.

robinst commented 5 years ago

Are you saying that libraries/apps that consume CommonMark should be aware of the HTML/DOM that is being generated and manipulate it accordingly?

They have to be aware of the AST (abstract syntax tree, node structure), yeah.

That's extremely brittle and breaks every time HtmlRenderer changes.

No, it doesn't. To be clear, what I'm proposing is that you do this:

The middle step is not dependent on how HtmlRenderer behaves. Not sure why you would think that?

We could add a utility method for this middle step, but I'm not sure about it yet. The method name would be something like unwrapParagraphIfDocumentContainsSingleParagraph(Node document)

The problem here is that CommonMark makes a bad assumption that a single line is a paragraph in the first place.

It's consistent with what the reference implementations commonmark.js and cmark do. I had a quick look and there's an open issue for the JS implementation here, but no replies yet: https://github.com/commonmark/commonmark.js/issues/152

epicstar commented 5 years ago

Note that you can already achieve this with the current API. What you have to do is after parsing, change the nodes from a structure like this:

  • document

    • paragraph

    • text

    • link

to this:

  • document

    • text
    • link

I've committed a test case that shows how to do that: 8b0fd7c

I'm pretty convinced that adding an option to HtmlRenderer is not the right way to solve this, because it would have to do a check when rendering a document for how many block-level nodes there are and then change how to render a paragraph later.

Maybe what we could add is support for parsing a single paragraph only (no block-level elements). But even that has a couple of edge cases depending on what you want to use it for.

I had the same concerns as OP, but after thinking about DOM manipulation as well as the unit test code, this was easy enough to figure out.

dustindclark commented 5 years ago

Again, consumers should not be aware of or rely on any specific HTML or structure that is generated by this library. I’m not arguing that it isn’t easy...it is incredibly easy. However, it’s not future proof and it’s brittle. It also completely defeats the purpose of using any rendering library (MD or otherwise) to have to post process and manipulate the results. Think of it like a translator library. If I call you to translate English to Spanish, should I manipulate the output or ask you for a specific dialect? It’s no different for MD to HTML.

I’m not sure how else to explain that this is a bad design decision. Coding around it is almost as painful as having to explain why.

samukce commented 4 years ago

If I call you to translate English to Spanish, should I manipulate the output or ask you for a specific dialect? It’s no different for MD to HTML

Reviving this thread. Taking your example @dustindclark , thinking about the translation context, I expect the translator does the translation without any extra step as you mean. So, bringing this example to MD to HTML, I also expect that the HTML translation been made fully wich means following some code conventions and good practices. That said, if we are asking to a lib translate for us a Markdown text to HTML, when we think in one single component in HTML is not a good practice to have not a tag defined, based on the HTML recommendations Use (always) meaningful tags

Let's assume that we implement this extension as you suggested. So, we would have something like:

  HtmlRenderer htmlRenderer = HtmlRenderer.builder().noParagraph().build();

  Document document = new Document();
  document.appendChild(parse("First Line [link](http://test.com)"));
  document.appendChild(parse("Second Line [link](http://foo.com)"));

  return htmlRenderer.render(document));

Generating an HTML such as

First Line <a href="http://test.com">link1</a>Second Line <a href="http://foo.com">link2</a>

In the end, we do have some HTML content that is not following the good practices because of this extra new configuration. If this is your needs, it seems so specialized that you can't skip some HTML conventions. if this is the case that would be where the @robinst example could be helpful.

I hope I added something to the discussion. :)

dustindclark commented 4 years ago

@samukce , first, there's absolutely nothing wrong with the generated HTML you've posted, and there's no HTML "good practice" that would indicate otherwise. It's perfectly valid syntactically, and that's exactly how I would craft HTML if I didn't want a line break/or space between links. Use (always) meaningful tags is nonsensical (all HTML tags mean something), so I don't know how to address that. Do you mean semantic tags? That's exactly my argument. The author of this library made the assumption that each line in a MD document is a paragraph, even for a single line. Semantically, if there's no newline, why would you assume that the content is a paragraph?

Second, while your example isn't really relevant because this feature request concerns one-liners, I don't think the example is a stretch of a use case. However, the rendered HTML that you presented is not what I would expect the result to look like. Your example has a newline in the document being processed, so I would expect that to be translated to HTML as a <br /> (since the noParagraph flag was set):

First Line <a href="http://test.com">link1</a><br />
Second Line <a href="http://foo.com">link2</a>

Third, I would disagree that this is a specialized use case. It's simply a use case, and again, none of the above HTML "skips HTML conventions".

samukce commented 4 years ago

Gotcha, thanks for share @dustindclark . So, assuming that you do have this valid html block, where would you inputs this html block? In a <div> container for instance?

robinst commented 4 years ago

@dustindclark How about you raise a PR to implement it? I'm curious what it would look like, and if we have code with tests it's much easier to talk about. I'm not opposed to adding something for this. My main question is, with the new option enabled, how would this be rendered?:

line 1

line 2
velldes commented 8 months ago

You have: Node document = parser.parse("This is *Sparta*"); renderer.render(document); // "<p>This is <em>Sparta</em></p>\n" expected without paragraph and without new line renderer.render(document); // "This is <em>Sparta</em>"

robinst commented 4 days ago

I've raised a PR to add an omitSingleParagraphP option to HtmlRenderer.Builder for this, see: https://github.com/commonmark/commonmark-java/pull/340