HTML entity unescaping on parse can lead to unexpected results on serialize

ProseMirror / prosemirror-markdown

ProseMirror Markdown integration

https://prosemirror.net

MIT License

344 stars 81 forks source link

HTML entity unescaping on parse can lead to unexpected results on serialize #92

Closed segevfiner closed 1 year ago

segevfiner commented 1 year ago

markdown-it currently parses HTML entities into text, this leads to weird side effects on a parse/serialize cycle.

Not Bold will parse into Not Bold and serialize as Not Bold, which leads to it being bold in Markdown parsers that do honor HTML (Or a subset thereof).
&nbsp&nbspFoo parses as Foo and serializes as Foo, but will parse back as Foo, as Markdown trims normal spaces. The same applies to trailing &nbsp, which might be further confused as an hard break.
Typing < into prosemirror, serializes as < but parses as <.

All of these behaviors can be seen here https://prosemirror.net/examples/markdown/ and switching between Markdown and WYSIWYM.

marijnh commented 1 year ago

this leads to weird side effects on a parse/serialize cycle.

There is no guarantee of round-trips through parsing/serializing Markdown leaving the Markdown text unchanged.

That being said, I guess if you configure markdown-it to parse HTML tags, this can be problematic. But you'll need to configure your MarkdownSerializer for that anyway, and can use escapeExtraCharacters to backslash-escape & and <.

segevfiner commented 1 year ago

No need to keep it unchanged of course, but I'm trying not to change its actual meaning when parsed by parsers that do process HTML tags, rather than treat them as plain text. So stuff that was escaped before and needs to be escaped to avoid having special meaning needs to remain escaped.

marijnh commented 1 year ago

Have you tried escapeExtraCharacters? It'll only affect content that was parsed as regular text.

segevfiner commented 1 year ago

Since the HTML is parsed as regular text, this will mean any HTML tags that were not escaped in the original Markdown will become escaped on output, losing their meaning.

I think we might have to prevent unescaping of those characters at the markdown-it level to preserve them. e.g.

import type MarkdownIt from "markdown-it";
import type StateCore from "markdown-it/lib/rules_core/state_core";

const PRESERVE = new Set(["<", ">", "&", '"', "'"]);

function preserveSpecial(state: StateCore) {
  const blockTokens = state.tokens;

  for (let j = 0, l = blockTokens.length; j < l; j++) {
    if (blockTokens[j].type !== "inline") continue;

    // eslint-disable-next-line @typescript-eslint/no-non-null-assertion
    const tokens = blockTokens[j].children!;
    const max = tokens.length;

    for (let curr = 0; curr < max; curr++) {
      if (
        tokens[curr].type === "text_special" &&
        PRESERVE.has(tokens[curr].content)
      ) {
        tokens[curr].type = "text";
        tokens[curr].content = tokens[curr].markup;
      }
    }
  }
}

export default function preserveSpecialPlugin(md: MarkdownIt) {
  md.core.ruler.before("text_join", "preserve_entity", preserveSpecial);
}

marijnh commented 1 year ago

Since the HTML is parsed as regular text, this will mean any HTML tags that were not escaped in the original Markdown will become escaped on output, losing their meaning.

This can be said of any Markdown syntax extension that isn't turned on—that by normalizing the escapes we might be interfering with it. But accommodating for non-enabled features of any kind doesn't seem like a reasonable expectation of a markdown parser or serializer. So I think that adding this kind of extension to your configuration is your responsibility, if you want to treat HTML in this way (don't parse it, but preserve escapes).